cgroza / GraffiTE


🗞️ The GraffiTE paper is now out!

Description

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies or long read datasets and genotypes the discovered polymorphisms in read sets using a pangenomic approach. GraffiTE is developed by Cristian Groza and Clément Goubert in Guillaume Bourque's group at the Genome Centre of McGill University (Montréal, Canada). GraffiTE is based on the concept developed in Groza et al., 2022.

  1. First, each genome assembly or long read dataset is aligned to the reference genome with minimap2 (winnowmap is available as an alternative). For each sample considered, structural variants (SVs) are called with svim-asm if using assemblies or sniffles2 if using long reads, and only insertions and deletions relative to the reference genome are kept.

  2. Candidate SVs (INS and DEL) are scanned with RepeatMasker, using a user-provided library of repeats of interest (.fasta). SVs covered ≥80% by repeats are kept. At this step, target site duplications (TSDs) are searched for in SVs representing a single TE family.

  3. Each candidate repeat polymorphism is induced in a graph-genome where TEs and repeats are represented as bubbles, allowing reads to be mapped on either presence or absence alleles with Pangenie, Giraffe or GraphAligner.
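
The ≥80% repeat-coverage filter of step 2 can be illustrated with a minimal Python sketch. It assumes that repeat hits are given as 0-based, half-open intervals on the SV sequence and that overlapping hits are merged before coverage is measured. The function names and the exact way GraffiTE computes coverage are assumptions made for illustration, not taken from the pipeline's code.

```python
def repeat_coverage(sv_length, hits):
    """Return the fraction of an SV sequence covered by repeat hits.

    hits: iterable of (start, end) pairs, 0-based and half-open (assumed).
    Overlapping or adjacent hits are merged so no base is counted twice.
    """
    if sv_length <= 0:
        raise ValueError("sv_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(hits):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: bank it and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / sv_length


def passes_repeat_filter(sv_length, hits, min_fraction=0.8):
    """Keep an SV when repeats cover at least min_fraction of its length."""
    return repeat_coverage(sv_length, hits) >= min_fraction
```

For instance, two overlapping hits `(0, 50)` and `(40, 90)` on a 100 bp SV cover 90 bases once merged, so the SV would be kept under this reading of the filter.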


⚠️ Bug reports, issues, comments and suggestions are welcome in the Issues section of this GitHub repository.


Changelog

Last update: 11/07/24 | commit: 76537f9

Previous update: 10/22/24 | commit: 47ad044

Thank you @Han-Cao for submitting a pull request:

10/21/24 update:

- :beetle: bug fix: transform RepeatMasker coordinates from 1-based to 0-based in order to meet the bed format standard and measure accurate hit length. This fixes [issue #43](https://github.com/cgroza/GraffiTE/issues/43)
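
The coordinate change behind this fix can be sketched as follows. RepeatMasker reports 1-based, inclusive positions, while BED uses 0-based, half-open intervals, so only the start is shifted. The function names are hypothetical and this is an illustration of the convention, not the code from the pull request.

```python
def rm_to_bed(start_1based, end_1based):
    """Convert a 1-based, inclusive interval to BED (0-based, half-open)."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # Only the start moves; the inclusive end equals the exclusive BED end.
    return start_1based - 1, end_1based


def bed_length(bed_start, bed_end):
    """The length of a BED interval is simply end minus start."""
    return bed_end - bed_start
```

A hit spanning bases 1 to 300 becomes the BED interval `(0, 300)` with length 300; subtracting the untransformed coordinates would give 299 and undercount the hit by one base.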

06/24/24 update:

- New option `--break_scaffolds` (see [additional parameters](#additional-parameters)) that automatically splits contigs at runs of N > 4. With some scaffolded genomes, minimap2 can return an error related to a CIGAR string being too long, typically `[E::parse_cigar] CIGAR length too long at position ...`. Breaking scaffolds at N stretches typically solves this problem, which is caused by limitations of the `htslib`/SAM specification.
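
As a rough illustration of what breaking scaffolds at N stretches means, here is a small Python sketch. It reads "runs of N > 4" as runs of at least 5 consecutive N characters, which is an assumption, and the `_1`, `_2` naming of the resulting contigs is hypothetical rather than what GraffiTE writes.

```python
import re


def break_scaffold(name, sequence, min_run=5):
    """Split a scaffold at runs of at least min_run N/n characters.

    Returns a list of (contig_name, contig_sequence) pairs. The N runs
    themselves are dropped; shorter N runs are left inside the contigs.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_run)
    pieces = [piece for piece in gap.split(sequence) if piece]
    return [(f"{name}_{i}", piece) for i, piece in enumerate(pieces, start=1)]
```

With this sketch, `ACGTNNNNNGGCCNNNNTT` is split once, at the run of five Ns, while the run of four Ns stays inside the second contig.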

06/17/24 update:

- Added new/alternative compatible class names: MITE, TIR and IS, e.g. `>TEnameX#MITE`, `>TEnameY#TIR/Mariner` or `>TEnameX#IS`. In previous versions, TEs named with these classes were discarded by `OneCodeToFindThemAll`.
  - The compatible classes in the fasta header (i.e. `Class` in `>TEname#Class/Superfamily`) are: `LINE`, `LTR`, `SINE`, `RC/Helitron` (will be treated as `DNA/RC`), `DNA`, `TIR`, `MITE`, `Retroposon`, `IS`, `Unknown`, `Unspecified`.
  - TEs for which a classification is absent will be treated as `Unknown` (e.g. `>TEnameZ`).
  - All `>TEnames` and `Superfamily` values will be accepted as long as the `Class` name is among those supported.
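
To check a TE library against these rules before a run, something like the following Python sketch could be used. The function names are hypothetical, the class comparison is assumed to be case-sensitive, and the remapping of `RC/Helitron` to `DNA/RC` is not reproduced here; only the `Class` part of the header is checked.

```python
SUPPORTED_CLASSES = {
    "LINE", "LTR", "SINE", "RC", "DNA", "TIR", "MITE",
    "Retroposon", "IS", "Unknown", "Unspecified",
}


def parse_te_header(header):
    """Split '>TEname#Class/Superfamily' into (name, class, superfamily).

    A header without '#Class' is treated as class 'Unknown'.
    The superfamily is None when absent.
    """
    body = header.lstrip(">").split()[0]
    name, _, classification = body.partition("#")
    if not classification:
        return name, "Unknown", None
    te_class, _, superfamily = classification.partition("/")
    return name, te_class, superfamily or None


def is_supported(header):
    """True when the Class of the header is in the supported list."""
    return parse_te_header(header)[1] in SUPPORTED_CLASSES
```

For example, `>TEnameY#TIR/Mariner` parses to class `TIR` and superfamily `Mariner`, and `>TEnameZ` falls back to `Unknown`.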

02/13/24 update:

- After beta 0.2.5, versioning switched to commit IDs. Please refer to the commit ID of the version of GraffiTE you are using if you need support.
- :beetle: bug fix: the L1 inversion flag (`--mammal`) had recently stopped working. It has now been fixed.
- Winnowmap is now available as an alternative mapper to Minimap2. To enable Winnowmap, use the flag `--aligner winnowmap`; the default remains minimap2.

beta 0.2.5 (09-11-23):

- :beetle: bug fix: fix a VCF annotation issue that occurred when two distinct variants shared the same VCF POS field. Annotations are now distinct depending on the variant sequence.
- Cleanup of GraphAligner VCF outputs for clarity.

beta 0.2.4 (06-27-23):

- Refactored `GraffiTE` to use the DSL2 Nextflow syntax.

beta 0.2.3 (02-21-22):

- :new: feature: You can now perform the initial SV search from both assemblies and long reads together. The variants discovered with each method will be merged for filtering and genotyping.
- :new: parameters with defaults added to control time, CPU and memory for each process. This is useful to manage cluster requests when `-profile cluster` is used.
- :beetle: bug fix: merging of variants now only occurs for the same SVTYPE flag (INS or DEL).

beta 0.2.2 (02-01-22):

- :new: feature: adds `sniffles2` as an alternative to `svim-asm` in order to start the SV search from long reads (instead of a genomic assembly).
  - Using the parameter `--longreads` instead of `--assembly` (see inputs) will prompt `GraffiTE` to use `sniffles2`.
  - For now, the `svim-asm` and `sniffles2` pipelines are separate (either `--longreads` or `--assembly`). We will soon allow merging the findings of both callers before filtering for repeats.
- :new: feature: adds a divergence preset option to `minimap2` ahead of `svim-asm`. Use the flag `--asm_divergence `. The default is `asm5` (< 5% expected divergence between assembly and reference genome). [See minimap2 documentation](https://lh3.github.io/minimap2/minimap2.html).
- :new: `time`, `cpu` and `memory` directive options added to control the resources needed for each `GraffiTE` process. Useful to optimize scheduler requests while using the `cluster` profile of `GraffiTE`. See details here.

beta 0.2.1 (11-30-22):

- :new: feature: adds `--RM_vcf` and `--RM_dir` input options. These allow starting a run directly at the TSD search step by providing the VCF and `repeatmasker_dir` produced by the processes `repeatmasker` or `repeatmasker_fromVCF` (found in the output folder `2_Repeat_Filtering`). This is useful if a run crashed during any of the TSD search processes and the job is not recoverable by Nextflow. Providing `--RM_vcf` and `--RM_dir` will bypass SV calling with `minimap2/svim_asm` (`svim_asm` process) and the `repeatmasker/repeatmasker_fromVCF` processes.
- :beetle: bug fix: TSD search is now performed in batches of 100 variants, which reduces the number of temporary working directories by a factor of 100 (too many directories can exceed the storage inode quota). If more than 100 variants are present, TSDs will be searched in parallel batches (up to the number of available CPUs).
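
The batching idea can be pictured with a short Python sketch. It only shows how a list of variants is cut into groups of at most 100; the function name is hypothetical and the actual batching is done inside the Nextflow processes, not by this code.

```python
def batches(variants, size=100):
    """Yield consecutive lists of at most `size` variants."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(variants), size):
        yield variants[start:start + size]
```

With 250 variants this yields three batches (100, 100 and 50 variants), so three working directories would be needed instead of 250.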

beta 0.2 (11-11-22):

- :new: feature: adds two new read aligners: [`giraffe`](https://github.com/vgteam/vg#mapping) (short read optimized, also works with long reads) and [`graphAligner`](https://github.com/maickrau/GraphAligner) (long-read, error-prone compliant).
  - usage: `--graph_method [pangenie/giraffe/graphaligner]` default: `pangenie` (short accurate reads)
- :new: feature: adds `--vcf` input option: requires a sequence-resolved VCF (REF and ALT allele sequences in the VCF). Will bypass genome alignments and proceed with repeat annotations, TSD search, and read mapping (optional).
- :new: feature: adds `--graffite_vcf` input option: requires a VCF created by `GraffiTE` (in the outputs `3_TSD_search/pangenome.vcf`). Will skip all steps but read mapping.
- :beetle: bug fix: remove the dependency on `biomartr`

beta 0.1 (11-02-22):

- first release

Both the repository (git pull) and the image must be updated to see changes.


Workflow

Installation

Prerequisites

GraffiTE is a Nextflow pipeline, with all the dependencies wrapped in an Apptainer image. It is thus compatible with any Linux system including HPCs.

Note that we have received reports that installing Apptainer with Conda can cause issues. We recommend installing Apptainer directly.

GraffiTE install

apptainer remote add --no-login SylabsCloud cloud.sycloud.io
apptainer remote use SylabsCloud

Important note

We are aware of a common issue arising when the pipeline calls a temporary directory (/tmp). The most common symptom is that, although the program may complete without error, it skips over "tsd_search" and "tsd_report". The program will not produce a VCF file (3_TSD_search/pangenome.vcf) and the VCF in 2_Repeat_Filtering has no variants. While we will try to fix this in a future update, an easy fix is to amend the nextflow.config file as follows.

  1. Locate the file:

    • Either in ~/.nextflow/assets/cgroza/GraffiTE/nextflow.config
    • or in the cloned GitHub repository.
  2. Amend the file:

replace:

singularity.runOptions = '--contain'

with

singularity.runOptions = '--contain -B <path-to-writable-dir>/:/tmp'

replace <path-to-writable-dir> with any writable path on your host machine

Running GraffiTE

nextflow run cgroza/GraffiTE \
   --assemblies assemblies.csv \
   --TE_library library.fa \
   --reference reference.fa \
   --graph_method pangenie \
   --reads reads.csv

nextflow run <path-to-install>/GraffiTE/main.nf \
   --assemblies assemblies.csv \
   --TE_library library.fa \
   --reference reference.fa \
   --reads reads.csv [-with-singularity <your-path>/graffite_latest.sif]

As with any Nextflow pipeline, command line arguments for GraffiTE fall into two groups: pipeline-related arguments, prefixed with -- (such as --reference), and Nextflow-specific arguments, prefixed with - (such as -resume); see the Nextflow documentation.

A small test set is included in the test/human_test_set.tar.gz file. Download and decompress the file and run:

nextflow run https://github.com/cgroza/GraffiTE --reference hs37d5.chr22.fa --assemblies assemblies.csv --reads reads.csv --TE_library human_DFAM3.6.fasta

This will show a complete run of the GraffiTE pipeline, with the output stored in out.

Software Update

To make sure that you are running the latest version of GraffiTE, you can update the pipeline using the following command:

nextflow pull -r main https://github.com/cgroza/GraffiTE

In case nextflow returns the following error:

Checking https://github.com/cgroza/GraffiTE ...
cgroza/GraffiTE contains uncommitted changes -- cannot pull from repository

You'll need to first delete the cached, older version like so:

rm -rf ~/.nextflow/assets/cgroza/GraffiTE/
nextflow pull -r main https://github.com/cgroza/GraffiTE

Parameters

Input files

AND/OR

AND (always required)

Additional parameters

Pipeline Shortcuts

These parameters can be used to bypass different steps of the pipeline.

Process-specific parameters

SV detection with svim-asm (from assemblies)
SV detection with sniffles2 (from long reads)
SV Annotation (RepeatMasker)
Genotyping with Pangenie
Genotyping with Giraffe, GraphAligner and vg call

GraffiTE modes

In the main GraffiTE publication, we refer to different "modes" corresponding to the different combinations of assemblies, long reads (for discovery) and reads (for genotyping). The following table summarizes the arguments to use in order to replicate these modes. Please refer to the reads file description above for proper formatting.

| Mode | Arguments | Description |
|---|---|---|
| GT-sv | `--assemblies --genotype false` | pME discovery from assemblies |
| GT-sn | `--longreads --genotype false` | pME discovery from long reads |
| GT-svsn | `--assemblies --longreads --genotype false` | pME discovery from both assemblies and long reads |
| GT-sv-PG | `--assemblies --reads` | pME discovery from assemblies and genotyping from short reads with Pangenie |
| GT-sn-PG | `--longreads --reads` | pME discovery from long reads and genotyping from short reads with Pangenie |
| GT-svsn-PG | `--assemblies --longreads --reads` | pME discovery from both assemblies and long reads and genotyping from short reads with Pangenie |
| GT-sv-GA | `--assemblies --reads --graph_method graphaligner` | pME discovery from assemblies and genotyping from long reads with GraphAligner |
| GT-sn-GA | `--longreads --reads --graph_method graphaligner` | pME discovery from long reads and genotyping from long reads with GraphAligner |
| GT-svsn-GA | `--assemblies --longreads --reads --graph_method graphaligner` | pME discovery from both assemblies and long reads and genotyping from long reads with GraphAligner |
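
The naming scheme of the modes is regular enough to be decoded programmatically. The Python sketch below turns a mode name into the corresponding argument list; the helper names are hypothetical, and it assumes that `--assemblies`, `--longreads` and `--reads` each take a file path, as in the run commands shown earlier.

```python
def _required(value, label):
    """Return value, or fail clearly when a needed input file is missing."""
    if value is None:
        raise ValueError(f"this mode needs a file for --{label}")
    return value


def mode_arguments(mode, assemblies=None, longreads=None, reads=None):
    """Build the GraffiTE arguments for a mode name such as 'GT-svsn-GA'."""
    parts = mode.split("-")
    if len(parts) not in (2, 3) or parts[0] != "GT":
        raise ValueError(f"unrecognized mode: {mode}")
    discovery = parts[1]
    genotyper = parts[2] if len(parts) == 3 else None
    if discovery not in ("sv", "sn", "svsn") or genotyper not in (None, "PG", "GA"):
        raise ValueError(f"unrecognized mode: {mode}")

    args = []
    if discovery in ("sv", "svsn"):      # discovery from assemblies
        args += ["--assemblies", _required(assemblies, "assemblies")]
    if discovery in ("sn", "svsn"):      # discovery from long reads
        args += ["--longreads", _required(longreads, "longreads")]
    if genotyper is None:                # discovery only
        args += ["--genotype", "false"]
    else:
        args += ["--reads", _required(reads, "reads")]
        if genotyper == "GA":            # Pangenie is the default graph method
            args += ["--graph_method", "graphaligner"]
    return args
```

For example, `mode_arguments("GT-sn", longreads="longreads.csv")` gives `["--longreads", "longreads.csv", "--genotype", "false"]`, where `longreads.csv` is a placeholder file name.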

Nextflow parameters

Nextflow-specific parameters can be passed in addition to those presented above. These parameters can be distinguished by the use of a single -, such as -resume. See Nextflow documentation for more details.

Outputs

The results of GraffiTE will be produced in a designated folder with the option --out. The output folder contains up to 4 sub-folders (3 if --genotype false is set). Below is an example of the output folder using two alternative assemblies of the human chromosome 1 (maternal and paternal haplotypes of HG002) and two read-sets from HG002 for genotyping.

OUTPUT_FOLDER/
├── 1_SV_search
│   ├── HG002_mat.vcf
│   └── HG002_pat.vcf
├── 2_Repeat_Filtering
│   ├── genotypes_repmasked_filtered.vcf
│   └── repeatmasker_dir
│       ├── ALL.onecode.elem_sorted.bak
│       ├── indels.fa.cat.gz
│       ├── indels.fa.masked
│       ├── indels.fa.onecode.out
│       ├── indels.fa.out
│       ├── indels.fa.out.length
│       ├── indels.fa.out.log.txt
│       ├── indels.fa.tbl
│       ├── onecode.log
│       └── OneCode_LTR.dic
├── 3_TSD_search
│   ├── pangenome.vcf
│   ├── TSD_full_log.txt
│   └── TSD_summary.txt
└── 4_Genotyping
    ├── GraffiTE.merged.genotypes.vcf
    ├── HG002_s1_10X_genotyping.vcf.gz
    ├── HG002_s1_10X_genotyping.vcf.gz.tbi
    ├── HG002_s2_10X_genotyping.vcf.gz
    └── HG002_s2_10X_genotyping.vcf.gz.tbi

Note that intermediate files will be written in the ./work folder created by Nextflow. Each Nextflow process is run in a separate working directory. If an error occurs, Nextflow will point to the specific working directory. Moreover, it is possible to resume interrupted jobs if the ./work folder is intact and you use the same command, followed by the -resume option (with a single -). It is recommended to delete the ./work folder regularly to avoid storage issues (beyond disk space, it can accumulate a very large number of files over time). More info about Nextflow usage can be found here.

Output VCFs

GraffiTE outputs variants in the VCF 4.2 format. Additional fields are added to the INFO column of the VCF to annotate SVs containing TEs and other repeats. This applies to 3_TSD_search/pangenome.vcf, which contains only the list of variants without individual genotypes, and to 4_Genotyping/GraffiTE.merged.genotypes.vcf, which contains a genotype column for each read set.

1  33108378 HG002_pat.svim_asm.INS.206 T  TTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCACCAGACTGGAGTACAATGGCACAATCTCGGCTTACTGCAACTTCCGCCTCCTGGGTTCAAGCAATTCCCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGACGTGTGCCACCATGCCTGGCTAATTTTTTGTATTTTA
GCAGAGACGGAGTTTCACCATGTTGGCCAGGATGCTCTCAATCTCCTTACCTCATGATCCGCCAGCCTCGGCCTCCCAAAGTGCTGGGATTATTACAGGCATGAGCCACAGTCCCAGGTCTTTAGACAAACTCAACCCATTATCAATCAAAAAATGTTTAAATTCACTTATAGCATGGAAGCTACCCCACCCCTCCCCCCTCCCCCCTCCCGCCCCCCCCAGCTTTGAGTTGTCCCACCTTTCTGGACCAAAGCA ATGTATTTCTTAAACTTAATTGATTAATGTCTCATGCCTCTCTGAAATGTATAAAACCAAACTGTGCCCTGACCACCTTGGGCACACTGAGCACATGTTCTCAGGATCTCCAGAGGGCTGTGTCAGGGGCCATGGTCACATTTGGCTCAGAATACATCTCTTCAAATATTTTATAGAGTTCGACTATTTTGTCAACAATTAAAAAGGCACCTATTCAGAAT
ATTAAAAGTTAAGATTTAATAACATCAACAGTTCTTACTGATTCATCAAATATTTTTTTTTTTGAGACCGAGTCTCGCTCTATCGCCCAGGCTGGAGGGCAGTGGCACAATCTCTGTTCACTGCAACCTCCGCCTCCCGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCCGAATAGCTGGGACTACATGCGCGTGCCACCACGCCTGGCTAATTTTTGTATTTTTAGTAGAGACGGAGTTTCACAACGTTGGCCAGGATGGTCTCGATCCCTTGACCTCATGATCCGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCCACCGGCGCCTGGCCAAAACAAAA  .PASS K=301;MA=0;AF=0.5;AK=2,299;CIEND=0,1;CIPOS=0,0;CHR2=1;END=33108378;SVLEN=1002;SVMETHOD=SURVIVOR1.0.7;SVTYPE=INS;SUPP_VEC=11;SUPP=2;STRANDS=+-;n_hits=4;match_lengths=293,331,80,291;repeat_ids=AluSc8,MER4E1,Charlie1a,AluSc;
matching_classes=SINE/Alu,LTR/ERV1,DNA/hAT-Charlie,SINE/Alu;fragmts=1,1,1,1;RM_hit_strands=C,+,C,C;RM_hit_IDs=28269,28270,28271,28272;total_match_length=991;total_match_span=0.988036;mam_filter_1=None;mam_filter_2=None   GT:GQ:GL:KC 1/1:10000:-450.343,-147.4,0:4 1/1:10000:-450.343,-147.4,0:4

A more complex example with n_hits=4
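
The per-hit annotations in the INFO column are comma-separated lists with one entry per RepeatMasker hit, so they can be paired up with a few lines of Python. The sketch below uses only the field names visible in the record above (`n_hits`, `repeat_ids`, `matching_classes`, `match_lengths`, `RM_hit_strands`); the function names are hypothetical and no VCF library is involved.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict (flags without '=' map to True)."""
    fields = {}
    for item in info.strip().strip(";").split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def repeat_hits(info):
    """Return one (repeat_id, class, match_length, strand) tuple per hit."""
    fields = parse_info(info)
    n_hits = int(fields["n_hits"])
    ids = fields["repeat_ids"].split(",")
    classes = fields["matching_classes"].split(",")
    lengths = [int(x) for x in fields["match_lengths"].split(",")]
    strands = fields["RM_hit_strands"].split(",")
    if not (len(ids) == len(classes) == len(lengths) == len(strands) == n_hits):
        raise ValueError("per-hit fields do not all have n_hits entries")
    return list(zip(ids, classes, lengths, strands))
```

Applied to the record above, the second hit comes out as `("MER4E1", "LTR/ERV1", 331, "+")`.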

VCF column:

When using Giraffe and GraphAligner with vg call, the following fields are also present:

TSD module

For SVs with a single TE insertion detected (n_hits=1, and LINE1s with the flag mam_filter_1=5P_INV), target site duplications (TSDs) are searched for by comparing the flanking regions following this workflow:

The script also accounts for the presence of poly-A/T.

Mammalian filters --mammal

In order to account for the particularities of several TE families, we have introduced a --mammal flag that will search for specific features associated with mammalian TEs. So far we are accounting for two particular cases: 5' Inversion of L1 elements and VNTR polymorphism between orthologous SVA insertions. We will try to add more of these filters, for example to detect solo vs full-length LTR polymorphisms. If you would like to see more of these filters, please share your suggestions on the Issue page!

L1 5' inversion

SVs detected by GraffiTE that correspond to non-canonical TPRT (Target Primed Reverse Transcription), such as Twin Priming (see here and here), may be skipped by the TSD script because they artificially produce 2 hits instead of one for a single TE insert.

Whether the L1 is inserted on the + or - strand, a Twin-Primed L1 will show the same pattern with RepeatMasker:

This is because an inversion on a - strand feature will look like + on the consensus ((-)*(-) = (+), or a "reverted reverse").

However, we can differentiate the two based on the coordinates of the hit on the TE consensus (cartoon not to scale to compare two L1 insertions with the same consensus):

For each pair (C,+) of hits, we look at the target hit coordinates:

L1 inversions will be reported with the flag mam_filter_1=5P_INV in the INFO field of the VCFs.

VNTR polymorphisms in SVA elements


If GraffiTE detects:

The variant will be flagged with mam_filter_2=VNTR_ONLY:SVA_F:544:855 with SVA_F:544:855 varying according to the element family and VNTR region:

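| SVA model | VNTR period size | Repeat # | start | end |
|---|---|---|---|---|
| SVA_A | 37 | 10.5 | 436 | 855 |
| SVA_B | 37 | 10.8 | 431 | 867 |
| SVA_C | 37 | 10.5 | 432 | 851 |
| SVA_D | 37 | 6.4 | 432 | 689 |
| SVA_E | 37 | 10.8 | 428 | 864 |
| SVA_F | 37 | 10.5 | 435 | 857 |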
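
When post-processing the VCF, the `mam_filter_2` value can be split into its parts with a small helper. This Python sketch is based only on the `VNTR_ONLY:SVA_F:544:855` pattern shown above; the function name is hypothetical, and treating the two numbers as start and end coordinates of the match is an assumption.

```python
def parse_vntr_flag(value):
    """Parse 'VNTR_ONLY:SVA_F:544:855' into (family, start, end).

    Returns None for any other value, such as 'None'.
    """
    parts = value.split(":")
    if len(parts) != 4 or parts[0] != "VNTR_ONLY":
        return None
    family, start, end = parts[1], int(parts[2]), int(parts[3])
    return family, start, end
```

For the example flag this returns `("SVA_F", 544, 855)`, which can then be compared with the VNTR boundaries listed in the table.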

GraffiTE execution profiles

By default, the pipeline will inherit the Nextflow configuration and run accordingly. To execute locally, on SLURM, or on AWS, pass one of the -profile options provided with GraffiTE:

For example,

nextflow run cgroza/GraffiTE -profile cluster ...

will run on SLURM.

Specifying memory and CPU allocation at each step

You may alter the following parameters on the command line or in your own nextflow configuration file to change how many CPUs and how much memory will be required by each step.

The requirements are numbers or strings accepted by nextflow. For example, 40 for number of CPUs and '100G' for memory.

Resource usage examples:

| Model species | Ref genome size / TE content (%) | Input sample (if applicable) | Process | # of processes measured | CPUs (available) | RAM (peak) | Process run time |
|---|---|---|---|---|---|---|---|
| human | 3 Gbp / 50% | haploid assemblies | minimap2 | 2 | 40 | 46-47 GB | 38-49 min |
| | | 5X HiFi long reads | minimap2 | 1 | 40 | 65 GB | 20 min |
| | | 10X HiFi long reads | minimap2 | 1 | 40 | 78 GB | 39 min |
| | | 20X HiFi long reads | minimap2 | 1 | 40 | 99 GB | 1 h 14 min |
| | | 30X HiFi long reads | minimap2 | 1 | 40 | 109 GB | 1 h 48 min |
| | | VCF | RepeatMasker | 1 | 40 | 2 GB | 37 min |
| | | VCF | make_graph (vg) | 2 | 1 | 4-8 GB | 6 min |
| | | 30X HiFi long reads | GraphAligner | 2 | 40 | 124-125 GB | 4 h 1 min |
| | | 30X Illumina | Pangenie | 2 | 40 | 85-87 GB | 46 min |
| C. sativa | 740 Mbp / 70% | haploid assemblies | minimap2 | 9 | 40 | 56-118 GB | 12 min to 1 h 31 min |
| | | Long reads (pb, hifi or ONT) | minimap2 | 5 | 40 | 58-93 GB | 3 h 18 min to 9 h 24 min |
| | | VCF | RepeatMasker | 1 | 40 | 5 GB | 7 h 24 min |
| Z. mays | 2.4 Gbp / 85% | assembly | minimap2 | 1 | 40 | 143 GB | 74.4 min |
| | | 70X PacBio longreads | minimap2 | 1 | 40 | 57 GB | 13 h 42 min |
| | | VCF | RepeatMasker | 1 | 40 | 31 GB | 203 min |
| | | VCF | make_graph (vg) | 1 | 40 | 11 GB | 4 min |
| | | 70X PacBio longreads | GraphAligner | 1 | 40 | 66 GB | 11 h 25 min |
| | | Aligned GAM | vg call | 1 | 40 | 12 GB | 8 min |

Large, complex and highly repetitive genomes

The default parameters, in particular the requested RAM and execution time, may be insufficient for large, complex and repeat-rich genomes such as maize and other models. Nextflow's error messages may be hard to interpret and sometimes misleading with regard to the actual cause of the error. We advise users who suspect their model to be challenging for GraffiTE to initially request as many resources as necessary. So far, an upper bound of 120h and 400Gb per process (these being requested resources, not actual usage; for actual usage, see above) has been reported to allow successful runs with maize models, the most resource-intensive step being long-read alignment.

Known Issues / Notes / FAQ

Please use and abuse the Issues section of this GitHub page. With the userbase growing, it becomes more likely that someone else has already encountered a similar issue. If not, other users will benefit from your experience! We will try to respond swiftly to any help request, and the Issues page is the only place we actively monitor for user support. Thank you!

Cite

Groza, C., Chen, X., Wheeler, T.J. et al. A unified framework to analyze transposable element insertion polymorphisms using graph genomes. Nat Commun 15, 8915 (2024). https://doi.org/10.1038/s41467-024-53294-2