ποΈ The GraffiTE paper is now out!
GraffiTE
is a pipeline that finds polymorphic transposable elements in genome assemblies or long read datasets and genotypes the discovered polymorphisms in read sets using a pangenomic approach. GraffiTE
is developed by Cristian Groza and ClΓ©ment Goubert in Guillaume Bourque's group at the Genome Centre of McGill University (MontrΓ©al, Canada). GraffiTE
is based on the concept developped in Groza et al., 2022.
First, each genome assembly or long read dataset is aligned to the reference genome with minimap2
, alternatively, winnowmap
is available. For each sample considered, structural variants (SVs) are called with svim-asm
if using assemblies or sniffles2
if using long reads and only insertions and deletions relative to the reference genome are kept.
Candidate SVs (INS and DEL) are scanned with RepeatMasker
, using a user-provided library of repeats of interest (.fasta). SVs covered β₯80% by repeats are kept. At this step, target site duplications (TSDs) are searched for SVs representing a single TE family.
Each candidate repeat polymorphism is induced in a graph-genome where TEs and repeats are represented as bubbles, allowing reads to be mapped on either presence of absence alleles with Pangenie
, Giraffe
or GraphAligner
.
β οΈ Bug/issues as well as comments and suggestions are welcomed in the Issue section of this Github.
Last update: 11/07/24 | commit: 76537f9
--tsd_time
option to specify the time request for the TSD modules when using cluster
profile. Default remains 1h
. No need to update the image, simply pull this Github repository.Previous update: 10/22/24 | commit: 47ad044
Thank you @Han-Cao for submitting a pull request:
- :beetle: bug fix: transform RepeatMasker coordinates from 1-based to 0-based in order to meet the bed format standard and measure accurate hit length. This fixes [issue #43](https://github.com/cgroza/GraffiTE/issues/43)
- New option `--break_scaffolds` (see [additional parameters](#additional-parameters)) that automatically split contigs at runs of N > 4. With some scaffolded genomes, minimap2 can indeed return an error related to some CIGAR string being too long, typically `[E::parse_cigar] CIGAR length too long at position ...`. Breaking scaffolds at N stretches typicaly solve this problem, caused by limitations of the `htslib`/SAM specification.
- Added new/alternative compatible classes names: MITE, TIR and IS. e.g.: `>TEnameX#MITE` `>TEnameY#TIR/Mariner` or `>TEnameX#IS`. In previous versions, TE named with these classes were discarded by `OneCodeToFindThemAll` - The compatible classes in the fasta header includes (i.e. `Class` in `>TEname#Class/Superfamily`): `LINE`, `LTR`, `SINE`, `RC/Helitron` (will be treated as `DNA/RC`), `DNA`, `TIR`, `MITE`, `Retroposon`, `IS`, `Unknown`, `Unspecified` - TE for which a classification is absent will be treated as `Unknown` (e.g. `>TEnameZ`) - All `>TEnames` and `Superfamily` will be accepted as long as the `Class` name is among those supported.
- Since > beta 0.2.5 we switched versioning to commit id. Please refer to the commit ID of the version of GraffiTE you are using if you need support. - :beetle: bug fix: recently, the L1 inversion flag was not working (`--mammal`). It has now been fixed. - Winnowmap is now available as an alternative mapper instead of Minimap2. To enable Winnowmap, use the flag `--aligner winnowmap`; default remains minimap2.
- :beetle: bug fix: fix a VCF annotation issue that was happening when two distinct variants shared the same VCF POS field. Annotations are now distinct depending on the variant sequence. - cleanup GraphAligner VCF outputs for clarity.
- Refactored `GraffiTE` to use the DSL2 Nextflow syntax.
- :new: feature: You can now perform the initial SV search from both assemblies and long-read together. The variants discovered with each method will be merged together for the filtering and genotyping. - :new: parameters with defaults added to control time, cpu and memory for each process. This is useful to manage cluster requests when `-profile cluster` is used. - :beetle: bug fix: merging of variant now only occurs for the same SVTYPE flag (INS or DEL).
- :new: feature: adds `sniffles2` as an alternative to `svim-asm` in order to start SV search from long reads (instead of a genomic assembly).
- Using the parameter `--longreads` instead of `--assembly` (see inputs) will prompt `GraffiTE` to use `sniffles2`
- For now, `svim-asm` and `sniffles2` pipeline are separated (either `--longreads` or `--assembly`. We will soon allow to merge the findings of both callers before filtering for repeats.
- :new: feature: adds a divergence preset option to `minimap2` ahead of `svim-asm`. Use the flag `--asm_divergence
- :new: feature: adds `--RM_vcf` and `--RM_dir` input options. Allows to start a run directly at the TSD search step by providing the VCF and `repeatmasker_dir` produced by the processes `repeatmasker` or `repeatmasker_fromVCF` (found in the output folder `2_Repeat_Filtering`). This is useful if a run crashed during any of the TSD search processes and the job is not recoverable by Nextflow. Providing `--RM_vcf` and `--RM_dir` will bypass SV calling with `minimap2/svim_asm` (`svim_asm` process) and `repeatmasker/repeatmasker_fromVCF` processes. - :beetle: bug fix: TSD search is now performed by batches of 100 variants, which will reduce by a factor 100 the number of temporary working directories (which can cause storage to run over inodes' quota). If more than 100 variants are present, TSDs will be searched in parallel batches (up to the number of available CPUs).
- :new: feature: adds two new read aligners: [`giraffe`](https://github.com/vgteam/vg#mapping) (short read optimized, works also with long-reads) and [`graphAligner`](https://github.com/maickrau/GraphAligner) (long-read, error-prone compliant). - usage: `--graph_method [pangenie/giraffe/graphaligner]` default: `pangenie` (short accurate reads) - :new: feature: adds `--vcf` input option: requires a sequence resolved (REF and ALT allele sequences in VCF). Will bypass genome alignments and proceed with repeat annotations, TSD search, and reads mapping (optional). - :new: feature: adds `--graffite_vcf` input option: requires a VCF created by `GraffiTE` (in the outputs `3_TSD_search/pangemome.vcf`). Will skip all steps but read mapping. - :beetle: bug fix: remove the dependency to `biomartr`
- first release
It is required to update both the repository (git pull
) and image to see changes
GraffiTE
is a Nextflow
pipeline, with all the dependencies wrapped in an Apptainer
image. It is thus compatible with any Linux system including HPCs.
Note that we have received report that Apptainer installation with Conda can cause issues. We recommend to install Apptainer directly.
GraffiTE
pipeline and Apptainer image for local use. Later runs will skip the slow download step. It is however required to add the repository to the apptainer list, by typing:apptainer remote add --no-login SylabsCloud cloud.sycloud.io
apptainer remote use SylabsCloud
Alternatively, this repository can be cloned and the apptainer image downloaded at a specific location:
git clone https://github.com/cgroza/GraffiTE.git
apptainer remote add --no-login SylabsCloud cloud.sycloud.io
apptainer remote use SylabsCloud
apptainer pull --arch amd64 graffite_latest.sif library://cgroza/collection/graffite:latest
nextflow.config
from library://cgroza/collection/graffite:latest
to <your-path>/graffite_latest.sif
. Alternatively, the Nextflow
command -with-singularity <your-path>/graffite_latest.sif
can be used when running GraffiTE
(it will override the presets in nextflow.config
).We now provide a Docker image too. Apptainer can obtain it with:
apptainer pull graffite_latest.sif docker://cgroza/graffite
Of course, you can also reconfigure nextflow to use the Docker image without Apptainer if your system supports Docker.
We are aware of a common issue araising when the pipeline call a temporary directory (/tmp). The most common symptom is that though the program may complete without error, it skips over "tsd_search" and "tsd_report". The program will not produce a vcf file (3_TSD_search/pangenome.vcf
) and the vcf in 2_Repeat_Filter
has no variants. While we will try to fix this in a next update, an easy fix is to ammend the nextflow.config
file as follow.
Locate the file:
~/.nextflow/assets/cgroza/GraffiTE/nextflow.config
Ammend the file:
replace:
singularity.runOptions = '--contain'
with
singularity.runOptions = '--contain -B <path-to-writable-dir>/:/tmp'
replace
<path-to-writable-dir>
with any writable path on your host machine
GraffiTE
is as follow:nextflow run cgroza/GraffiTE \
--assemblies assemblies.csv \
--TE_library library.fa \
--reference reference.fa \
--graph_method pangenie \
--reads reads.csv
nextflow run <path-to-install>/GraffiTE/main.nf \
--assemblies assemblies.csv \
--TE_library library.fa \
--reference reference.fa \
--reads reads.csv [-with-singularity <your-path>/graffite_latest.sif]
As a
Nextflow
pipeline, commad line arguments forGraffiTE
can be distinguished between pipeline-related commands, prefixed with--
such as--reference
andNextflow
-specific commands, prefixed with-
such as-resume
(seeNextflow
documentation).
A small test set is included in the test/human_test_set.tar.gz
file.
Download and decompress the file and run:
nextflow run https://github.com/cgroza/GraffiTE --reference hs37d5.chr22.fa --assemblies assemblies.csv --reads reads.csv --TE_library human_DFAM3.6.fasta
This will show a complete run of the GraffiTE pipeline, with the output stored in out
.
To make sure that you are running the latest version of GraffiTE, you can update the pipeline using the following command:
nextflow pull -r main https://github.com/cgroza/GraffiTE
In case nextflow returns the following error:
Checking https://github.com/cgroza/GraffiTE ...
cgroza/GraffiTE contains uncommitted changes -- cannot pull from repository
You'll need to first delete the cached, older verison like so:
rm -rf ~/.nextflow/assests/cgroza/GraffiTE/
nextflow pull -r main https://github.com/cgroza/GraffiTE
--assemblies
: a CSV file that lists the genome assemblies and sample names from which polymorphisms are to be discovered. One assembly per sample and sample names must be unique. The header is required.
Example assemblies.csv
:
path,sample
/path/to/assembly/sampleA.fa,sampleA_name
/path/to/assembly/sampleB.fa,sampleB_name
/path/to/assembly/sampleZ.fa,sampleZ_name
With some scaffolded genomes, minimap2 can return an error related to some CIGAR string being too long, typically
[E::parse_cigar] CIGAR length too long at position ...
and would crash. We now provide the option--break_scaffolds
(false
by default) that automatically split contigs at runs of N > 4.
AND/OR
--longreads
: a CSV file that lists the longreads FASTQ, sample names, and type of longreads (hifi/pb/ont/lr:hq) from which polymorphisms are to be discovered. One FASTQ per sample and sample names must be unique. The header is required.
Example longreads.csv
:
path,sample,type
/path/to/reads/sampleA.fq.gz,sampleA_name,pb
/path/to/reads/sampleB.fq.gz,sampleB_name,hifi
/path/to/reads/sampleC.fq.gz,sampleC_name,lr:hq
/path/to/reads/sampleZ.fq.gz,sampleZ_name,ont
AND (always required)
--TE_library
: a FASTA file that lists the consensus sequences (models) of the transposable elements to be discovered. Must be compatible with RepeatMasker
, i.e. with header in the format: >TEname#class/superfamily
for example AluY#SINE/Alu
. The library can include a single repeat model or all the known repeat models of your species of interest.
class
in >TEname#class/superfamily
): LINE
, LTR
, SINE
, RC/Helitron
(will be treated as DNA/RC
), DNA
, TIR
, MITE
, Retroposon
, IS
, Unknown
, Unspecified
Unknown
(e.g. >TEnameZ
)>TEnames
and Superfamily
will be accepted as long as the Class
name is among those supported.Dfam.h5
or Dfam_curatedonly.h5
files) and use the tool FamDB to extract the consensus for your model: famdb.py -i <Dfam.h5> families -f fasta_name -a <taxa> --include-class-in-name > TE_library.fasta
> If the Repbase (or other library) format at hand is of the type
>TEname<tab/space>Classification<tab/space>Anything, you can use this
awkcommand to easily convert your library in a format compatible with GraffiTE:
awk '/>/ {print $1"#"$2; next} !/>/ {print $0}' library.fasta > library_OK.fasta`--reference
: a reference genome of the species being studied. All assemblies or long-reads in input are compared to this reference genome.
--graph_method
: can be pangenie
, giraffe
or graphaligner
, select which graph method will be used to genotyped TEs. Default is pangenie
and it is optimized for short-reads. giraffe
can handle both short and long reads, and graphaligner
is optimized for long reads.
Note that both
giraffe
andgraphaligner
will spawn a process calledgraphAlignReads
, whilepangenie
will spawn a process calledpangenie
.
--reads
: a CSV file that lists the read sets (FASTQ/FASTQ.GZ) and sample names from which polymorphisms are to be genotyped. These samples may be different than the genome assemblies. The header is required. Only one FASTQ/FASTQ.GZ per sample, and sample names must be unique. Paired-end reads must be concatenated into a single file (Pangenie
). In case --longreads
is used as input, the same table can be used for --longreads
and --reads
(but not the opposite: type
column is needed in --longreads
, optional for --reads
).
Example reads.csv
:
path,sample
/path/to/reads/sample1.fastq,sample1_name
/path/to/reads/sample2.fastq,sample2_name
/path/to/reads/sampleN.fastq,sampleN_name
or
path,sample,type
/path/to/reads/sampleA.fq.gz,sampleA_name,pb
/path/to/reads/sampleB.fq.gz,sampleB_name,hifi
/path/to/reads/sampleZ.fq.gz,sampleZ_name,ont
--out
: if you would like to change the default output directory (out/
).--genotype
: true or false. Use this if you would like to discover polymorphisms in assemblies but you would like to skip genotyping polymorphisms from reads.--tsd_win
: the length (in bp) of flanking region (5' and 3' ends) for Target Site Duplication (TSD) search. Default 30bp. By default, 30bp upstream and downstream each variant will be added to search for TSD. (see also TSD section)--cores
: global CPU parameter. Will apply the chosen integer to all multi-threaded processes. See here for more customization.--mammal
: Apply mammal-specific annotation filters (see Mammal filter section for more details).
C,+
strand pattern. --break_scaffolds
: true or false. Break input assemblies at runs of Ns. Use this if the assemblies passed with --assemblies
are scaffolded to avoid [E::parse_cigar] CIGAR length too long
error.These parameters can be used to bypass different steps of the pipeline.
--vcf
: a sequence resolved VCF containing both REF and ALT variants sequences. This option will bypass the SV discovery and will proceed to annotate and filter the input VCF for repeats and TSD, as well as genoyping (unless --genotype false
is set). The VCF must be biallelic (can be done with bcftools norm -m-
) and each variant must have a unique variant ID that is also a valid FASTA contig name (no >
character in variant ID). For those who wish to start directly from a graph genome in GFA format, we currently recommend to convert the graph to VCF using vg deconstruct
, then modify the VCF as described above.--RM_vcf
+--RM_dir
: bypasses SV discovery and filtering (RepeatMasker) and starts at the TSD search process. --RM_vcf
can be found in the outputs: 2_Repeat_Filtering/genotypes_repmasked_filtered.vcf
and --RM_dir
in 2_Repeat_Filtering/repeatmasker_dir
--graffite_vcf
: Use this if you already have a VCF file that was produced by GraffiTE (see output: 3_TSD_Search/pangenome.vcf
), or from a difference source and would like to use the graph genotyping step. The file must be a fully-phased VCF. Note that TE annotation won't be performed on this file (see --vcf
instead), and only genotyping will be performed.svim-asm
(from assemblies)--map_asm_threads
: number of cores allocated to minimap2
or winnowmap
in the map_asm
process. Overrides --cores
.
--map_asm_memory
: RAM limit for the minimap2
or winnowmap
in the map_asm
process. Default is unset.
--map_asm_time
: for cluster
profile, max time for the scheduler for the map_asm
process. Default is 1h.
--svim_asm_threads
: number of cores allocated to svim_asm
process. Overrides --cores
.
--svim_asm_memory
: RAM limit for the SV search in svim_asm
process. Default is unset.
--svim_asm_time
: for cluster
profile, max time for the scheduler for this process. Default is 1h.
--asm_divergence
: divergence preset option for minimap2
ahead of svim-asm
. Use the flag . asm5
/asm10
/asm20
Defaults is asm5
(< 5% expected divergence between assembly and reference genome). See minimap2 documentation.
--mini_K
: minimap2
parameter -K
. Number of bases loaded into memory to process in a mini-batch. Similar to option -I, K/M/G/k/m/g suffix is accepted. A large NUM helps load balancing in the multi-threading mode, at the cost of increased memory. Default 500M.
--stSort_m
: samtools sort
parameter -m
(for each alternative assembly, post-minimap2
): Approximately the maximum required memory per thread, specified either in bytes or with a K, M, or G suffix. Default in GraffiTE
is 4G.
--stSort_t
: samtools sort
parameter @
(for each alternative assembly, post-minimap2
): Set number of sorting and compression threads. Default in GraffiTE
is 4 threads.
sniffles2
(from long reads)--map_longreads_threads
: number of cores allocated to minimap2
or winnowmap
in the map_longreads
process. Overrides --cores
.--map_longreads_memory
: RAM limit for the minimap2
or winnowmap
in the map_longreads
process. Default is unset.--map_longreads_time
: for cluster
profile, max time for the scheduler for the map_longreads
process. Default is 1h.
---sniffles_threads
: number of minimap2
threads (parameter -t
in minimap2
). Overrides --cores
.
---sniffles_memory
: RAM limit for the SV search (minimap2
+sniffles2
) process. Default is unset.
---sniffles_time
: for cluster
profile, max time for the scheduler for this process. Default is 2h.--stSort_m
: samtools sort
parameter -m
(for each long-read alignment, post-minimap2
): Approximately the maximum required memory per thread, specified either in bytes or with a K, M, or G suffix. Default in GraffiTE
is 4G.--stSort_t
: samtools sort
parameter @
(for each long-read alignment, post-minimap2
): Set number of sorting and compression threads. Default in GraffiTE
is 4 threads. --repeatmasker_threads
: number of RepeatMasker threads. Overrides --cores
.--repeatmasker_memory
: RAM limit for the RepeatMasker (annotation) process. Default is unset.--repeatmasker_time
: for cluster
profile, max time for the scheduler for this process. Default is 2h.--pangenie_threads
: number of Pangenie
threads. Overrides --cores
.--pangenie_memory
: RAM limit for the Pangenie (genotyping) process. Default is unset.--pangenie_time
: for cluster
profile, max time for the scheduler for this process. Default is 2h.vg call
--make_graph_threads
: threads for creating the graph with vg autoindex
(Giraffe) or vg construct
(GraphAligner). Default is 1.
--make_graph_memory
: RAM limit for creating the graph with vg autoindex
(Giraffe) or vg construct
(GraphAligner). Default is unset.
--graph_align_theads
: threads for aligning reads to the graph with vg giraffe
or GraphAligner
. Default is 1.
--graph_align_memory
: RAM limit for aligning reads to the graph with vg giraffe
or GraphAligner
. Default is unset.
--graph_align_time
: for cluster
profile, max time for the scheduler for this process. Default is 12h.
--vg_call_threads
: threads for calling genotypes with vg call
on graph alignments. Default is 1.
--vg_call_memory
: RAM limit for calling genotypes with vg call
on graph alignments. Default is unset.
--min_mapq
: Minimum mapping quality to consider when counting read depth on nodes. Default is 0.
--min_support
: Minimum required read depth on allele,bubble
to consider for genotyping. The first number is the minimum read depth on allele, and the second is the minimum depth on the entire bubble/locus. Default is 2,4
.
In the main publication of GraffiTE, we refer to different "modes" relative to the different combination of assembly, long-reads (for discovery) and reads (for genotyping). The following table recapitulate the arguments to use in order to repliate these modes. Please refer the the reads file description above for proper formating.
Mode | Arguments | Description |
---|---|---|
GT-sv | --assemblies |
pME discovery from assemblies |
GT-sn | --longreads |
pME discovery from long reads |
GT-svsn | --assemblies |
pME discovery from both assemblies and long reads |
GT-sv-PG | --assemblies |
pME discovery from assemblies and genotyping from short reads with Pangenie |
GT-sn-PG | --longreads |
pME discovery from long reads and genotyping from short reads with Pangenie |
GT-svsn-PG | --assemblies |
pME discovery from both assemblies and short reads and genotyping from short reads with Pangenie |
GT-sv-GA | --assemblies |
pME discovery from assemblies and genotyping from long reads with GraphAligner |
GT-sn-GA | --longreads |
pME discovery from long reads and genotyping from long reads with GraphAligner |
GT-svsn-GA | --assemblies |
pME discovery from both assemblies and short reads and genotyping from long reads with GraphAligner |
Nextflow
parametersNextflow
-specific parameters can be passed in addition to those presented above. These parameters can be distinguished by the use of a single -
, such as -resume
. See Nextflow
documentation for more details.
-resume
: if nothing is changed in the command line and the /work
folder created by Nextflow
, the pipeline will resume after the last chached process.-with-singularity
: if a local apptainer image is used, this parameter will override the default image path given in nextflow.config
.-with-report report.html
: for a Nextflow report on resource usage to help tune the CPU and memory parameters for your genome/species.The results of GraffiTE
will be produced in a designated folder with the option --out
. The output folder contains up to 4 sub-folders (3 if --genotype false
is set). Below is an example of the output folder using two alternative assemblies of the human chromosome 1 (maternal and paternal haplotypes of HG002) and two read-sets from HG002 for genotyping.
OUTPUT_FOLDER/
βββ 1_SV_search
βΒ Β βββ HG002_mat.vcf
βΒ Β βββ HG002_pat.vcf
βββ 2_Repeat_Filtering
βΒ Β βββ genotypes_repmasked_filtered.vcf
βΒ Β βββ repeatmasker_dir
βΒ Β βββ ALL.onecode.elem_sorted.bak
βΒ Β βββ indels.fa.cat.gz
βΒ Β βββ indels.fa.masked
βΒ Β βββ indels.fa.onecode.out
βΒ Β βββ indels.fa.out
βΒ Β βββ indels.fa.out.length
βΒ Β βββ indels.fa.out.log.txt
βΒ Β βββ indels.fa.tbl
βΒ Β βββ onecode.log
βΒ Β βββ OneCode_LTR.dic
βββ 3_TSD_search
βΒ Β βββ pangenome.vcf
βΒ Β βββ TSD_full_log.txt
βΒ Β βββ TSD_summary.txt
βββ 4_Genotyping
βββ GraffiTE.merged.genotypes.vcf
βββ HG002_s1_10X_genotyping.vcf.gz
βββ HG002_s1_10X_genotyping.vcf.gz.tbi
βββ HG002_s2_10X_genotyping.vcf.gz
βββ HG002_s2_10X_genotyping.vcf.gz.tbi
1_SV_search/
[assembly_name].vcf
with [assembly_name]
as set in the file assemblies.csv
2_Repeat_Filtering/
genotypes_repmasked_filtered.vcf
a vcf file with the merged variants detected in each alternative assembly. The merge is made with SURVIVOR
with the parameters SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100
. Details about the vcf annotation can be found in the VCF section of the manual. This VCF contains only variants for witch repeats in the --TE_library
file span more than 80% of the sequence (from 1 or more repeat models).repeatmasker_dir/
:
indels.fa.*
: RepeatMasker
output files. indels.fa
represents all SV sequences queried to RepeatMasker
. See the RepeatMasker documentation for more information. ALL.onecode.elem_sorted.bak
: original OneCodeToFindThemAll
outputs. see here fore more details.OneCode_LTR.dic
: OneCodeToFindThemAll
LTR dictionary automatically produced from --TE_library
see here fore more details.onecode.log
: log file for OneCodeToFindThemAll
process.3_TSD_Search/
(see TSD section)
pangenome.vcf
final VCF containing all retained repeat variants and annotations (with TSD if passing the TSD filters). This file is used later by Pangenie
,Giraffe
or graphAligner
to create the genome-graph onto which reads are mapped for genotyping. (example here). Can be re-used for genotyping only with --graffite_vcf pangenome.vcf
TSD_summary.txt
: tab delimited output of the TSD search module. 1 line per variant. See TSD section for more information. "PASS" entries are reported in the pangenie.vcf
and final (with genotypes) VCF.TSD_full_log.txt:
detailed (verbose rich) report of TSD search for each SV (see TSD section).4_Genotyping/
GraffiTE.merged.genotypes.vcf
: final mutli-sample VCF with the genotypes for each sample present in the --reads
file. See VCF section for more details.*.vcf.gz
individual genotypes (do not contain TE annotation)*.vcf.gz.tbi
index for individual VCFs.Note that intermediate files will be written in the
./work
folder created byNextflow
. EachNextflow
process is run in a separate working directory. If an error occurs,Nextflow
will points to the specific working directory. Moreover, it is possible to resume interrupted jobs if the./work
folder is intact and you use the same command, plus the-resume
(1 single-
) tag after your command. It is recommended to delete the./work
folder regularly to avoid storage issues (more than space, it can aggregate a LOT of files through time). More info aboutNextflow
usage can be found here.
GraffiTE
outputs variants in the VCF 4.2 format. Additional fields are added in the INFO column of the VCF to annotate SVs containing TEs and other repeats (3_TSD_Search/pangenie.vcf
[do not contain individual genotypes, only the list of variants] and 4_Genotyping/GraffiTE.merged.genotypes.vcf
which contains a genotype column for each reads-set).
3_TSD_Search/pangenie.vcf
1 8501990 HG002_mat.svim_asm.INS.94 T TCAATACACACACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCGCCCAGGCCGGACTGCGGACTGCAGTGGCGCAATCTCGGCTCACTGCAAGCTCCGCTTCCCGGGTTCACGCCATTCTCCTGCCTCAGCCTCCCCAGTAGCTGGGACTACAGGCGCCCGCCACCGCGCCCGGCTAATTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCGTGTTAGCCAGGATGGTCTCGATCTCCTGACCTCATGATCCACCCGCCTCGGCCTCCCAAAGTGCTGGGACTACAGGCGTGAGCCACCGCGCCCGGC . PASS SUPP=1;SUPP_VEC=10;SVLEN=345;SVTYPE=INS;SVMETHOD=SURVIVOR1.0.7;CHR2=1;END=8501990;CIPOS=0,0;CIEND=0,0;STRANDS=+-;n_hits=1;fragmts=1;match_lengths=316;repeat_ids=AluYb9;matching_classes=SINE/Alu;RM_hit_strands=C;RM_hit_IDs=15016;total_match_length=316;total_match_span=0.913295;mam_filter_1=None;mam_filter_2=None;TSD=AATACACACACTTTTT,AATACACACACTTTTT GT 1|0
An example of AluYb9 insertion relative to the reference genome (hg19 was used for this example). The genotype is always heterozygous in order to create both allele in the graph used for genotyping
4_Genotyping/GraffiTE.merged.genotypes.vcf
1 8501990 HG002_mat.svim_asm.INS.94 T TCAATACACACACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCGCCCAGGCCGGACTGCGGACTGCAGTGGCGCAATCTCGGCTCACTGCAAGCTCCGCTTCCCGGGTTCACGCCATTCTCCTGCCTCAGCCTCCCCAGTAGCTGGGACTACAGGCGCCCGCCACCGCGCCCGGCTAATTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCGTGTTAGCCAGGATGGTCTCGATCTCCTGACCTCATGATCCACCCGCCTCGGCCTCCCAAAGTGCTGGGACTACAGGCGTGAGCCACCGCGCCCGGC . PASS UK=51;MA=0;AF=0.5;AK=13,38;CIEND=0,0;CIPOS=0,0;CHR2=1;END=8501990;SVLEN=345;SVMETHOD=SURVIVOR1.0.7;SVTYPE=INS;SUPP_VEC=10;SUPP=1;STRANDS=+-;n_hits=1;match_lengths=316;repeat_ids=AluYb9;matching_classes=SINE/Alu;fragmts=1;RM_hit_strands=C;RM_hit_IDs=15016;total_match_length=316;total_match_span=0.913295;mam_filter_1=None;mam_filter_2=None;TSD=AATACACACACTTTTT,AATACACACACTTTTT GT:GQ:GL:KC 0/1:10000:-81.8909,0,-64.99:7 0/1:10000:-81.8909,0,-64.99:7
An example of AluYb9 insertion relative to the reference genome (hg19 was used for this example). Genotypes are based on read mapping for each individual.
1 33108378 HG002_pat.svim_asm.INS.206 T TTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCACCAGACTGGAGTACAATGGCACAATCTCGGCTTACTGCAACTTCCGCCTCCTGGGTTCAAGCAATTCCCCTGCCTCAGCCTCCTGAGTAGCTGGGATTACAGACGTGTGCCACCATGCCTGGCTAATTTTTTGTATTTTA
GCAGAGACGGAGTTTCACCATGTTGGCCAGGATGCTCTCAATCTCCTTACCTCATGATCCGCCAGCCTCGGCCTCCCAAAGTGCTGGGATTATTACAGGCATGAGCCACAGTCCCAGGTCTTTAGACAAACTCAACCCATTATCAATCAAAAAATGTTTAAATTCACTTATAGCATGGAAGCTACCCCACCCCTCCCCCCTCCCCCCTCCCGCCCCCCCCAGCTTTGAGTTGTCCCACCTTTCTGGACCAAAGCA ATGTATTTCTTAAACTTAATTGATTAATGTCTCATGCCTCTCTGAAATGTATAAAACCAAACTGTGCCCTGACCACCTTGGGCACACTGAGCACATGTTCTCAGGATCTCCAGAGGGCTGTGTCAGGGGCCATGGTCACATTTGGCTCAGAATACATCTCTTCAAATATTTTATAGAGTTCGACTATTTTGTCAACAATTAAAAAGGCACCTATTCAGAAT
ATTAAAAGTTAAGATTTAATAACATCAACAGTTCTTACTGATTCATCAAATATTTTTTTTTTTGAGACCGAGTCTCGCTCTATCGCCCAGGCTGGAGGGCAGTGGCACAATCTCTGTTCACTGCAACCTCCGCCTCCCGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCCGAATAGCTGGGACTACATGCGCGTGCCACCACGCCTGGCTAATTTTTGTATTTTTAGTAGAGACGGAGTTTCACAACGTTGGCCAGGATGGTCTCGATCCCTTGACCTCATGATCCGCCTGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCCACCGGCGCCTGGCCAAAACAAAA .PASS K=301;MA=0;AF=0.5;AK=2,299;CIEND=0,1;CIPOS=0,0;CHR2=1;END=33108378;SVLEN=1002;SVMETHOD=SURVIVOR1.0.7;SVTYPE=INS;SUPP_VEC=11;SUPP=2;STRANDS=+-;n_hits=4;match_lengths=293,331,80,291;repeat_ids=AluSc8,MER4E1,Charlie1a,AluSc;
matching_classes=SINE/Alu,LTR/ERV1,DNA/hAT-Charlie,SINE/Alu;fragmts=1,1,1,1;RM_hit_strands=C,+,C,C;RM_hit_IDs=28269,28270,28271,28272;total_match_length=991;total_match_span=0.988036;mam_filter_1=None;mam_filter_2=None GT:GQ:GL:KC 1/1:10000:-450.343,-147.4,0:4 1/1:10000:-450.343,-147.4,0:4
A more complex example with
n_hit=4
VCF column:
(1) CHROM
: chromosome/scaffold/contig (2) POS
: position (in bp) of the SV start, relative to the reference genome(3) ID
: variant name(4) REF
: reference allele(5) ALT
: alternative allele(6) QUAL
: not used(7) FILTER
: currently not used. "PASS" is used by default but does not inform about variant quality (for now!)(8) INFO
:
UK
(4_Genotyping/GraffiTE.merged.genotypes.vcf
only): [Pangenie
] Total number of unique kmersMA
(4_Genotyping/GraffiTE.merged.genotypes.vcf
only): [Pangenie
] Number of alleles missing in panel haplotypesAF
(4_Genotyping/GraffiTE.merged.genotypes.vcf
only): [Pangenie
] Allele FrequencyAK
(4_Genotyping/GraffiTE.merged.genotypes.vcf
only): [Pangenie
] Number of unique kmers per allele. Will be -1 for alleles not covered by any input haplotype pathCIEND
(ignore)CIPOS
(ignore)CHR2
(ignore)END
: End position of the SV on the reference genomeSVLEN
: Length of the SV (bp), can be negativeSVMETHOD=SURVIVOR1.0.7;
(ignore)SVTYPE
: Type of SV (can be INS or DEL)SUPP_VEC
: Support Vector from SURVIVOR (merge of individual loci). SUPP_VEC=01 means two alternative assemblies were used, the SV is absent from the first one and present in the second one.SUPP
: Number of assemblies with the variantSTRANDS=+-;
(ignore)n_hits
: number of distinct RepeatMasker hits on the SVmatch_lengths
: length of each RepeatMasker hit. If n_hits
> 1, lengths of each hit are comma separatedrepeat_ids
: target name of each RepeatMasker hit. If n_hits
> 1, names for each hit are comma separatedmatching_classes
: classification of each RepeatMasker hit. If n_hits
> 1, classification for each hit are comma separatedfragmts
: number of fragments stitched together for each RepeatMasker hit. If n_hits
> 1, the number of stitched fragments for each hit are comma separatedRM_hit_strands
: strands for each RepeatMasker hit. If n_hits
> 1, the strands of each hit are comma separated. Can be +
or C
(complement)RM_hit_IDs
: unique RepeatMasker hit ID (last column of the .out
file of repeatmasker). If n_hits
> 1, hit IDs are comma separated. Fragments stitched with OneCodeToFindThemAll
are shown separated with /
.total_match_length
: total number of bp covered by repeats in the SVtotal_match_span
: proportion of the SV covered by repeats (minimum is 0.8)mam_filter_1
: 5P_INV
will be shown if the SV is a LINE1 with a 5' inversion; Null otherwise; (only present if --mammal
is set)mam_filter_2
: SVA_VNTR
if the SV is a length polymorphism of the VNTR region of an SVA element; Null otherwise; (only present if --mammal
is set)TSD
: Target Site Duplication (left_TSD,right_TSD); only present if TSD passes filters (see TSD section)(9) FORMAT
and (10) GENOTYPE
GT
: Genotype (0=reference allele, 1=alternative allele, .=missing)GQ
: (4_Genotyping/GraffiTE.merged.genotypes.vcf
only): [Pangenie
] Genotype quality: phred scaled probability that the genotype is wrong.GL
: (4_Genotyping/GraffiTE.merged.genotypes.vcf
only): [Pangenie
] Comma-separated log10-scaled genotype likelihoods for absent, heterozygous, homozygous.KC
: (4_Genotyping/GraffiTE.merged.genotypes.vcf
only): [Pangenie
] Local kmer coverage.When using Giraffe
and GraphAligner
with vg call
, the following fields are also present:
AT
: Allele traversal as path in graphDP
: Total DepthAD
: Allelic depths for the ref and alt alleles in the order listed">MAD
: Minimum site allele depthGL
: Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidyGQ
: Genotype Quality, the Phred-scaled probability estimate of the called genotypeGP
: Genotype Probability, the log-scaled posterior probability of the called genotypeXD
: eXpected Depth, background coverage as used for the Poisson modelFor SVs with a single TE insertion detected (n_hits=1
, and LINE1s with the flag mam_filter_1=5P_INV
) target site duplication are searched by comparing the flanking regions following this workflow:
blastn
to align with a seed of 4 bp RepeatMasker
, "Ns" nucleotides)The script also account for the presence of poly-A/T
TSD_summary.txt
output file (The header is not present in the real file).
SV_name RM_family_name RM_hit_strand RM_hit_divergence TSD_length Mismatches Gaps 5P_TSD_end 5P_offset 3P_TSD_start 3P_offset 5P_TSD 3P_TSD FILTER
HG002_mat.svim_asm.DEL.1014 AluY C 2.2 10 0 0 -1 0 1 0 ATTATTATTA ATTATTATTA PASS
HG002_mat.svim_asm.DEL.1013 L1HS C 1.3 16 0 0 -15 3 1 0 AGTATTCTGGATTTTT AGTATTCTGGATTTTT FAIL
G002_mat.svim_asm.DEL.1015 L1HS + 1.0738 4 0 0 -9 0 1 0 AAAG AAAG FAIL
HG002_mat.svim_asm.DEL.102 AluYa5 C 0.3 11 0 0 -1 0 1 0 CTGCATACTTT CTGCATACTTT PASS
HG002_mat.svim_asm.DEL.1011 L1P2 C 6.9 4 0 0 -21 0 1 0 CATC CATC FAIL
HG002_mat.svim_asm.DEL.1005 AluY C 1.0 12 0 0 -1 0 1 0 CCAGAAGTCTTT CCAGAAGTCTTT PASS
HG002_mat.svim_asm.DEL.1010 AluYh3 + 2.4 12 0 0 -1 0 1 0 AATTTCTATCTC AATTTCTATCTC PASS
TSD_full_log.txt:
detailed (verbose rich) report of TSD search for each SV.
--- TSD search for HG002_mat.svim_asm.DEL.1014 ---
L 5P_end ACAGGCGTGAGCCTCCACGCCTGGCCTAGATATTATTATTATTATTATTA
1 5 10 15 20 25 30 35 40 45 50 R 3P_end ATTATTATTAACCTATTTTACAGATGAGGG 1 5 10 15 20 25 30 35 40 45 50
3' poly_A: element is in C orientation, will not search for poly_A 5' poly_T: 0 bp, will not remove anything for alignment
Building a new DB, current time: 11/02/2022 22:27:12 New DB name: /scratch/cgoubert/GraffiTE/work/d1/3d8805a29e13fad52ed5aa1e7a9e76/L.short.fasta New DB title: L.short.fasta Sequence type: Nucleotide Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 1 sequences in 0.000507116 seconds.
candidate hits from blastn: R|3P_end L|5P_end 100.000 10 0 0 1 10 41 50 0.001 19.6 R|3P_end L|5P_end 100.000 4 0 0 1 4 47 50 3.1 8.5 R|3P_end L|5P_end 100.000 10 0 0 1 10 38 47 0.001 19.6 R|3P_end L|5P_end 100.000 10 0 0 1 10 35 44 0.001 19.6 R|3P_end L|5P_end 100.000 10 0 0 1 10 32 41 0.001 19.6 R|3P_end L|5P_end 100.000 8 0 0 3 10 31 38 0.018 15.9 R|3P_end L|5P_end 100.000 4 0 0 12 15 25 28 3.1 8.5 R|3P_end L|5P_end 87.500 8 0 1 14 20 37 44 3.1 8.5 R|3P_end L|5P_end 87.500 8 0 1 14 20 31 38 3.1 8.5 R|3P_end L|5P_end 100.000 4 0 0 20 23 1 4 3.1 8.5 R|3P_end L|5P_end 100.000 4 0 0 22 25 28 31 3.1 8.5 R|3P_end L|5P_end 100.000 4 0 0 25 28 8 11 3.1 8.5
candidate TSDs: ACAGGCGTGAGCCTCCACGCCTGGCCTAGATATTATTATTATTATTATTA[ <<< AluY C <<< ]ATTATTATTAACCTATTTTACAGATGAGGG βΎβΎβΎβΎβΎβΎβΎβΎβΎβΎ βΎβΎβΎβΎβΎβΎβΎβΎβΎβΎ
PASS
3' end: nothing to extend 5' end: nothing to extend SVname TEname Strand Div AlnLen MM Gaps 5P_TSD_end 5P_offset 3P_TSD_start 3P_offset 5P_TSD 3P_TSD HG002_mat.svim_asm.DEL.1014 AluY C 2.2 10 0 0 -1 0 1 0 ATTATTATTA ATTATTATTA PASS
--mammal
In order to account for the particularities of several TE families, we have introduced a --mammal
flag that will search for specific features associated with mammalian TEs. So far we are accounting for two particular cases: 5' Inversion of L1 elements and VNTR polymorphism between orthologous SVA insertions. We will try to add more of these filters, for example to detect solo vs full-length LTR polymorphisms. If you would like to see more of these filters, please share your suggestions on the Issue page!
SV detected by GraffiTE and corresponding to non-canonical TPRT (Target Primed Reverse Transcription), such as Twin Priming (see here and here) may be skipped by the TSD script because it artificially creates 2 hits instead of one for a single TE insert.
Whether or not the L1 is inserted on the + or - strand, at Twin-Primed L1 will have the same pattern with RepeatMasker:
This is because an inversion on the - strand feature will look like + on the consensus (
(-)*(-) = (+)
or a "reverted reverse")
However, we can differentiate the two based on the coordinates of the hit on the TE consensus (cartoon not to scale to compare two L1 insertions with the same consensus):
For each pair (C,+) of hits, we look at the target hit coordinates:
L1 inversions will be reported with the flag mam_filter_1=5P_INV
in the INFO field of the VCFs.
If GraffiTE
detects:
The variant will be flagged with mam_filter_2=VNTR_ONLY:SVA_F:544:855
with SVA_F:544:855
varying according to the element family and VNTR region:
SVA model | VNTR period size | Repeat # | start | end |
---|---|---|---|---|
SVA_A | 37 | 10.5 | 436 | 855 |
SVA_B | 37 | 10.8 | 431 | 867 |
SVA_C | 37 | 10.5 | 432 | 851 |
SVA_D | 37 | 6.4 | 432 | 689 |
SVA_E | 37 | 10.8 | 428 | 864 |
SVA_F | 37 | 10.5 | 435 | 857 |
GraffiTE
execution profilesBy default, the pipeline will inherit the nextflow
configuration and run accordingly.
To execute locally, on SLURM, or AWS, pass one of the -profile
provided with the GraffiTE
:
standard
cluster
cloud
For example,
nextflow run cgroza/GraffiTE -profile cluster ...
will run on SLURM.
You may alter the following parameters on the command line or in your own nextflow
configuration file to change how many CPUs and how much memory will be required by each step.
Step 1, polymorphisms discovery. The memory requirement depends on the genome size of the species. More cores is faster.
params.svim_asm_memory
params.svim_asm_threads
Step 2, merging polymorphisms. The requirements depends on the number of assemblies.
params.make_vcf_memory
params.make_vcf_threads
Step 3, genotyping polymorphisms from reads. The memory requirements depend on the genome size and size of the read sets. More cores is faster.
params.pangenie_memory
params.pangenie_threads
The requirements are numbers or strings accepted by nextflow
. For example, 40 for number of CPUs and '100G' for memory.
Model species | Ref genome size / TE content (%) | Input sample (if applicable) | process | # of processes measured | CPUs (available) | RAM (peak) | Process run time |
---|---|---|---|---|---|---|---|
human | 3 Gbp / 50% | haploid assemblies | minimap2 | 2 | 40 | 46-47Gb | 38-49 mn |
5X HiFi long reads | minimap2 | 1 | 40 | 65Gb | 20 mn | ||
10X HiFi long reads | minimap2 | 1 | 40 | 78Gb | 39 mn | ||
20X HiFi long reads | minimap2 | 1 | 40 | 99Gb | 1h 14 mn | ||
30X HiFi long reads | minimap2 | 1 | 40 | 109Gb | 1h 48 mn | ||
VCF | RepeatMasker | 1 | 40 | 2Gb | 37 mn | ||
VCF | make_graph (vg) | 2 | 1 | 4-8Gb | 6 mn | ||
30X HiFi long reads | GraphAligner | 2 | 40 | 124-125Gb | 4h 1 mn | ||
30X Illumina | Pangenie | 2 | 40 | 85-87Gb | 46 mn | ||
C. sativa | 740 Mbp / 70% | haploid assemblies | minimap2 | 9 | 40 | 56-118Gb | 12 mn-1h 31 mn |
Long reads (pb, hifi or ONT) | minimap2 | 5 | 40 | 58-93Gb | 3h 18 mn-9h 24 mn | ||
VCF | RepeatMasker | 1 | 40 | 5Gb | 7h 24 mn | ||
Z. mays | 2.4 Gbp / 85% | assembly | minimap2 | 1 | 40 | 143G | 74.4 mn |
70X PacBio longreads | minimap2 | 1 | 40 | 57G | 13h 42 mn | ||
VCF | RepeatMasker | 1 | 40 | 31G | 203 mn | ||
VCF | make_graph (vg) | 1 | 40 | 11G | 4 mn | ||
70X PacBio longreads | GraphAligner | 1 | 40 | 66 G | 11h 25 mn | ||
Algined GAM | vg call | 1 | 40 | 12 G | 8 mn |
The default parameters, in particular for request of RAM and execution time, may be inssuficient for large, complex and repeat rich genomes such as Maize and other models. Nextflow's error message may be hard to interpret and sometimes misleading with regards to the actual cause of the error. We advise users that suspect their model to be challenging for GraffiTE to initially use as much ressource are necessary. So far, a higher bound of 120h and 400Gb per process (these being requested resources, not actual usage -- for actual usage, see above) have been reported to allow successful run with Maize models, the most ressource intensive step being long-read alignments.
Please use and abuse the Issue section of this Github page. With the userbase growing, it becomes more likely that someone else has already encountered a similar issue. If not, other users will benefits from your experience! We will try to respond swiftly to any help request, and the Issue page is the only place we actively monitor for user support. Thank you!
The "stitching" method to identify unique TE insertion from fragmented hits has some degree of limitation. This can be flagrant for full-length LTR insertion, which can show n_hits
> 1, and thus wont be recognized as a "single" element insertion, nor run through the TSD module. For now, names between LTR and I(nternal) sequences much match in the header name (e.g. TIRANT_LTR and TIRANT_I) to be automatically recognized as a single hit. We will make use of the RepeatMasker hit ID in order to improve this stitching procedure. In the meantime, we recommend to check/rename your LTR of interest in the --TE_library
file.
As mentioned above, in order to improve runtime, the TSD module is only run for SVs with a single TE hit. We will improve this feature in order to be able to run the module on all SVs.
There are currently several bottlenecks in the pipeline: samtools sort
can be tricky to parallelize properly (piped from minimap2
alignments, which are often fast) and the performance will depends on the genomes size, complexity and the parameter used. RepeatMasker
can be slow with a large number of SVs and a large library, hang-on! If you find satisfactory combinations of parameters for your model, please share them in the issues section! Thanks!
Groza, C., Chen, X., Wheeler, T.J. et al. A unified framework to analyze transposable element insertion polymorphisms using graph genomes. Nat Commun 15, 8915 (2024). https://doi.org/10.1038/s41467-024-53294-2