GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.
Replace `pangenome.vcf` with a `presence-absence.vcf` as main output, but keep it to build the graph genomes #30
Replace `pangenome.vcf` with a `presence-absence.vcf` in the `3_TSD_Search/` output folder. This new file will show one genotype column per sample, but the calls are only 1 or 0 (i.e. identical to the SUPP_VEC field). We still need to output `pangenome.vcf` for compatibility with the option `--graffite-vcf` (which skips SV search and annotation, and uses the provided VCF to build the graph and map reads). Alternatively, don't output `pangenome.vcf`, but keep it internally to build the graph if needed. This would require modifying the routines for `--graffite-vcf` to strip the genotype columns and replace them with a single column in which all variants are `1|0`.
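The genotype-stripping step for `--graffite-vcf` could look roughly like the sketch below. It is not taken from the GraffiTE code base: the function name `collapse_genotypes` and the placeholder sample name `graph_input` are hypothetical, and the sketch assumes a tab-separated VCF whose first eight columns are the standard fixed fields.

```python
def collapse_genotypes(vcf_lines, sample_name="graph_input"):
    """Replace all per-sample genotype columns with one column set to 1|0.

    `sample_name` is a hypothetical placeholder, not a GraffiTE convention.
    """
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            out.append(line)
        elif line.startswith("#CHROM"):
            # Keep the 8 fixed columns, then FORMAT and a single sample column.
            fixed = line.split("\t")[:8]
            out.append("\t".join(fixed + ["FORMAT", sample_name]))
        elif line:
            # Variant record: drop existing genotypes, write a single 1|0 call.
            fixed = line.split("\t")[:8]
            out.append("\t".join(fixed + ["GT", "1|0"]))
    return out
```

Reading the lines from a file and writing the returned list back out would be enough to test whether the graph-building step accepts the collapsed VCF.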
I anticipate a possible source of confusion, as "presence-absence" could be interpreted as the presence or absence of a TE rather than the presence or absence of the variant. Perhaps a solution is to output two files: one in VCF format, respecting the VCF convention, called `GraffiTE_variants_presence-absence.vcf`, and the other a TSV table, identical to the non-header lines of the VCF but with the DEL calls inverted to match the presence/absence pattern of the TEs for each sample. We could call this file `GraffiTE_TE_presence-absence.tsv`.
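A minimal sketch of how the TSV could be derived from the VCF, assuming that deletions are marked with `SVTYPE=DEL` in the INFO column and that each sample column holds a bare `0` or `1`. The function name `te_presence_rows` is hypothetical; calls other than `0` or `1` (for example `.`) are left untouched.

```python
import re

# Matches SVTYPE=DEL as a whole INFO entry (not e.g. SVTYPE=DELINS).
_DEL_RE = re.compile(r"(?:^|;)SVTYPE=DEL(?:;|$)")
_FLIP = {"0": "1", "1": "0"}


def te_presence_rows(vcf_lines):
    """Return the non-header VCF lines with DEL calls inverted.

    For a DEL relative to the reference, the alt allele (1) means the TE
    is absent, so the call is flipped to express TE presence/absence.
    """
    rows = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header and empty lines
        fields = line.split("\t")
        calls = fields[9:]  # per-sample columns follow the FORMAT column
        if _DEL_RE.search(fields[7]):
            calls = [_FLIP.get(c, c) for c in calls]
        rows.append("\t".join(fields[:9] + calls))
    return rows
```

Insertion records pass through unchanged, so the TSV differs from the VCF body only on the DEL lines.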
Of course, the documentation will need to be updated accordingly.
This change has several advantages:
1) it is more explicit and easier to interpret: 1 (alt allele) or 0 (ref allele) in the VCF for each variant/sample combination, or 1 (TE presence) or 0 (TE absence) in the TSV for each TE/sample.
2) it should be easier to parse than the SUPP_VEC.
3) it avoids having to pull the `vcf.txt` file in order to know which position of the SUPP_VEC corresponds to which sample.
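For comparison, this is roughly the lookup that the SUPP_VEC currently requires. The sketch assumes the sample order has already been recovered (for instance from `vcf.txt`) and is passed in as a list; the function name `supp_vec_calls` is hypothetical.

```python
def supp_vec_calls(info, samples):
    """Map each sample name to its 0/1 call from the SUPP_VEC INFO entry.

    `samples` must be in the same order as the positions of SUPP_VEC.
    """
    # Parse key=value INFO entries; flag-style entries without '=' are ignored.
    entries = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    vec = entries["SUPP_VEC"]
    if len(vec) != len(samples):
        raise ValueError("SUPP_VEC length does not match the number of samples")
    return {sample: int(bit) for sample, bit in zip(samples, vec)}
```

With one genotype column per sample, the sample names come directly from the `#CHROM` header line and this separate mapping step is no longer needed.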