cgroza / GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.

GraffiTE ends prematurely at "tsd_prep" with no output. [FIXED: check chromosome names] #9

Closed clemgoub closed 1 year ago

clemgoub commented 1 year ago
    Thanks for the tip. I ended up removing the `--contain` flag altogether, since it always insisted on using the $TMPDIR path whatever I tried to bind. That worked, but now the pipeline stops after the first TSD step. The job finished as successful and no error was generated, but I think the pipeline is incomplete, as there is no 3_... folder in the output. The last result folder generated is `2_Repeat_Filtering`, which contains:

genotypes_repmasked_filtered.vcf repeatmasker_dir

The job output:

executor >  local (1)
[ba/f7b952] process > svim_asm (5)     [100%] 16 of 16, cached: 16 ✔
[06/c2782e] process > repeatmasker (1) [100%] 1 of 1, cached: 1 ✔
[bd/7be51f] process > tsd_prep (1)     [100%] 1 of 1 ✔
[-        ] process > tsd_search       -
[-        ] process > tsd_report       -

The .command.sh:

#!/bin/bash -ue
ls *.vcf > vcfs.txt
SURVIVOR merge vcfs.txt 0.1 0 0 0 0 100 genotypes.vcf
repmask_vcf.sh genotypes.vcf genotypes_repmasked.vcf.gz combi_repmod_repbase_26_01_dfam_3_5_insecta.lib
bcftools view -G genotypes_repmasked.vcf.gz |     awk -v FS='   ' -v OFS='  '     '{if($0 ~ /#CHROM/) {$9 = "FORMAT"; $10 = "ref"; print $0} else if(substr($0, 1, 1) == "#") {print $0} else {$9 = "GT"; $10 = "1|0"; print $0}}' |     awk 'NR==1{print; print "##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">"} NR!=1' |     bcftools view -i 'INFO/total_match_span > 0.80' -o genotypes_repmasked_temp.vcf
fix_vcf.py --ref hifiasm_scaff10x_arks.fa.masked --vcf_in genotypes_repmasked_temp.vcf --vcf_out genotypes_repmasked_filtered.vcf

Originally posted by @dewuem in https://github.com/cgroza/GraffiTE/issues/8#issuecomment-1331122268

clemgoub commented 1 year ago

This issue happens when commas are found in the chromosome (or scaffold, contig) names. See thread #8 for the details. It seems that replacing commas with underscores doesn't fix it, but plain-text contig names work. Re-open if the issue still occurs!
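
To check a reference or assembly FASTA for this before launching the pipeline, something like the following minimal sketch may help. It is not part of GraffiTE: the function names and the `strict` option are hypothetical, and it assumes the sequence name is the text after `>` up to the first whitespace. By default it only flags names containing a comma (the case reported in this thread). Since it is unclear from the thread exactly which characters are safe, `strict=True` flags anything other than letters and digits, which is a conservative guess rather than a documented requirement.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Assumption: "plain text" is read here as ASCII letters and digits only.
# The thread does not spell out which characters are actually safe.
STRICT_NAME = re.compile(r"^[A-Za-z0-9]+$")


def sequence_name(header_line: str) -> str:
    """Return the sequence name: text after '>' up to the first whitespace."""
    fields = header_line[1:].split()
    return fields[0] if fields else ""


def find_suspect_names(
    fasta_lines: Iterable[str], strict: bool = False
) -> List[Tuple[int, str]]:
    """Return (line_number, name) for every FASTA header that looks risky.

    With strict=False, only names containing a comma are reported.
    With strict=True, any name that is not purely alphanumeric is reported.
    """
    suspects = []
    for line_number, line in enumerate(fasta_lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        name = sequence_name(line)
        if strict:
            risky = STRICT_NAME.match(name) is None
        else:
            risky = "," in name
        if risky:
            suspects.append((line_number, name))
    return suspects


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_fasta_names.py assembly.fa [--strict]
    with open(sys.argv[1]) as handle:
        hits = find_suspect_names(handle, strict="--strict" in sys.argv[2:])
    for line_number, name in hits:
        print(f"line {line_number}: {name}")
```

If any names are reported, renaming those sequences in the FASTA and rerunning from the start (so the VCFs are regenerated with the new names) would be the thing to try, following the fix described above.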