run in ERROR "Repeatmasking VCF insertion sequences failed, exiting..."

wangnan9394 commented 3 years ago

Hi, the run on the test folder works fine. It's great software. But when I tested 2.2 Gb of reads on my own genome, the process broke; the details are below. Unfortunately, I could not find the cause. Could you give me a hand?

Master RepeatMasker Database: /root/miniconda/envs/TELR_env/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 )
Custom Repeat Library: Csiv4.chromosome.fa.mod.EDTA.TElib.fa

Warning...unknown stuff <
>

analyzing file /work/test6-24_output/intermediate_files/unishu.merge.vcf_ins.fasta
identifying matches to Csiv4.chromosome.fa.mod.EDTA.TElib.fa sequences in batch 1 of 1

No repetitive sequences were detected in /work/test6-24_output/intermediate_files/unishu.merge.vcf_ins.fasta
[Errno 2] No such file or directory: '/work/test6-24_output/intermediate_files/vcf_ins_repeatmask/unishu.merge.vcf_ins.fasta.out.gff'
Repeatmasking VCF insertion sequences failed, exiting...
(TELR_env) root@6b12b58b46ff:/work#

Best, Nan

wangnan9394 commented 3 years ago

Does the VCF from pbsv (pbmm2 aligner) also work in this pipeline? :)

shunhuahan commented 3 years ago

Hi @wangnan9394,

Sorry for the late reply! It appears that none of the representative SV reads in the Sniffles VCF file could be repeatmasked with your custom repeat library. This could be because 1) the custom repeat library is incomplete or of low quality, or the inserted sequence is too divergent from the consensus sequences in your library file, or 2) the putative SV reads provided by Sniffles are incorrect and do not contain a TE insertion.
I will update TELR so that this error is handled more gracefully and the user is told what might have caused it (as suggested above).
If it turns out that the custom repeat library is incomplete or too divergent from the real inserted sequence: we are currently working on an upgrade to TELR so that it can impute new inserted TE sequences without a library file.
Hope that answers your question!
Currently TELR doesn't accept a pre-computed VCF file. We did some internal evaluations and determined that Sniffles is the most reliable program for providing SV candidates. However, we might add other SV callers (e.g. SVIM, cuteSV, pbsv) as an optional argument in the future to increase flexibility. Thanks for the suggestion!
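
For readers hitting the same message: the log above shows that RepeatMasker did not write a .out.gff file when it detected no repeats, and the later file access then failed with Errno 2. A minimal sketch of a more informative check is given below. It is not TELR's actual code; the function name `require_repeatmask_gff` and the hint text are hypothetical and only restate the two causes listed above.

```python
from pathlib import Path

# Hypothetical hint text; it only restates the causes discussed in this thread.
NO_HITS_HINT = (
    "RepeatMasker reported no hits. Possible causes: the custom repeat library "
    "is incomplete or too divergent from the inserted sequences, or the "
    "SV reads do not contain a TE insertion."
)


def require_repeatmask_gff(gff_path):
    """Return the GFF path if it exists, else raise an explanatory error."""
    gff = Path(gff_path)
    if not gff.is_file():
        # Fail with a readable message instead of a bare Errno 2.
        raise RuntimeError(f"{gff} was not created. {NO_HITS_HINT}")
    return gff
```

Calling this right after the RepeatMasker step would turn the bare "No such file or directory" into a message that points at the library or the SV reads.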

Shunhua

SergeiF1987 commented 2 years ago

Hi Shunhua,

thanks a lot for this software. I would be happy to use it. Unfortunately, I get the issue mentioned previously by wangnan9394 but for test data.

Master RepeatMasker Database: /mnt/raid/sergey/miniconda/envs/TELR/share/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: dc20170127 )
Custom Repeat Library: /mnt/raid/sergey/bio-first/insertion_analysis/test_telr_default/output/intermediate_files/library.fasta

Warning...unknown stuff <

File /mnt/raid/sergey/bio-first/insertion_analysis/test_telr_default/output/intermediate_files/reads.vcf_ins.fasta appears to be empty.
[Errno 2] No such file or directory: '/mnt/raid/sergey/bio-first/insertion_analysis/test_telr_default/output/intermediate_files/vcf_ins_repeatmask/reads.vcf_ins.fasta.out.gff'
Repeatmasking VCF insertion sequences failed, exiting...

Do you know what could be the reason for that?

By the way, when I use my own data this step seems to pass, but I get another error.

analyzing file /mnt/raid/sergey/bio-first/insertion_analysis/test/output/intermediate_files/101N_passed.part-01.te.fa
identifying matches to dvir_full-size_TEs.fasta sequences in batch 1 of 1
processing output:
cycle 1 cycle 2 cycle 3 cycle 4 cycle 5 cycle 6 cycle 7 cycle 8 cycle 9 cycle 10
Generating output...
masking done
Done

Successfully created the directory /mnt/raid/sergey/bio-first/insertion_analysis/test/output/intermediate_files/telr_reads

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
   -a                  output all positions (including zero depth)
   -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
   -b                  list of positions or regions
   -f                  list of input BAM filenames, one per line [null]
   -l                  read length threshold (ignore reads shorter than ) [0]
   -d/-m               maximum coverage depth [8000]. If 0, depth is set to the maximum integer value, effectively removing any depth limit.
   -q                  base quality threshold [0]
   -Q                  mapping quality threshold [0]
   -r                  region
   --input-fmt-option OPT[=VAL]  Specify a single input file format option in the form of OPTION or OPTION=VALUE
   --reference FILE    Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name, position, and coverage depth. Note that positions with zero coverage may be omitted by default; see the -a option.

/bin/sh: 1: _137386_137390:5972-6022: not found
/bin/sh: 1: _137386_137390.realign.sort.bam: not found
Traceback (most recent call last):
  File "/mnt/raid/sergey/miniconda/envs/TELR/bin/telr", line 10, in <module>
    sys.exit(main())
  File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/site-packages/telr/telr.py", line 129, in main
    args.thread,
  File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/site-packages/telr/TELR_te.py", line 677, in get_af
    bam, contig_name, start, end, te_interval_size, te_offset
  File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/site-packages/telr/TELR_te.py", line 839, in get_te_cov
    start + te_offset + te_interval_size,
  File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/site-packages/telr/TELR_te.py", line 867, in get_median_cov
    median_cov = statistics.median(covs)
  File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/statistics.py", line 380, in median
    raise StatisticsError("no median for empty data")
statistics.StatisticsError: no median for empty data

but probably I need to open another issue for this.

thanks in advance for your reply. Best, Sergei
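
A note on the second traceback: the two `/bin/sh: ... not found` lines suggest that the shell split the samtools command partway through a name, which would leave `samtools depth` without valid arguments (hence the usage text) and `statistics.median` with an empty list. One plausible cause, stated here as an assumption and not confirmed in this thread, is a contig name containing a shell-special character such as `|` or `;`. The sketch below shows one way to avoid both failure points: passing an argument list so no shell is involved, and guarding the median. All function names are hypothetical and are not TELR's implementation; only the `-aa` and `-r` options from the usage text above are used.

```python
import statistics
import subprocess


def depth_command(bam, contig, start, end):
    """Build an argument list for `samtools depth`.

    No shell parses this list, so characters like '|' or ';' in the
    contig name are passed to samtools literally.
    """
    region = f"{contig}:{start}-{end}"
    return ["samtools", "depth", "-aa", "-r", region, bam]


def parse_depth(text):
    """Extract coverage (third tab-separated column) from depth output."""
    covs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3:
            covs.append(int(fields[2]))
    return covs


def median_cov(covs):
    """Return the median coverage, or None if no positions were reported."""
    return statistics.median(covs) if covs else None


def get_median_cov(bam, contig, start, end):
    """Hypothetical wrapper: run samtools (must be on PATH) and summarize."""
    result = subprocess.run(
        depth_command(bam, contig, start, end),
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; also works on Python 3.6
        check=True,
    )
    return median_cov(parse_depth(result.stdout))
```

Returning None for an empty interval leaves it to the caller to decide whether to skip the locus or report it, instead of aborting the whole run.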

shunhuahan commented 2 years ago

Thanks for reporting these issues @SergeiF1987 and sorry for the late reply! Last week was a bit crazy.
For the first issue, it appears that the insertion sequence file extracted by the SV workflow (reads.vcf_ins.fasta) is empty, which is different from the issue reported in https://github.com/bergmanlab/TELR/issues/5#issue-928853284.
To make sure I can reproduce the error, could you confirm that you are using the latest version of TELR (https://github.com/bergmanlab/TELR/commit/47a0e23f8718df918e6f073c25130c2bdd1bd15f) and installed the conda environment according to the README? https://github.com/bergmanlab/TELR/blob/master/docs/01_Installation.md
Also, is the first issue reported in https://github.com/bergmanlab/TELR/issues/5#issuecomment-934386774 from a clean test data run (i.e. no re-run of the same job without removing existing output folder)?
Could you send me a complete copy of the log file for the test data run? My gmail address is hanshunhua0829. Thanks a lot.
The second issue doesn't appear to be repeatmasker related. Could you open another issue page and copy the report over to the new issue? Thanks! I will take a look there. My initial impression is that it has something to do with the format of the input reference genome.
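
The two failure modes in this thread differ in whether the insertion FASTA has any records: in the first report RepeatMasker analyzed the file and found no repeats, while in the test-data report the file itself was empty. A quick way to tell them apart before RepeatMasker runs is to count FASTA headers. This is a minimal sketch with a hypothetical helper name, not part of TELR.

```python
from pathlib import Path


def count_fasta_records(fasta_path):
    """Count records in a FASTA file by counting lines starting with '>'.

    A missing file is treated the same as an empty one (0 records).
    """
    path = Path(fasta_path)
    if not path.is_file():
        return 0
    with path.open() as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

A count of 0 for reads.vcf_ins.fasta points at the SV-calling step upstream, whereas a positive count with no RepeatMasker hits points at the repeat library or the SV reads.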

SergeiF1987 commented 2 years ago

Thanks for your reply! It seems that installing TELR via conda does not give the latest version of the program. I reinstalled it using git clone and then switched to that version (git checkout 47a0e23f8718df918e6f073c25130c2bdd1bd15f). The test run completed successfully, but the second issue with my own data unfortunately wasn't solved. I will open a new issue for that. Thanks!

shunhuahan commented 2 years ago

Thanks for letting me know that the test run now works properly with the latest TELR version, @SergeiF1987.
I'm closing this issue. Feel free to reopen it if the same issue pops up again.
