after repeatmasker TELR fails with an error "/bin/sh: 1: file_name.realign.sort.bam: not found"

SergeiF1987 commented 2 years ago

Hi Shunhua,

Thank you for this software! At first glance, it has everything I was looking for to identify TE insertions, including allele frequency. The test run passed successfully, but when I use my own data it fails with the following error:

analyzing file /mnt/raid/sergey/bio-first/insertion_analysis/test/output/intermediate_files/101N_passed.part-01.te.fa
identifying matches to dvir_full-size_TEs.fasta sequences in batch 1 of 1
processing output:
cycle 1
cycle 2
cycle 3
cycle 4
cycle 5
cycle 6
cycle 7
cycle 8
cycle 9
cycle 10
Generating output...
masking
done
Done

Successfully created the directory /mnt/raid/sergey/bio-first/insertion_analysis/test/output/intermediate_files/telr_reads

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
-a output all positions (including zero depth)
-a -a (or -aa) output absolutely all positions, including unused ref. sequences
-b list of positions or regions
-f list of input BAM filenames, one per line [null]
-l read length threshold (ignore reads shorter than ) [0]
-d/-m maximum coverage depth [8000]. If 0, depth is set to the maximum
integer value, effectively removing any depth limit.
-q base quality threshold [0]
-Q mapping quality threshold [0]
-r chr:from-to region
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name,
position, and coverage depth. Note that positions with zero coverage may be
omitted by default; see the -a option.

/bin/sh: 1: _137386_137390:5972-6022: not found
/bin/sh: 1: _137386_137390.realign.sort.bam: not found
Traceback (most recent call last):
File "/mnt/raid/sergey/miniconda/envs/TELR/bin/telr", line 10, in <module>
sys.exit(main())
File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/site-packages/telr/telr.py", line 129, in main
args.thread,
File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/site-packages/telr/TELR_te.py", line 677, in get_af
bam, contig_name, start, end, te_interval_size, te_offset
File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/site-packages/telr/TELR_te.py", line 839, in get_te_cov
start + te_offset + te_interval_size,
File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/site-packages/telr/TELR_te.py", line 867, in get_median_cov
median_cov = statistics.median(covs)
File "/mnt/raid/sergey/miniconda/envs/TELR/lib/python3.6/statistics.py", line 380, in median
raise StatisticsError("no median for empty data")
statistics.StatisticsError: no median for empty data

Could you take a look at this issue?

Thanks in advance! Best, Sergei

shunhuahan commented 2 years ago

@SergeiF1987 Thanks for opening this issue!
My initial impression is that the issue may be caused by input file naming or formatting, but I need more information to confirm whether this is the case for your run.
For the real data run, could you share the TELR.log file under the TELR output directory?
Are the output messages you posted in https://github.com/bergmanlab/TELR/issues/16#issue-1023823029 the complete standard output/errors? If not, could you share the complete standard output/errors with me? These files should help me better understand what the problem is. If the message file is too large, you could send it to my Gmail address (hanshunhua0829).
Thanks a lot!

SergeiF1987 commented 2 years ago

Thanks for such a quick reply! I have attached the log file: TELR.log

I'm not sure I understand where to find the complete standard output/errors. My TELR run created only a folder "intermediate_files" and 2 files: "log" and "loci_eval.tsv". I'm able to share everything you need, but could you specify what kind of file you want me to share?

Best, Sergei

shunhuahan commented 2 years ago

It appears that the file names are not the cause of the run failure. I tried renaming the input files for the test run based on your real data file names and the new test run finished successfully.
By "standard output/errors" I mean the messages that are automatically printed in the terminal when you run TELR interactively. You could either copy the messages from the terminal or redirect both standard output and standard error into a file when you run TELR (see example below). See more in https://linuxize.com/post/bash-redirect-stderr-stdout/.
```
telr -o test_output -i 101M_passed.part-01.fastq -r 9_genome_contig10.fasta -l dvir_full-size_TEs.fasta &> test.log
```
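Note that `&>` is a bash/zsh shorthand; a plain POSIX `sh` parses it differently. If that is a concern, the same capture can be done from Python. This is only a minimal sketch, assuming `telr` is on the PATH; `run_and_log` is a hypothetical helper name, not part of TELR:

```python
import subprocess
import sys


def run_and_log(command, log_path):
    """Run `command` (a list of arguments) and write stdout and stderr to `log_path`.

    Returns the exit code of the command.
    """
    with open(log_path, "w") as log:
        # stderr=subprocess.STDOUT merges both streams into the same file,
        # which is what `&> test.log` does in bash.
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python run_and_log.py telr -o test_output -i reads.fastq ...
    sys.exit(run_and_log(sys.argv[1:], "test.log"))
```

Passing the command as a list avoids going through a shell at all, so no redirection syntax is involved.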

SergeiF1987 commented 2 years ago

Ohh, I get it. Here you are: test.log and terminal_output.log

shunhuahan commented 2 years ago

Thanks for sharing these files @SergeiF1987.
My hypothesis is that the failure has something to do with the contig names in the input reference genome. For example, a contig name in your reference genome file might include a string like `>contig_10;` instead of `>contig10`. I was able to reproduce the error message from your run by adding a `;` to the test data.
Let me know if my hypothesis fits your input reference genome file. If so, you should be able to remove the `;` and get the TELR run to complete. I can also try to make an update to allow `;` in contig names.
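For context, the shell treats an unquoted `;` as a command separator, which would explain why the text after it shows up in the `/bin/sh: 1: ...: not found` messages. Below is a minimal sketch for spotting and renaming such contig names before a run, plus an example of quoting a name when building a command string. The character list and all function names are assumptions made for illustration; none of this is TELR code:

```python
import shlex

# Characters assumed to be risky when a contig name is pasted unquoted into
# a /bin/sh command line (illustrative list, not taken from TELR).
UNSAFE_CHARS = set(";&|<>$`\\\"'()*?!{}[]")


def contig_name(header):
    """Return the contig name: header text after '>' up to the first whitespace."""
    fields = header[1:].split()
    return fields[0] if fields else ""


def find_unsafe_contigs(fasta_lines):
    """List contig names containing at least one character from UNSAFE_CHARS."""
    names = (contig_name(line) for line in fasta_lines if line.startswith(">"))
    return [name for name in names if UNSAFE_CHARS & set(name)]


def sanitize_name(name, replacement="_"):
    """Replace every unsafe character in a contig name with `replacement`."""
    return "".join(replacement if char in UNSAFE_CHARS else char for char in name)


def quoted_depth_command(bam, region):
    """Build a `samtools depth -r REGION BAM` string with shell-safe quoting."""
    return "samtools depth -r {} {}".format(shlex.quote(region), shlex.quote(bam))


if __name__ == "__main__":
    example = [">contig_10;", "ACGT", ">contig11", "GGCC"]
    print(find_unsafe_contigs(example))
    print(sanitize_name("contig_10;"))
    print(quoted_depth_command("reads.bam", "contig_10;:5972-6022"))
```

If names are rewritten this way, it is worth checking that they remain unique afterwards, since two different names could collapse to the same sanitized string.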

SergeiF1987 commented 2 years ago

Wow! That one is solved, but now another one has come up:

ERROR: Requested column 2, but database file /mnt/raid/sergey/bio-first/insertion_analysis/test_output_3/intermediate_files/liftover_report.sort.bed only has fields 1 - 0.

Logs are attached: terminal_output.log and test.log. Should I open another issue?

shunhuahan commented 2 years ago

Thanks for the report! @SergeiF1987
It's possible that this issue is still caused by the contig name format in the reference genome, so we can use the same issue for now. Could you share the contig names extracted from the modified 9_genome_contig10.fasta?
Btw, it may be helpful if I have access to the raw input files through Google Drive if the issue doesn't turn out to be contig name related.

SergeiF1987 commented 2 years ago

Sorry for the late reply. Thank you for giving so much attention to my issue! The raw data is here: https://drive.google.com/drive/folders/1TWYXPI4rPjxzlrl6HJ8-0KhRL-4cLraI?usp=sharing

cbergman commented 2 years ago

On Jan 19, 2022 we ran the most recent TELR on the dataset provided by @SergeiF1987 and got the same error message that was reported on Oct 12, 2021. We think the issue is caused by no TE being present in the local contig that can be lifted over to the reference genome (in TELR, for every TE candidate, we align flanking sequences in the TELR-assembled local contig to the reference genome to identify the precise insertion coordinates). This might be caused by nested TEs, adjacent TEs, or a true negative. We will improve the error handling at the liftover step in cases like this. Thanks for providing this useful test case!
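One way such cases could be handled is to check for empty intermediate results before passing them on. The following is a minimal sketch, assuming the failures come from an empty liftover BED file or an empty list of coverage values; `bed_has_records` and `median_or_none` are hypothetical helpers, not TELR functions:

```python
import statistics


def bed_has_records(lines):
    """Return True if any line is a data record (not blank, not a '#' comment)."""
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return True
    return False


def median_or_none(covs):
    """Return the median coverage, or None instead of raising on empty input."""
    return statistics.median(covs) if covs else None


if __name__ == "__main__":
    print(bed_has_records([]))
    print(bed_has_records(["contig10\t5972\t6022\n"]))
    print(median_or_none([]))
    print(median_or_none([12, 30, 18]))
```

`bed_has_records` accepts any iterable of lines, so an open file handle can be passed directly. A caller could then skip the affected TE candidate and log a warning rather than stopping the whole run.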
