bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

ValueError: invalid coordinates #31

Closed jyw-atgithub closed 3 months ago

jyw-atgithub commented 4 months ago

Dear @readmanchiu, I am using the latest straglr.py under Python 3.10.2 on GNU/Linux x86_64. Here is the command:

python3 straglr.py ${aligned_bam}/SRR9951099_ONT.trimmed-ref.SOFT.bam ${ref_genome} SRR9951099_ONT --nprocs 16 --min_ins_size 50 --max_str_len 100

It produced the following error. May I know how to fix it? Thank you!

multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 1135, in get_alleles
    rescued = self.rescue_missed_clipped(missed_clipped, genome_fasta)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 899, in rescue_missed_clipped
    pstart, pend, pseq = self.get_probe(clipped_end, locus, genome_fasta)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 864, in get_probe
    pseq = genome_fasta.fetch(locus[0], max(0, pstart), min(pend, genome_fasta.get_reference_length(locus[0])))
  File "pysam/libcfaidx.pyx", line 288, in pysam.libcfaidx.FastaFile.fetch
  File "pysam/libcutils.pyx", line 256, in pysam.libcutils.parse_region
ValueError: invalid coordinates: start (7740) > stop (3402)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pub/jenyuw/Software/straglr/straglr.py", line 93, in <module>
    main()
  File "/pub/jenyuw/Software/straglr/straglr.py", line 82, in main
    variants = tre_finder.examine_ins(ins, min_expansion=args.min_ins_size)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 1254, in examine_ins
    variants = self.collect_alleles(merged_loci)
  File "/dfs6/pub/jenyuw/Software/straglr/src/tre.py", line 1288, in collect_alleles
    batched_results = parallel_process(self.get_alleles, batches, self.nprocs)
  File "/dfs6/pub/jenyuw/Software/straglr/src/utils.py", line 21, in parallel_process
    results = p.map(func, args)
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/pathos/multiprocessing.py", line 154, in map
    return _pool.map(star(f), zip(*args), **kwds)
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/data/homezvol2/jenyuw/.local/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
ValueError: invalid coordinates: start (7740) > stop (3402)

Here is how I produced the alignment:

minimap2 -t ${nT} -a -x ${mapping_option[$read_type]} -Y \
${ref_genome} ${file} |\
samtools view -b -h -@ ${nT} -o - |\
samtools sort -@ ${nT} -o ${aligned_bam}/${name}.trimmed-ref.SOFT.bam
samtools index -@ ${nT} ${aligned_bam}/${name}.trimmed-ref.SOFT.bam

readmanchiu commented 4 months ago

Thanks for reporting the error. It happened in a step where Straglr extracts a clipped sequence from an alignment and aligns it against the reference to see if it hits anything, in the hope of "rescuing" the alignment for further analysis. Unfortunately, the start and end coordinates do not make sense in this instance; they are off by quite a large margin. It's hard to debug without doing a full analysis, but I will give it a shot if you can re-run with --debug --tmpdir {dirname} added to your command, where {dirname} is a custom empty directory to store the tmp files. Please redirect the stdout from running Straglr to a file so that you can send it to me together with a compressed archive of the tmp files.
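For readers following the traceback, here is a minimal sketch of how clamping a probe interval to the reference bounds can still leave start greater than stop. The stop value 3402 in the error is consistent with it being the contig length returned by get_reference_length, but the thread does not confirm this, so ref_length and pend below are assumptions. clamp_interval is a hypothetical helper, not a Straglr function.

```python
def clamp_interval(start, end, ref_length):
    """Clamp [start, end) to [0, ref_length]; return None if nothing is left."""
    lo = max(0, start)
    hi = min(end, ref_length)
    if lo >= hi:
        # The requested window lies entirely outside the reference
        return None
    return lo, hi


# Values modelled on the traceback; ref_length and pend are assumed.
ref_length = 3402
pstart, pend = 7740, 7790

# Clamping each bound independently, as in the fetch call from the traceback
naive = (max(0, pstart), min(pend, ref_length))
print(naive)

# The guarded version reports that there is no valid window
print(clamp_interval(pstart, pend, ref_length))

# A window overlapping the start of the contig is trimmed, not rejected
print(clamp_interval(-20, 30, ref_length))
```

Under these assumptions the naive bounds come out with start after stop, which is the kind of interval that a FASTA fetch rejects with "invalid coordinates".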

readmanchiu commented 4 months ago

Hi @jyw-atgithub, I've created a branch: https://github.com/bcgsc/straglr/tree/rescue Could you please try this branch on your data to see if the bug still occurs? Thanks!

jyw-atgithub commented 4 months ago

Hello! Thanks for your support! I will upload the subset data after the conference (TAGC2024). In addition, I tried Python 3.8 on our cluster and no error was reported, but the same error was reproduced under Python 3.10.2.

readmanchiu commented 4 months ago

The bugfix will kind of "escape" the problem and let the software move on. To understand what caused the problem I will need to see the data, as I cannot think of how it happened unless there is some unanticipated alignment scenario. Is it a human genome alignment or some other species? Anyway, I wish you a very fruitful conference experience.
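As an illustration of what "escaping" a failure like this can look like in general, here is a minimal sketch that catches the ValueError and skips the offending probe so the remaining loci are still processed. The actual change in the "rescue" branch is not shown in this thread, so this is not a description of it. fetch_or_skip and fake_fetch are hypothetical names, and fake_fetch only mimics the error message from the traceback.

```python
def fetch_or_skip(fetch, chrom, start, end):
    """Call fetch(chrom, start, end); return None instead of raising ValueError."""
    try:
        return fetch(chrom, start, end)
    except ValueError as err:
        # Report the bad interval and let the caller move on
        print(f"skipping {chrom}:{start}-{end}: {err}")
        return None


def fake_fetch(chrom, start, end):
    # Stand-in for a FASTA lookup (assumption); raises like the traceback does
    if start > end:
        raise ValueError(f"invalid coordinates: start ({start}) > stop ({end})")
    return "N" * (end - start)


probes = [("chrA", 10, 20), ("chrA", 7740, 3402)]
results = [fetch_or_skip(fake_fetch, *p) for p in probes]

# Only probes that could be fetched are kept for further analysis
kept = [r for r in results if r is not None]
print(len(kept))
```

The trade-off is the one described above: the run completes, but the underlying cause of the bad coordinates stays hidden unless the skipped intervals are logged and inspected.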

readmanchiu commented 3 months ago

@jyw-atgithub, I wonder if you have had a chance to try the bugfix in the "rescue" branch?

jyw-atgithub commented 3 months ago

Hi @readmanchiu , I am working on it right now. Thanks for the follow up. I will tell you the results tomorrow if everything goes fine.

jyw-atgithub commented 3 months ago

Hi @readmanchiu, thank you very much for your patience! In my sandbox environment, the error did not occur. The commands remain the same. My environment is shown below; it runs on our school's public cluster.

$module load anaconda/2022.05
$conda activate sandbox
(sandbox) $python --version
Python 3.9.12
(sandbox) $which trf
~/.conda/envs/sandbox/bin/trf
(sandbox) $which blastn
~/.conda/envs/sandbox/bin/blastn
(sandbox)  $conda --version
conda 4.12.0
(sandbox)  $cd straglr/
(sandbox)  $git branch -a
* rescue
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
  remotes/origin/rescue
  remotes/origin/v1.2.0a

Straglr was installed with pip install . --user in the directory cloned from GitHub.

readmanchiu commented 3 months ago

Good to know the error is gone. I will merge the branch.