bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
69 stars 8 forks source link

IndexError: list index out of range #43

Closed zaka-edd closed 4 months ago

zaka-edd commented 4 months ago

I am trying to run STRaglr 1.5.0 (current release) from a docker container, on ONT data with a catalog of TR regions for chr21 as the input bed file. I kept getting an error so I tried playing around with some parameters. However I kept getting the same error. The command I ran looked like this:

straglr.py map-sminimap2-HG002_hg38_chr21.bam ../chr21/chr21.fa output_straglr --loci HG002_repeats_straglr.bed --min_support 1 --min_ins_size 0 --min_cluster_size 0 --nprocs 5

However I get this error:

multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.10/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1252, in get_alleles
    self.update_refs(variants, genome_fasta)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1271, in update_refs
    refs = self.extract_refs_trf(trf_input)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 607, in extract_refs_trf
    data_motif = cols[3]
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/straglr.py", line 101, in <module>
    main()
  File "/usr/local/bin/straglr.py", line 93, in main
    variants = tre_finder.genotype(args.loci)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1426, in genotype
    return self.collect_alleles(loci)
  File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1398, in collect_alleles
    batched_results = parallel_process(self.get_alleles, batches, self.nprocs)
  File "/usr/local/lib/python3.10/site-packages/src/utils.py", line 21, in parallel_process
    results = p.map(func, args)
  File "/usr/local/lib/python3.10/site-packages/pathos/multiprocessing.py", line 154, in map
    return _pool.map(star(f), zip(*args), **kwds)
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
IndexError: list index out of range

I tried running a debug like

straglr.py map-sminimap2-HG002_hg38_chr21.bam  ../chr21/chr21.fa output_straglr  \
 --loci ../chr21/chr21.fa  \
 --min_support 1 \
 --min_ins_size 0 \
 --min_cluster_size 0 \
 --nprocs $CORES \
 --tmpdir $DEBUG_DIR_STRAGLR/tmp/ \
 --debug > $DEBUG_DIR_STRAGLR/debug.log 2>&1

Any suggestions on what could cause the error? I included all the debug files and the input region bed file. chr21_regions.bed.gz debug.log.gz tmp.tar.gz

readmanchiu commented 4 months ago

Thanks for trying Straglr and for providing the files, it really helps debugging. I think the culprit is a few of the loci have very long motif, which cause problem for TRF as Straglr incorporates the motif sequence in the sequence header and TRF has an upper length limit for sequence headers. You can try removing the loci with over 100bp motif and that should probably solve the problem. Also, Staglr is not meant for genotyping homopolymer, so you should also get rid of the homompolyer loci (including the homopolymers will cause problems) You can use awk to do this:

awk 'length($4)<=100 && length($4)>=2' chr21_regions.bed > new_chr21_regions.bed

Also, you should use --min_cluster_size 2 and --min_support 2, having at least 2 reads to support a genotype call makes more sense

--min_support 2 --min_cluster_size 2 --max_str_len 100 --min_str_len 2 

Note that setting --max_str_len won't solve the TRF problem, you have to manually eliminate the long-motif loci first, like using the awk command as suggested

zaka-edd commented 4 months ago

Thank you for your help! I will make sure to exclude these regions.