bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

IndexError: list index out of range #41

Open lfearnley opened 5 months ago

lfearnley commented 5 months ago

I am running STRaglr 1.5.0 (the current release) and get the following error on multiple CRAMs:

python straglr.py NA19676.hg38.cram ../1KG_ONT_VIENNA_hg38.fa NA19676.hg38_straglr_patho --loci ../20240328_straglr_catalog.bed 
Traceback (most recent call last):
  File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/straglr.py", line 101, in <module>
    main()
  File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/straglr.py", line 98, in main
    tre_finder.output_vcf(variants, '{}.vcf'.format(args.out_prefix))
  File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/tre.py", line 1513, in output_vcf
    fails = Variant.find_fails(variants)
  File "/vast/scratch/users/fearnley.l/1KG_ONT_VIENNA/straglr/src/variant.py", line 244, in find_fails
    failed_reason = Counter(failed_reasons).most_common(1)[0][0]
IndexError: list index out of range

Any suggestions as to what might cause this?

readmanchiu commented 5 months ago

The error is caused by a lack of coverage at a given locus. An example case is when there are only 2 supporting reads for a locus and each has a different repeat size: if min_support is set to 2, no allele can be formed with the minimum support. The new version, which produces VCF output, tries to associate a FILTER with each failed locus. Because I wasn't able to anticipate this scenario, no failure reason was generated for it and the script therefore crashed. I have made a fix that produces a CLUSTERING_FAILED filter for this case and will release it shortly. In the meantime, if you want to get past this, you can set --min_cluster_size 1 and the program should be able to finish. Thanks very much for reporting this bug.
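
Until the fix is released, here is a minimal sketch of the failure mode and the kind of guard the fix implies (the variable names and fallback assignment below are illustrative assumptions, not the actual patch):

    from collections import Counter

    # Hypothetical locus: reads were examined, but no per-read failure reason was recorded,
    # e.g. two supporting reads with different repeat sizes while min_support is 2.
    failed_reasons = []

    # Counter([]).most_common(1) returns an empty list, so indexing [0] raises the
    # IndexError reported above. Guarding the empty case avoids the crash.
    if failed_reasons:
        failed_reason = Counter(failed_reasons).most_common(1)[0][0]
    else:
        failed_reason = 'CLUSTERING_FAILED'   # fallback FILTER described in the fix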

zaka-edd commented 5 months ago

Hi, I have been having the same issue. I tried setting --min_cluster_size 1, but it did not fix the error for me. Do you know of another problem that could be the cause? I ran:

straglr.py map-sminimap2-HG002_hg38_chr21.bam .../chr21_test_data/chr21.fa output_straglr --loci HG002_repeats_straglr.bed --min_cluster_size 1


  Traceback (most recent call last):
    File "/usr/local/bin/straglr.py", line 101, in <module>
      main()
    File "/usr/local/bin/straglr.py", line 93, in main
      variants = tre_finder.genotype(args.loci)
    File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1426, in genotype
      return self.collect_alleles(loci)
    File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1402, in collect_alleles
      tre_variants = self.get_alleles(loci)
    File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1252, in get_alleles
      self.update_refs(variants, genome_fasta)
    File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 1271, in update_refs
      refs = self.extract_refs_trf(trf_input)
    File "/usr/local/lib/python3.10/site-packages/src/tre.py", line 607, in extract_refs_trf
      data_motif = cols[3]
  IndexError: list index out of range

readmanchiu commented 5 months ago

This is a different problem. It looks like something went wrong when the script parsed the results of the TRF run. Can you try running with --tmpdir <path> --debug, where <path> can be set to your output directory? This way the temporary files will be kept. I want to see if there is anything wrong with the latest ***.dat (TRF output) created. You can first check that the TRF output is there. If you only have a few loci, maybe you can post the content of the .dat file, or attach the file for me to examine. It would be best if you start a new issue for this.
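
For example, rerunning the failing command from above with the temporary files kept (the temporary directory name here is only an illustrative choice):

    straglr.py map-sminimap2-HG002_hg38_chr21.bam .../chr21_test_data/chr21.fa output_straglr --loci HG002_repeats_straglr.bed --min_cluster_size 1 --tmpdir output_straglr_tmp --debug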