appliedbinf / el_gato

MIT License

IndexError: list index out of range #13

Closed himamura2 closed 8 months ago

himamura2 commented 9 months ago

Dear el_gato developer,

I have been able to run el_gato on FASTQ reads for many samples, but one sample now produces a very odd error. Its depth and MAPQ are better than those of other samples that completed. The el_gato assembly method also worked for it, and gave this ST:

Sample ST flaA pilE asd mip mompS proA neuA_neuAH
Leg1373_S19.contigs 23 2 3 9 10 2 1 6

Have you seen the error shown below?

best, Hideo Imamura UZB

==== The command ====

el_gato.py --threads 4 --read1 Microbiology-Leg1375_S31_R1_001.fastq.gz --read2 Microbiology-Leg1375_S31_R2_001.fastq.gz --depth 0 --out Leg1375_gfqd0 -w

rname numreads covbases coverage meandepth meanbaseq meanmapq

flaA 2752 181 100 1251.98 35.7 59.3
pilE 4029 332 100 1269.2 35.7 59.3
asd 5409 472 100 1299.68 35.7 59.7
mip 5006 401 100 1363.97 35.6 59.6
mompS 8690 351 100 2599.56 35.7 59.6
proA 4881 405 100 1307.3 35.3 59.6
neuA 4526 353 100 1367.97 35.6 59.5
neuAh 0 0 0 0 0 0
neuA_207 0 0 0 0 0 0
neuA_211 0 0 0 0 0 0
neuA_212 0 0 0 0 0 0

[01/09/2024 12:14:01 PM | /nexus/Analysis/microbiology/231222_A00154_1419_BHWKYCDSX7/legGato_231222_1419/Leg1373_gfqd0 ] Finished running samtools coverage
[01/09/2024 12:14:01 PM | /nexus/Analysis/microbiology/231222_A00154_1419_BHWKYCDSX7/legGato_231222_1419/Leg1373_gfqd0 ] minimum depth of flaA locus is 1088.
[01/09/2024 12:14:01 PM | /nexus/Analysis/microbiology/231222_A00154_1419_BHWKYCDSX7/legGato_231222_1419/Leg1373_gfqd0 ] minimum depth of pilE locus is 1103.
[01/09/2024 12:14:02 PM | /nexus/Analysis/microbiology/231222_A00154_1419_BHWKYCDSX7/legGato_231222_1419/Leg1373_gfqd0 ] minimum depth of asd locus is 1067.
[01/09/2024 12:14:02 PM | /nexus/Analysis/microbiology/231222_A00154_1419_BHWKYCDSX7/legGato_231222_1419/Leg1373_gfqd0 ] minimum depth of mip locus is 1174.
[01/09/2024 12:14:02 PM | /nexus/Analysis/microbiology/231222_A00154_1419_BHWKYCDSX7/legGato_231222_1419/Leg1373_gfqd0 ] minimum depth of mompS locus is 2249.

el_gato worked up to this point and then produced the following error.

==== Error messages =====

Traceback (most recent call last):
  File "/home/ngs/.conda/envs/elgato/bin/el_gato.py", line 1848, in <module>
    main()
  File "/home/ngs/.conda/envs/elgato/bin/el_gato.py", line 1834, in main
    output = choose_analysis_path(inputs, Ref)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ngs/.conda/envs/elgato/bin/el_gato.py", line 1677, in choose_analysis_path
    alleles = map_alleles(inputs=inputs, ref=ref)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ngs/.conda/envs/elgato/bin/el_gato.py", line 1473, in map_alleles
    alleles = process_reads(contig_dict, read_info_dict, ref, outdir, inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ngs/.conda/envs/elgato/bin/el_gato.py", line 1350, in process_reads
    agreeing_calls.append(calls[0])


IndexError: list index out of range

== the output folder files ==
reads_vs_all_ref_filt.sam
reads_vs_all_ref_filt_sorted.bam
reads_vs_all_ref_filt_sorted.bam.bai
intermediate_outputs.txt
run.log

Alan-Collins commented 9 months ago

Hi Hideo,

Looking at that error message I think this might be something we fixed in a recent update after someone else reported a similar issue. Would you mind updating to version 1.15.1 and rerunning to see if you still get an error?

Thanks, Alan

himamura2 commented 9 months ago

Thank you Alan. I installed version 1.15.2 via conda. The terminal still reports "el_gato version: 1.14.4", but I assume it is updated. It then produced the following message:

[01/09/2024 02:29:59 PM | /nexus/Analysis/microbiology/231222_A00154_1419_BHWKYCDSX7/legGato_231222_1419/Leg1373_gfqd0 ] ERROR: 3 well-supported mompS alleles identified and can't be resolved. Aborting.

It says there are 3 mompS alleles but does not show which ones. Is it possible to see these predictions? Also, can I trust the el_gato assembly result?

Sample ST flaA pilE asd mip mompS proA neuA_neuAH
Leg1373_S19.contigs 23 2 3 9 10 2 1 6

best, Hideo

Alan-Collins commented 9 months ago

That message indicates that there is at least one pair of positions in the mompS locus at which 3 different combinations of base calls are found. In a case like that, el_gato does not attempt to resolve the possible sequences that are present and so does not report the three possible alleles.
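To illustrate the idea, here is a minimal sketch of counting base-call combinations at two variable sites. It is not el_gato's actual code: the function names, the toy data, and the 5% support threshold are all assumptions made for illustration. With two alleles that differ at both sites you would expect two combinations, so a third well-supported combination suggests a third sequence.

```python
from collections import Counter


def count_combinations(pair_calls):
    """Count each (base at site 1, base at site 2) combination.

    pair_calls: iterable of 2-tuples, one per read pair that covers both sites.
    """
    return Counter(pair_calls)


def well_supported(counts, min_fraction=0.05):
    """Keep combinations seen in at least min_fraction of read pairs.

    min_fraction is a hypothetical threshold, not a value taken from el_gato.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {combo: n for combo, n in counts.items() if n / total >= min_fraction}


# Toy data (invented): three combinations across two sites.
calls = [("A", "C")] * 45 + [("G", "T")] * 40 + [("A", "T")] * 15
supported = well_supported(count_combinations(calls))
print(len(supported), "well-supported combinations")
```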

If you would like to further explore the three alleles that seem to be present in your sample, you can inspect the reads. You should have a file called "reads_vs_all_ref_filt.sam" in the output directory for that sample. That file contains the reads that map to the reference sequences used for allele calling (el_gato/db/ref_gene_regions.fna). You could load those two files into a tool like IGV to view the biallelic sites and assess whether your sample may be contaminated.
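If you prefer a script to a viewer, something like the following rough sketch could tally the bases that reads show at one reference position. It parses the SAM text directly using only the standard SAM columns and CIGAR operations. The helper names and the toy records are invented for illustration, and it does not filter on mapping or base quality.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def base_at(pos, start, cigar, seq):
    """Return the read base aligned to 1-based reference position pos, or None."""
    ref = start  # reference coordinate of the next reference-consuming base
    qi = 0       # 0-based index into the read sequence
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":            # consumes both reference and read
            if ref <= pos < ref + n:
                return seq[qi + (pos - ref)]
            ref += n
            qi += n
        elif op in "IS":           # consumes read only
            qi += n
        elif op in "DN":           # consumes reference only
            if ref <= pos < ref + n:
                return None        # read has a gap at this position
            ref += n
        # H and P consume neither reference nor read
    return None


def tally_bases(sam_lines, rname, pos):
    """Count read bases at position pos of reference rname."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue               # header line
        f = line.rstrip("\n").split("\t")
        if len(f) < 11:
            continue
        if int(f[1]) & 4 or f[2] != rname or f[5] == "*" or f[9] == "*":
            continue               # unmapped, other reference, or no sequence
        base = base_at(pos, int(f[3]), f[5], f[9])
        if base is not None:
            counts[base] += 1
    return counts


# Toy records (invented) standing in for lines of the SAM file.
toy = [
    "@SQ\tSN:mompS\tLN:351",
    "r1\t0\tmompS\t1\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tmompS\t2\t60\t2M1I2M\t*\t0\t0\tCTTTA\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tflaA\t1\t60\t4M\t*\t0\t0\tAAAA\tIIII",
]
toy_counts = tally_bases(toy, "mompS", 3)
print(dict(toy_counts))
```

On real data you would pass an open file handle instead of the toy list, for example `tally_bases(open("reads_vs_all_ref_filt.sam"), "mompS", pos)`, where `pos` is a site you picked out in IGV.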

el_gato is only able to produce ST predictions based on the data you give it. Assemblies may produce less accurate ST predictions as assemblies do not always include all the information that was present in the reads. It will be necessary to compare the alleles indicated by the reads to those present in the assembly to know for sure.

However, it sounds like your reads indicate that there may be 3 distinct mompS alleles present, while your result from using the assembly indicates that at most 2 were present in the assembly. Perhaps the assembler you used simply chose the majority base call and did not include one of those possible alleles?

One case that we have seen before with more than 2 mompS alleles was the result of a low-level contamination with a different Lp strain. The primary sample had 2 mompS alleles (~50% ratio in read pairs indicating each of these alleles) and the contaminant had a single mompS allele and represented ~10-20% of the reads.
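As a rough way to check whether your sample looks like that case, you could turn per-allele read-pair counts into fractions and see whether one allele sits well below the others. This is only a sketch: the counts are invented, the allele labels are placeholders, and the 25% cutoff is an assumption rather than anything el_gato uses.

```python
def allele_fractions(counts):
    """Convert per-allele read-pair counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


def flag_minor(fractions, max_minor=0.25):
    """Return alleles whose fraction is below max_minor (a hypothetical cutoff)."""
    return sorted(a for a, f in fractions.items() if f < max_minor)


# Invented counts resembling two major alleles plus a low-level third one.
toy_allele_counts = {"allele_1": 42, "allele_2": 41, "allele_3": 17}
fractions = allele_fractions(toy_allele_counts)
minor = flag_minor(fractions)
print(minor)
```

A single low-fraction allele alongside two roughly balanced ones would be consistent with contamination, but it does not prove it.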