Differences in read vs. assembly-based results

amilesj commented 1 month ago

Hello,

When running el_gato on ~700 Legionella sequences, we noted that >15% of the results differed depending on whether reads or assemblies were used as input. The read-based results generally had fewer Novel alleles and were more phylogenetically coherent than the assembly-based results, so we are proceeding with those as preferred. Is this expected behavior based on your testing? If so, it might benefit the community to note in the documentation that both reads and assemblies can be used as input but that reads are preferred.

Thanks! Arianna

Alan-Collins commented 1 month ago

Hi Arianna,

Thank you for the feedback and for raising this point. Yes, we would expect reads to produce better results. We will revise the language in the readme to reflect that reads should be used when possible.

Thank you! Alan

Alan-Collins commented 1 month ago

Hi Arianna,

Thinking more about your experience with el_gato, we'd like to make sure that the differences in your results reflect differences in the underlying assembly vs. read information and are not an indication of an issue with el_gato. Would you be willing to share any of the data that produce different results depending on operation mode? If you are unable to share the raw data, we could also try to diagnose the issue from the log files, if you are able to share those.

Thanks! Alan

amilesj commented 1 month ago

Hi Alan,

Sure! Here is data on a sampling of the discrepant ones: legionella_elgato_diff_ids.csv

The reads are already publicly available. In case it helps, the included assembly stats came from assemblies done with the CDC Phoenix pipeline, v2.1.1.

Best, Arianna

Alan-Collins commented 3 weeks ago

Hi Arianna,

I've gone through the data you provided and tested the samples for which you provided accessions for reads and assemblies. Most of the differences would be expected, but one of the samples revealed that we needed to add a new mapping reference for neuA. The explanations of the differences are as follows:

SRS18865210 and SRS19212402 were not discordant when I ran them. I assume the assembly you used, which was generated by the Phoenix pipeline, differs from the GenBank assembly.

SRS18865241, SRS18926887, SRS19212401, SRS19794104, and SRS19794107 were discordant because mompS was either missing or only part of the locus was included in the assembly. The whole locus was represented in the reads.

SRS17427831 had only the secondary locus of mompS in the assembly while the primary locus that is used for SBT was not present in the assembly. The reads included both loci so the reads-based operation correctly typed mompS, while the assembly produced the wrong result.

The SRS17427826 assembly only had part of mompS. For reads, the neuA allele was not identified properly because the mapping references we had in our database were not close enough to neuA 215 for our approach to work. I have added another neuA reference in the new release of el_gato (version 1.19.0), and you should now find that reads for this sample type properly.

Thank you for sharing those data with us! They are a good example of why we would recommend using reads when possible. Assemblers seem to struggle to figure out how to assemble the two mompS loci. When you run with reads, el_gato can typically resolve the two mompS loci without issue.
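For anyone who wants to tabulate which loci drive read vs. assembly differences in their own runs, here is a minimal Python sketch. It is not part of el_gato and does not parse el_gato output files: it assumes you have already loaded each mode's allele calls into a dictionary keyed by sample, and the function names, locus keys, sample names, and allele values below are all hypothetical placeholders.

```python
from collections import Counter

# The seven L. pneumophila SBT loci. These key names are an assumption;
# adjust them to match however your results are labeled.
LOCI = ["flaA", "pilE", "asd", "mip", "mompS", "proA", "neuA"]


def discordant_loci(reads_profiles, assembly_profiles, loci=LOCI):
    """Return {sample: [loci whose calls differ]} for samples present in both inputs."""
    diffs = {}
    shared = reads_profiles.keys() & assembly_profiles.keys()
    for sample in sorted(shared):
        reads_calls = reads_profiles[sample]
        assembly_calls = assembly_profiles[sample]
        changed = [
            locus for locus in loci
            if reads_calls.get(locus) != assembly_calls.get(locus)
        ]
        if changed:
            diffs[sample] = changed
    return diffs


def locus_counts(diffs):
    """Count how many samples are discordant at each locus."""
    return Counter(locus for changed in diffs.values() for locus in changed)


# Made-up example data (not real results): sample_2 differs only at mompS.
reads = {
    "sample_1": {"flaA": "2", "pilE": "3", "asd": "9", "mip": "10",
                 "mompS": "2", "proA": "1", "neuA": "6"},
    "sample_2": {"flaA": "2", "pilE": "3", "asd": "9", "mip": "10",
                 "mompS": "1", "proA": "1", "neuA": "6"},
}
assembly = {
    "sample_1": dict(reads["sample_1"]),
    "sample_2": {**reads["sample_2"], "mompS": "missing"},  # placeholder value
}

diffs = discordant_loci(reads, assembly)
print(diffs)
print(locus_counts(diffs))
```

If most of the discordant samples in such a tally differ only at mompS, that would be consistent with the assembly issue described above rather than a problem with the reads-based calls.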

appliedbinf / el_gato

Differences in read vs. assembly-based results #18