MrOlm / inStrain

Bioinformatics program inStrain
MIT License
149 stars 33 forks source link

Discrepancy in gene ID mapping on same positions in different samples #195

Open Sidduppal opened 1 month ago

Sidduppal commented 1 month ago

Hi,

First of all, thank you for developing this amazing tool!

I have encountered a discrepancy in the inStrain profile results regarding gene mapping at the same positions between different samples. Specifically:

Additionally, in sample SLG529, the same gene (SLG1007_DASTool_bins_61.fa_k141_12596_2) is mapped at positions immediately before and after position 331 (positions 301 and 332), but not at position 331.

I ran inStrain profile using the same representative MAGs but with reads from different samples.

This is just one example, but these discrepancies were seen at many other sites as well. I'm unsure whether I'm missing something in my analysis or if this is a potential bug. Any clarification would be greatly appreciated.

I've attached the relevant dataset for your reference.

Thank you!

SLG529.txt SLG1132.txt

MrOlm commented 1 week ago

Hello

Thanks for your question and for using inStrain! The discrepancy you’re seeing is likely due to differences in coverage or read alignment at position 331 between the samples. Low coverage in SLG529 might prevent the gene from being mapped at that position, even if it's mapped nearby.

Here’s what you can try:

1) Check the coverage at position 331 in SLG529 using the .cov file. 2) Confirm that the same -g and -s files were used for both samples. 3) Inspect the alignments at position 331 in a genome viewer (e.g., IGV).

If the issue persists, feel free to share more details!

Best, Matt

Sidduppal commented 1 week ago

Hey @MrOlm, thanks a lot for taking a look.

Unless I'm missing something the .cov file seems to give the coverage per gene and not for each position within a gene. I'm looking at genes_coverage.csv inside raw_datasubdirectory generated from the profile results. For the gene under consideration the coverage is 16.4.

Furthermore, it appears that the position_coverage at position 331 is equivalent if not higher to surrounding positions where the gene has been mapped correctly.

The same -g and -s files were used for both samples as well.

I also want to reiterate that this uneven mapping of genes was present at many other positions. I shared a subset of data but happy to share the full profile results over email, if you think that'll be helpful.

Cheers!

MrOlm commented 1 week ago

Ah- I see the problem now. In the case of SLG529, the variant at position 331 has 3 alleles detected (allele_count > 2). In this case inStrain does not attempt to classify the mutation (S, N, etc.) because it doesn't know what allele to use, resulting in it not attempting to link the gene.

This isn't great behavior and I'll try and fix in the next update, but this is what's going on.

Best, Matt

Sidduppal commented 1 week ago

Hey @MrOlm, I had a similar intuition however, after inspecting the data a bit more I found sites with a gene mapped and allele count more than 2. For example, positions 144 and 650 in the SLG1132.txt file attach above has allele count of 3 but also has genes mapped. Also, these positions seem be identify mutation_type as well, M and S respectively.

P.S. Issue was closed by mistake. I have re-oped it.

MrOlm commented 1 week ago

Mutation_type = M means that there are multiple genes at that position (https://instrain.readthedocs.io/en/latest/example_output.html)

In that sample, 650 has a gene mapped and 144 is M?

Sidduppal commented 1 week ago

Both positions 144 and 650 have genes mapped. The mutation_type is M at position 144 and S at 650. mutation_type is M at other positions where allele count is 2.

MrOlm commented 1 week ago

InStrain thinks 144 has two genes mapped at that location; could that be the case?

650 seems to be working correctly. What is the problem with 650 in that sample?

Sidduppal commented 1 week ago

So there is no problem per se with both of those sites. You mentioned above

In the case of SLG529, the variant at position 331 has 3 alleles detected (allele_count > 2). In this case inStrain does not attempt to classify the mutation (S, N, etc.) because it doesn't know what allele to use, resulting in it not attempting to link the gene.

I just pointed out that this is not always the case as there are sites where genes have been mapped and the allele count is greater than 2, eg site 144 and 650. Plus despite having more than 2 alleles inStrain has classified the mutation at these sites (under column mutation_type).