Open Sidduppal opened 1 month ago
Hello
Thanks for your question and for using inStrain! The discrepancy you’re seeing is likely due to differences in coverage or read alignment at position 331 between the samples. Low coverage in SLG529 might prevent the gene from being mapped at that position, even if it's mapped nearby.
Here’s what you can try:
1) Check the coverage at position 331 in SLG529 using the .cov file. 2) Confirm that the same -g and -s files were used for both samples. 3) Inspect the alignments at position 331 in a genome viewer (e.g., IGV).
If the issue persists, feel free to share more details!
Best, Matt
Hey @MrOlm, thanks a lot for taking a look.
Unless I'm missing something the .cov
file seems to give the coverage per gene and not for each position within a gene. I'm looking at genes_coverage.csv
inside raw_data
subdirectory generated from the profile
results. For the gene under consideration the coverage is 16.4.
Furthermore, it appears that the position_coverage
at position 331 is equivalent if not higher to surrounding positions where the gene has been mapped correctly.
The same -g
and -s
files were used for both samples as well.
I also want to reiterate that this uneven mapping of genes was present at many other positions. I shared a subset of data but happy to share the full profile results over email, if you think that'll be helpful.
Cheers!
Ah- I see the problem now. In the case of SLG529, the variant at position 331 has 3 alleles detected (allele_count > 2). In this case inStrain does not attempt to classify the mutation (S, N, etc.) because it doesn't know what allele to use, resulting in it not attempting to link the gene.
This isn't great behavior and I'll try and fix in the next update, but this is what's going on.
Best, Matt
Hey @MrOlm, I had a similar intuition however, after inspecting the data a bit more I found sites with a gene mapped and allele count more than 2. For example, positions 144 and 650 in the SLG1132.txt file attach above has allele count of 3 but also has genes mapped. Also, these positions seem be identify mutation_type
as well, M and S respectively.
P.S. Issue was closed by mistake. I have re-oped it.
Mutation_type = M means that there are multiple genes at that position (https://instrain.readthedocs.io/en/latest/example_output.html)
In that sample, 650 has a gene mapped and 144 is M?
Both positions 144 and 650 have genes mapped. The mutation_type
is M at position 144 and S at 650.
mutation_type
is M at other positions where allele count is 2.
InStrain thinks 144 has two genes mapped at that location; could that be the case?
650 seems to be working correctly. What is the problem with 650 in that sample?
So there is no problem per se with both of those sites. You mentioned above
In the case of SLG529, the variant at position 331 has 3 alleles detected (allele_count > 2). In this case inStrain does not attempt to classify the mutation (S, N, etc.) because it doesn't know what allele to use, resulting in it not attempting to link the gene.
I just pointed out that this is not always the case as there are sites where genes have been mapped and the allele count is greater than 2, eg site 144 and 650. Plus despite having more than 2 alleles inStrain has classified the mutation at these sites (under column mutation_type
).
Hi,
First of all, thank you for developing this amazing tool!
I have encountered a discrepancy in the
inStrain profile
results regarding gene mapping at the same positions between different samples. Specifically:SLG529
, no gene is mapped at position 331 on scaffoldSLG1007_DASTool_bins_61.fa_k141_12596
.SLG1132
, the geneSLG1007_DASTool_bins_61.fa_k141_12596_2
is mapped at the same position (331) on the same scaffold (SLG1007_DASTool_bins_61.fa_k141_12596
).Additionally, in sample
SLG529
, the same gene (SLG1007_DASTool_bins_61.fa_k141_12596_2
) is mapped at positions immediately before and after position 331 (positions 301 and 332), but not at position 331.I ran inStrain profile using the same representative MAGs but with reads from different samples.
This is just one example, but these discrepancies were seen at many other sites as well. I'm unsure whether I'm missing something in my analysis or if this is a potential bug. Any clarification would be greatly appreciated.
SLG529
:inStrain profile SLG529.bam representative_megamag.fa -o SLG529.IS -p 64 -g megamag_representative.fna -s scaffoldtobin.stb --database_mode
SLG1132
:inStrain profile SLG1132.bam representative_megamag.fa -o SLG1132.IS -p 64 -g megamag_representative.fna -s scaffoldtobin.stb --database_mode
I've attached the relevant dataset for your reference.
Thank you!
SLG529.txt SLG1132.txt