chrisquince / STRONG

Strain Resolution ON Graphs
MIT License

Query on output generated in results section #111

Closed ShriramHPatel closed 2 years ago

ShriramHPatel commented 3 years ago

Hi,

It was very nice to read and understand the STRONG paper. I have run it on some of my samples, and I am trying to understand the output generated in the "mag_and_haplo_cov.tsv" file in the results section.

I have some specific questions below:

  1. My understanding is that MAG coverages are normalized by the median SCG coverage. Am I right? I am analyzing a very low-diversity environment, and the low coverage observed for the MAGs made me think twice.

  2. Is it surprising that the coverage of a haplotype is higher than that of its MAG? This is a bit confusing to me. For example, in the table below, the Bin_1_haplo_0 coverage (1.4) is higher than the Bin_1 coverage (0.7). Can you provide more insight into the normalization methods used here?

  3. Also, do you recommend using the haplotype abundances (Bin_*F_intensity file) generated directly by BayesPaths, converting intensity to coverage by multiplying by the read length?

Apologies for the basic questions; this has been a great tool, but before getting excited I wanted to be sure I understand the output I'm working with!

Thanks, Shriram

|   | sample1 | sample10 | sample11 | sample12 | sample13 | sample14 | sample15 | sample16 | sample17 | sample18 | sample19 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bin_1 | 0.7 | 0.7 | 0.75 | 0.00021 | 0.00027 | 0.00019 | 0.00021 | 0.00024 | 0.00019 | 0.00029 | 0.00025 |
| Bin_1_haplo_0 | 1.4 | 1.4 | 1.5 | NA | NA | NA | NA | NA | NA | NA | NA |
| Bin_1_haplo_1 | 0.5 | 0.71 | 0.62 | NA | NA | NA | NA | NA | NA | NA | NA |
Sebastien-Raguideau commented 3 years ago

Hi Shriram,

  1. Yes, that is the case. You can find the exact values in the Normalisation.tsv file in the profile folder.

  2. What you're seeing there is the discrepancy between coverage calculated from read mapping on the whole MAG versus coverage derived from the kmer coverage on the SCGs. While both should give concordant values, we consistently see that the sum of the haplotype coverages is higher than that of the MAG. Exactly where this comes from is hard to tell at the moment; it could simply be an effect of estimating coverage from only the SCGs instead of the whole MAG.

  3. You could do that, but you would just be repeating what we do when we generate "mag_and_haplo_cov.tsv". Also, Intensity*read_length gives you kmer coverage, not read coverage, and it would be un-normalised too.
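For point 3, the intensity-to-coverage conversion could be sketched as below. This is a minimal sketch, not STRONG's own code: the read length, kmer size, and intensity value are hypothetical, and the kmer-to-read conversion uses the standard approximation rather than anything the pipeline itself applies.

```python
# Hedged sketch of converting a BayesPaths intensity to approximate read
# coverage. Per the thread, intensity * read_length gives kmer coverage.
read_length = 150   # hypothetical read length
k = 77              # hypothetical assembly kmer size
intensity = 0.004   # hypothetical value from a Bin_*F_intensity file

kmer_cov = intensity * read_length
# Standard relation: kmer coverage ~= read coverage * (L - k + 1) / L,
# so invert it to recover an approximate read coverage.
read_cov = kmer_cov * read_length / (read_length - k + 1)
print(read_cov)
```

Note this still leaves the value un-normalised; to compare with "mag_and_haplo_cov.tsv" you would additionally divide by the per-sample normalisation factor.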

These are completely legitimate questions, and I hope this makes things clear :)

Best, Seb

ShriramHPatel commented 2 years ago

Hi,

Thank you for your detailed reply; it is very useful. I have a few more related queries.

  1. Looking at the table below, we identified 6 haplotypes for Bin_18. Although the coverage of Bin_18 in sample1 is 0.46, the identified haplotypes have very low coverage there. In contrast, similar haplotype coverages are seen in sample2, even though Bin_18 has somewhat lower coverage (0.1), while in sample3 haplo_5 is the most prevalent. Am I right to interpret that all 6 haplotypes of Bin_18 are present in all samples? Or is there a general coverage cut-off below which a haplotype should not be considered present?
|   | sample1 | sample2 | sample3 |
| --- | --- | --- | --- |
| Bin_18 | 0.46 | 0.1 | 0.54 |
| Bin_18_haplo_0 | 0.0031 | 0.046 | 0.012 |
| Bin_18_haplo_1 | 0.02 | 0.07 | 0.011 |
| Bin_18_haplo_2 | 0.014 | 0.029 | 0.006 |
| Bin_18_haplo_3 | 0.013 | 0.063 | 0.0041 |
| Bin_18_haplo_4 | 0.007 | 0.033 | 0.0036 |
| Bin_18_haplo_5 | 0.012 | 0.013 | 1.4 |
  2. Since the BayesPaths algorithm depends on differences in strain profiles across samples in order to link unitigs into haplotypes, can STRONG be used in a case where the samples are unrelated but derived from the same niche (for example, the vaginal microbiome)?

  3. We also have access to genomes cultured and sequenced from the same metagenomic samples. Can the evaluation step of STRONG be used to link the individual haplotypes identified to this set of genomes? If yes, are there any specific input requirements?

Many thanks, Shriram

Sebastien-Raguideau commented 2 years ago

Hi Shriram,

  1. I'm not convinced from that table that the coverages are low, since the table shows normalised coverage. If these were actual coverages, where 0.0036 would mean that only 0.36% of the SCGs are covered, I would be worried too. You can look at the un-normalised coverage by re-multiplying these numbers by the values in the profile/Normalisation.tsv file. I suppose the concern behind your question relates to the uncertainty of the coverage estimates. We do have that information; it is stored in the bayespath//F_varIntensity.csv files, and the bars in haplotypes_coverage.pdf are derived from it.

  2. This is a good question, and I must admit I haven't tested STRONG on non-time-series datasets myself. It will boil down to the fraction of haplotypes shared between samples. If none are shared, while having more than one per sample, you will see results similar to running STRONG on a per-sample basis. It's unclear whether the haplotypes will be well separated. I would try it anyway; in other biomes we've seen MAGs at 99% ANI across numerous different patients/reactors.

  3. This is a mode we used for our own evaluation of STRONG; it is poorly documented at the moment, and even more unwieldy than the rest of the pipeline :) I will update the documentation shortly on how to use it.
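The de-normalisation suggested in point 1 can be sketched as below. This is a hedged sketch: the normalisation factors and coverage values are hypothetical stand-ins for what you would parse out of profile/Normalisation.tsv and mag_and_haplo_cov.tsv (e.g. with the csv module or pandas), not values STRONG actually produced.

```python
# Hypothetical per-sample normalisation factors (median SCG coverage),
# standing in for the contents of profile/Normalisation.tsv.
norm = {"sample1": 120.0, "sample2": 85.0}

# Hypothetical normalised coverages, standing in for rows of
# mag_and_haplo_cov.tsv.
cov = {
    "Bin_18":         {"sample1": 0.46,   "sample2": 0.1},
    "Bin_18_haplo_0": {"sample1": 0.0031, "sample2": 0.046},
}

# Re-multiply each sample column by its normalisation factor to recover
# the un-normalised coverage.
raw_cov = {
    name: {s: v * norm[s] for s, v in row.items()}
    for name, row in cov.items()
}
print(raw_cov["Bin_18"]["sample1"])  # un-normalised coverage of Bin_18 in sample1
```

Comparing these un-normalised values against a fixed coverage cut-off would be one way to decide presence/absence per sample, though the thread does not prescribe a threshold.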

Best, Seb

Sebastien-Raguideau commented 2 years ago

Hi Shriram, I added a paragraph on the exact requirements for the evaluation part of the pipeline to run. This part needs refactoring. Best, Seb

ShriramHPatel commented 2 years ago

I appreciate your time and effort in answering my queries. And thanks a million for updating the manual; the usage instructions are very clear now. Regards, Shriram