chrisquince / STRONG

Strain Resolution ON Graphs
MIT License

Question about comparing haplotypes generated between two different STRONG runs #116

Closed ShriramHPatel closed 2 years ago

ShriramHPatel commented 2 years ago

Hi,

We have between 6 and 12 longitudinal samples from multiple subjects, and we are planning to run STRONG on a per-subject basis for haplotype resolution. After that, to find out which haplotypes are shared between subjects, we are planning to compare "concatenated_cogs.msa" (or something along those lines from the result/{mag}/tmp section) for the MAGs classified to the same species.

We know that this could get complicated and that we may have a long way to go to accomplish it. So it would be great to know how you would advise us to proceed in that case.

We are not looking to co-assemble all samples because A) some of the subjects could have completely distinct microbiomes and B) co-assembly of >400 samples would be very resource intensive.

On a related note, I have seen that some of the COGs (in result/{mag}/tmp) have no haplotypes resolved in them. Does that mean the sequences for those COGs are identical across the haplotypes identified on other COGs?

Apologies for the unrelated questions, but I think that others with the same questions will potentially benefit from your advice!

Thank you very much, Shriram

Sebastien-Raguideau commented 2 years ago

Hi Shriram,

This does seem like a sensible plan.

Though I would not use the Concatenated_cogs.msa file directly, since it does not contain all of the COGs. Ideally you would want a file with all 36 COGs, always in the same order, with each missing COG replaced by a run of Ns the length of that COG.

Instead I would start from haplotypes_cogs.fna to regenerate something like Concatenated_cogs.msa but with Ns for missing COGs. Mapping of COG to AA length can be found here.
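As a rough illustration of that padding step, here is a minimal Python sketch. It is not part of STRONG: the function name `concatenate_cogs`, the input layout (a dict of haplotype to per-COG sequences, which you would have to build yourself by parsing haplotypes_cogs.fna) and the COG names in the usage note are all assumptions made for the example. It also assumes the padding length in nucleotides is 3 times the AA length from the mapping, and it does not align anything, so sequences of a given COG need to be aligned beforehand if the columns are to line up across haplotypes.

```python
def concatenate_cogs(hap_seqs, cog_order, cog_aa_len, fill="N"):
    """Concatenate per-COG sequences in a fixed order, padding missing COGs.

    hap_seqs   : {haplotype_id: {cog_id: nucleotide_sequence}} (hypothetical layout)
    cog_order  : list of all COG ids, in the order to use for every haplotype
    cog_aa_len : {cog_id: length in amino acids}
    fill       : character used for missing COGs
    """
    concatenated = {}
    for hap_id, per_cog in hap_seqs.items():
        parts = []
        for cog in cog_order:
            seq = per_cog.get(cog)
            if seq is None:
                # Assumption: nucleotide length = 3 x AA length
                seq = fill * (3 * cog_aa_len[cog])
            parts.append(seq)
        concatenated[hap_id] = "".join(parts)
    return concatenated


def write_fasta(records, path):
    """Write {id: sequence} to a FASTA file, one sequence per line."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n{seq}\n")
```

For instance, with made-up COG names `COG_A`, `COG_B`, `COG_C` of 2, 1 and 2 amino acids, a haplotype that only has `COG_A` comes out as its `COG_A` sequence followed by 9 Ns. Using the same `cog_order` and lengths for every subject is what makes the resulting files comparable between runs.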

It's possible for COGs to be ignored by bayespath while still being annotated in the MAG. This can be because the COG was too small, or present with no variants. The current pipeline only focuses on sequences resolved by bayespath and will ignore the others. So yes, if you want to obtain more COG sequences, it is a good idea to check the tmp folder and take the COG sequence from the MAG.

Best, Seb

ShriramHPatel commented 2 years ago

That was really helpful! Thanks a mil. Shriram