bhattlab / MGEfinder

A toolbox for identifying mobile genetic element (MGE) insertions from short-read sequencing data of bacterial isolates.
MIT License

Categories of confidence level for the identified insertion sequence #13

Bowmore12 closed 4 years ago

Bowmore12 commented 4 years ago

Hi, durrantmm.

Could you tell us what IAwoC and ArSC mean?

I found four labels (IAwFC, IAwoC, IDB, ArSC) used to specify the confidence level of each identified insertion sequence in the file 02.genotype..tsv. The user manual explains IAwFC and IDB, but I could not find the other two.

My goal is to genotype strains based on polymorphism in the insertion positions of MGEs, using resequencing data.

Additionally, if possible, could you recommend any tools for strain genotyping analysis based on 02.genotype..tsv? I especially want to know which strain belongs to which cluster consist of strains harboring an identical MGE profile.

Many thanks for your kind support.

durrantmm commented 4 years ago

Hello!

IAwoC means "Inferred from assembly without context", meaning that it found a predicted element in the assembly, but it wasn't in the expected genomic context. This is common if the element assembled alone on its own contig, without anything flanking it, which often happens when there are multiple copies of the element.

ArSC means "Ambiguous - Resolved by Site Comparison." In some cases, the identity of the element is ambiguous because multiple element clusters share the same terminal ends. MGEfinder then compares that specific genomic site across all other sites in the sample to predict the identity of the inserted element, assuming that it may have been inherited by descent.

I think that the 02.genotype file should work for your purposes. You may want to apply some additional filters to the genotypes. For example, you may want to discard any insertions that only appear in one isolate, or you could use some other allele frequency cutoff. Or you could limit the genotypes only to the elements you are confident are real. I am not sure what you mean by "I especially want to know which strain belongs to which cluster consist of strains harboring an identical MGE profile." Maybe you could try building a phylogenetic tree based on the insertions? You'll have to give me more detail.
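The filters suggested above could be sketched roughly as follows. This is only an illustration, not MGEfinder's own code: it assumes the genotype table has already been reduced to a mapping from each insertion site to the set of isolates carrying it. The site labels, isolate names, and the `filter_insertions` helper are all hypothetical, and the actual column layout of the 02.genotype file is not assumed here.

```python
def filter_insertions(site_to_isolates, n_isolates, min_count=2, min_freq=0.0):
    """Keep insertion sites carried by at least `min_count` isolates and
    present at a frequency of at least `min_freq` among all isolates."""
    kept = {}
    for site, isolates in site_to_isolates.items():
        count = len(isolates)
        if count >= min_count and count / n_isolates >= min_freq:
            kept[site] = set(isolates)
    return kept


# Hypothetical toy data: insertion site -> isolates carrying the insertion.
sites = {
    "contig1:1200": {"A", "B", "C"},
    "contig1:5400": {"A"},          # singleton, dropped by min_count=2
    "contig2:300": {"B", "C"},
}

# Discard insertions that appear in only one isolate.
kept = filter_insertions(sites, n_isolates=4, min_count=2)

# Alternatively, apply a frequency cutoff (here, at least 60% of isolates).
kept_common = filter_insertions(sites, n_isolates=4, min_count=1, min_freq=0.6)
```

Which cutoff is appropriate depends on the dataset; the values above are placeholders.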

Bowmore12 commented 4 years ago

Hi, again.

Thanks for your detailed explanation of IAwoC and ArSC. I understand them well now.

Yes, I want to reconstruct a phylogenetic tree based on the insertion positions and identify groups of clustered strains from the tree.

Following your kind advice, I was able to build a phylogenetic tree. I extracted the insertion positions from 02.genotype and used the site http://insilico.ehu.es/ to visualize the tree. I can also see clustered-strain groups, which may reflect the identity of the strains. I hope there are differences between IS-based clustering and SNP-based clustering, and that this makes sense for my research purpose.
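For readers who want to do the grouping step offline, a minimal sketch is given below. It assumes each strain has already been reduced to a set of insertion positions. The strain names, site labels, and both helper functions are hypothetical, and Jaccard distance is just one reasonable choice for presence/absence data, not something the thread prescribes.

```python
from collections import defaultdict
from itertools import combinations


def group_by_profile(strain_to_sites):
    """Group strains that carry exactly the same set of insertion sites."""
    groups = defaultdict(list)
    for strain, sites in strain_to_sites.items():
        groups[frozenset(sites)].append(strain)
    return sorted(sorted(members) for members in groups.values())


def jaccard_distance(a, b):
    """1 - |intersection| / |union|; two empty profiles count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy data: strain -> set of insertion positions.
profiles = {
    "strain1": {"contig1:1200", "contig2:300"},
    "strain2": {"contig1:1200", "contig2:300"},
    "strain3": {"contig1:1200"},
    "strain4": {"contig3:900"},
}

# Strains with an identical MGE profile end up in the same group.
clusters = group_by_profile(profiles)

# Pairwise distances, usable as input to a tree-building or clustering tool.
distances = {
    (s, t): jaccard_distance(profiles[s], profiles[t])
    for s, t in combinations(sorted(profiles), 2)
}
```

Exact-profile grouping answers "which strains share an identical MGE profile", while the distance table is the kind of input a distance-based tree or hierarchical clustering method would take.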

I sincerely appreciate your kind support. Many thanks!

durrantmm commented 4 years ago

Great, glad to hear that!