MBHallgren / MINTyper

get SNPs that are taken into account when performing the distance matrix #8

Closed alexandreflageul closed 5 months ago

alexandreflageul commented 5 months ago

Hi everyone. First, thank you for the tool; it has been very useful.

Recently I got a question from a biologist to whom I was showing the distance matrix of several bacterial strains, in which two strains were 50 SNPs apart and two other strains were 53 SNPs apart. He asked me: "Can you give me the locations of the SNPs in these two comparisons?" I said, "Sure, no problem," took the VCF files produced by MINTyper, and tried to extract the SNPs that differed between the two files, but I was not able to do it. When I compare the two VCF files, the difference between them is not 50 SNPs but rather 400 SNPs. Reading the paper again helped me understand that the tool removes a certain amount of data before the distance matrix is computed, but how can I get access to the SNPs that are taken into account in the result?

Alexandre

alexandreflageul commented 5 months ago

@MBHallgren

MBHallgren commented 5 months ago

Hi Alexandre,

Thanks for using the tool :)

Yes, this is indeed an issue, thanks for pointing it out. The VCFs are written during the alignment against the reference template, which is inconvenient when some of those variant positions are trimmed or filtered out during the distance matrix calculation. Also, an important note: these are the variant positions between each sample and the reference, not between the individual samples.
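To illustrate the note above, here is a minimal sketch of how per-sample calls against a shared reference could be turned into a raw pairwise comparison. It is not MINTyper or ccPhylo code: the function names and the toy VCF records are hypothetical, only single-base substitutions are considered, and no trimming or filtering is applied, so the count it gives will generally be larger than the value in the distance matrix.

```python
def read_snvs(vcf_text):
    """Return {(chrom, pos): alt_base} for single-base substitutions in VCF text."""
    snvs = {}
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        if len(ref) == 1 and len(alt) == 1 and alt != ".":
            snvs[(chrom, pos)] = alt
    return snvs


def pairwise_differences(snvs_a, snvs_b):
    """Sites where two samples carry different bases, given calls against one reference.

    A site missing from a sample's calls is treated as matching the reference.
    """
    sites = set(snvs_a) | set(snvs_b)
    return sorted(site for site in sites if snvs_a.get(site) != snvs_b.get(site))


# Hypothetical toy records, tab-separated as in a real VCF body.
VCF_A = "\n".join([
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
    "chr1\t300\t.\tG\tA",
])
VCF_B = "\n".join([
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tG",
    "chr1\t400\t.\tT\tC",
])

diff = pairwise_differences(read_snvs(VCF_A), read_snvs(VCF_B))
print(diff)
```

Position 100 drops out because both samples carry the same variant relative to the reference; positions 200, 300, and 400 remain because the two samples disagree there.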

I can get ccPhylo (the trimming and filtering tool) to output the variants included in the matrix in this format:

(1, 0) A2398665G
(2, 0) T2397011C
(2, 0) T3237431A
(2, 0) T3764039G
(2, 1) G2395368A
(2, 1) T2398235C
(2, 1) T3238924A
(2, 1) T3766085G

I'll write a function that replaces the coordinates with the sample names to make it more readable. Hopefully, this solves your issue :)
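A rough sketch of what such a relabelling step might look like, assuming the format shown above and assuming the two numbers are zero-based row and column indices into the distance matrix. The function name and the sample names are hypothetical, and the thread does not state which base belongs to which sample, so the two bases are kept in their written order.

```python
import re

# Matches entries such as "(2, 1) G2395368A": a sample-index pair followed by
# a base, a position, and a second base.
PAIR_SNV = re.compile(r"\((\d+),\s*(\d+)\)\s+([ACGTN])(\d+)([ACGTN])")


def label_variants(text, sample_names):
    """Replace matrix coordinates with sample names.

    Returns tuples of (first_sample, second_sample, position, first_base, second_base).
    """
    records = []
    for i, j, first_base, pos, second_base in PAIR_SNV.findall(text):
        records.append(
            (sample_names[int(i)], sample_names[int(j)], int(pos), first_base, second_base)
        )
    return records


# Hypothetical sample names, in the same order as the matrix rows.
names = ["sampleA", "sampleB", "sampleC"]
raw = "(1, 0) A2398665G (2, 0) T2397011C (2, 1) G2395368A"

labelled = label_variants(raw, names)
for first, second, pos, base1, base2 in labelled:
    print(f"{first}\t{second}\t{pos}\t{base1}\t{base2}")
```

Because the pattern is applied with `findall`, the same function works whether the entries arrive on one line or one per line.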

On another note, what bacterial species are you working with? I've been developing another phylogenetic tool that uses core genes only to estimate phylogeny. If the species you are researching is on this list (https://www.cgmlst.org/ncs), please let me know if you want to try out the new tool.

Best regards, Malte

MBHallgren commented 5 months ago

I have released version 1.1.2 here on GitHub and on PyPI (pip install mintyper). It produces a file, matrix_SNVs.txt, which lists the variant positions between the samples in the distance matrix.

I hope this is what you were looking for; if not, let me know :)

Best regards, Malte

alexandreflageul commented 5 months ago

Hi, thanks a lot Malte. I'll try the new version.

Regarding the other phylogenetic tool, I mostly work with Salmonella and E. coli, but this specific case was about Enterococcus cecorum. I like working with SNPs rather than core genomes because it lets me draw more precise conclusions. For example, I had a case where several Salmonella strains had the same complex type, but after SNP analysis I was able to make more precise clusters than by looking at core genomes alone.

Again, thank you for the tool modification.

MBHallgren commented 5 months ago

Hi Alexandre,

That sounds interesting! The other tool is also SNP-based, but it only derives phylogeny based on the SNPs in the core genomes. That way, we avoid incorrectly or accidentally including any mobile element that might have inserted itself into one of the sequences. Naturally, the downside is that we only consider ~25% of the genome, but often, this is plenty to precisely estimate the phylogenetic relationship.
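As an illustration of the core-genome idea only (this is not the other tool's code), restricting SNPs to core regions can be sketched as an interval filter. The interval coordinates and function names below are hypothetical, and the intervals are assumed to be sorted, non-overlapping, and inclusive at both ends.

```python
from bisect import bisect_right


def in_core(position, core_intervals):
    """True if position falls inside one of the sorted, non-overlapping intervals."""
    starts = [start for start, _ in core_intervals]
    idx = bisect_right(starts, position) - 1  # last interval starting at or before position
    return idx >= 0 and position <= core_intervals[idx][1]


def core_snps(positions, core_intervals):
    """Keep only the SNP positions that lie inside a core region."""
    return [p for p in positions if in_core(p, core_intervals)]


# Hypothetical core-gene coordinates on a single reference sequence.
core = [(100, 500), (1000, 1500)]
snp_positions = [50, 120, 700, 1000, 1500, 1501]

kept = core_snps(snp_positions, core)
print(kept)
```

A SNP at position 700 in this toy example, for instance inside an inserted mobile element between two core genes, is dropped before any distance is computed.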

Best regards, Malte

alexandreflageul commented 5 months ago

So, I tested the new version, and yes, that is what I wanted, so thank you.

I would be glad to test the new tool.

alexandreflageul commented 5 months ago

Would you mind adding a bit of extra information to the README about how the SNPs are filtered out by ccPhylo? Even after reading the paper, it is still a bit unclear to me.

MBHallgren commented 5 months ago

I'll add this to my to-do list and hopefully get around to it soon :)

Best regards, Malte

alexandreflageul commented 4 days ago

Hello again @MBHallgren, I hope you are well. I am following up on my earlier request to amend the README with more explanation of how SNPs are selected for inclusion in the final distance matrix. The explanation in the paper is not very clear, and I need more detail to understand some confusing results obtained with two slightly different datasets of Ion Torrent reads I am working with. Best regards, Alexandre Flageul