Hi Florent,
This can often be caused by a low-quality sample (or more than one) which our QC for poppunk_assign isn't always good at picking up.
Can you try running the new samples through in smaller batches (of say 10-100) and see if this keeps happening, and if that identifies a problem sample or two?
Hi John, thanks for the suggestion, I'll try that.
Hi John,
I finally found the time to follow up on this.
Because the dataset used above had its own issues (weird patterns of over-estimated distances), I chose to make and use another reference database, using >4000 V. cholerae genomes of diverse origin (but with a large fraction of very similar ones). For all commands below I used executables from PopPUNK v.2.4.0 installed through a conda environment that has, among others, these packages:
poppunk 2.4.0 py39h8884e85_2 bioconda
pp-sketchlib 1.7.4 py39h2d76373_2 conda-forge
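For reproducibility, an equivalent environment can be created along these lines (the environment name is arbitrary):

```bash
# Hypothetical environment name; package versions and channels as listed above
conda create -n poppunk240 -c bioconda -c conda-forge poppunk=2.4.0 pp-sketchlib=1.7.4
conda activate poppunk240
```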
This new database was built with poppunk --create-db followed by poppunk --fit-model, with the following parameters:
--qc-filter prune --retain-failures -D 15 --max-a-dist 0.99 --max-pi-dist 0.35 --length-range 3000000 6000000 --min-k 13 --max-k 35 --k-step 2 --sketch-size 100000
and the fit was then refined with poppunk --fit-model refine.
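For concreteness, the build went roughly as follows; the list file, output directory and thread count are placeholders, and the initial model type (dbscan, since -D caps the number of DBSCAN clusters) is my assumption:

```bash
# Sketch the >4000 assemblies listed in rfiles.txt ("name<TAB>path" per line)
poppunk --create-db --r-files rfiles.txt --output vc_db --threads 16 \
    --qc-filter prune --retain-failures --length-range 3000000 6000000 \
    --max-a-dist 0.99 --max-pi-dist 0.35 \
    --min-k 13 --max-k 35 --k-step 2 --sketch-size 100000

# Initial 2-D distance model (model type assumed), then boundary refinement
# starting from the model saved in vc_db
poppunk --fit-model dbscan --ref-db vc_db --D 15 --threads 16
poppunk --fit-model refine --ref-db vc_db --threads 16
```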
I obtained 1140 strain clusters in the refined database.
You can see that in this database the distances are completely fine, with a nice a ~ pi linear relationship.
I then ran the same query dataset described above (2214 genomes) against this new database. The poppunk_assign command was run with default parameters except for --qc-filter 'prune', meaning the distance QC uses --max-a-dist 0.5 --max-pi-dist 0.5.
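The command was along these lines (query list, database and output names are placeholders):

```bash
# qfile.txt lists the 2214 query genomes ("name<TAB>assembly path" per line)
poppunk_assign --db vc_db --query qfile.txt --output vc_query --threads 30 \
    --qc-filter prune
```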
This again led to a clumping artefact, although not a complete one as I used to experience with my old database:
Clusters 1,2,4,6,8,9,12,27,55,76,77,80,95,97,121,127,129,133,136,141,145,168,176,183,201,211,221,223,239,287,290,291,292,296,308,313,325,326,329,336,356,366,370,371,399,403,410,420,435,561,570,572,573,578,579,586,596,606,608,609,615,635,646,651,655,699,728,740,747,778,867,871,873,875,922,982,1006,1008,1030,1049,1073,1074,1075,1077,1078,1079,1083,1084,1086,1088,1092,1095,1096,1100 have merged into 1_2_4_6_8_9_12_27_55_76_77_80_95_97_121_127_129_133_136_141_145_168_176_183_201_211_221_223_239_287_290_291_292_296_308_313_325_326_329_336_356_366_370_371_399_403_410_420_435_561_570_572_573_578_579_586_596_606_608_609_615_635_646_651_655_699_728_740_747_778_867_871_873_875_922_982_1006_1008_1030_1049_1073_1074_1075_1077_1078_1079_1083_1084_1086_1088_1092_1095_1096_1100
That’s still 94 clusters (out of 1140) merging into one, so I doubt this is normal behaviour.
This merger brings together the strains from distinct V. cholerae lineages 7PET (Cluster 1 & 2) and Classical (Cluster 6), indicating it's not just "as it should be".
So there is still a good reason to complain!
Then I did as you recommended and ran the queries in batches of 50 (a rough sketch of the batching loop is shown after the log excerpts below). Indeed, it seems that only a few genomes are to blame for the clumping reported above:
1051-1100.log:Clusters 1,8,12,27,55,76,77,80,95,97,121,127,129,133,136,141,145,168,176,183,201,211,221,223,239,287,290,291,292,296,308,313,325,326,329,336,356,366,370,371,399,403,420,435,561,570,572,573,578,579,586,596,606,608,609,615,635,646,651,655,699,728,740,778,867,871,875,922,1006,1008,1030,1049,1073,1074,1075,1078,1079,1083,1084,1086,1088,1092,1095,1100 have merged into 1_8_12_27_55_76_77_80_95_97_121_127_129_133_136_141_145_168_176_183_201_211_221_223_239_287_290_291_292_296_308_313_325_326_329_336_356_366_370_371_399_403_420_435_561_570_572_573_578_579_586_596_606_608_609_615_635_646_651_655_699_728_740_778_867_871_875_922_1006_1008_1030_1049_1073_1074_1075_1078_1079_1083_1084_1086_1088_1092_1095_1100
1051-1100.log:Clusters 9,410 have merged into 9_410
1101-1150.log:Clusters 9,410 have merged into 9_410
1151-1200.log:Clusters 9,410 have merged into 9_410
1401-1450.log:Clusters 9,410 have merged into 9_410
251-300.log:Clusters 1,1077 have merged into 1_1077
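For reference, the batching was essentially a loop of this kind (file and database names are placeholders):

```bash
# Split the query list into batches of 50 and assign each batch separately,
# logging each run so the "merged into" messages can be grepped afterwards
split -l 50 -d --additional-suffix=.txt qfile.txt batch_
for batch in batch_*.txt; do
    out="assign_${batch%.txt}"
    poppunk_assign --db vc_db --query "$batch" --output "$out" \
        --threads 30 --qc-filter prune > "${out}.log" 2>&1
done
grep -H "have merged into" assign_batch_*.log
```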
Batch 1051-1100 alone is responsible for the clumping of 86 clusters (interestingly, it does not include cluster 2 or 6!). The other mergers, 9_410 and 1_1077, recur across batches, suggesting these clusters are genuinely meant to be merged, as they probably represent closely related strains.
So I ran a more fine-grained search for the culprit genome(s) in batch 1051-1100: it turns out a single query genome is to blame!
1051.log:Clusters 1,8,12,27,55,76,77,80,95,97,121,127,129,133,136,141,145,168,176,183,201,211,221,223,239,287,290,291,292,296,308,313,325,326,329,336,356,366,370,371,399,403,420,435,561,570,572,573,578,579,586,596,606,608,609,615,635,646,651,655,699,728,740,778,867,871,875,922,1006,1008,1030,1049,1073,1074,1075,1078,1079,1083,1084,1086,1088,1092,1095,1100 have merged into 1_8_12_27_55_76_77_80_95_97_121_127_129_133_136_141_145_168_176_183_201_211_221_223_239_287_290_291_292_296_308_313_325_326_329_336_356_366_370_371_399_403_420_435_561_570_572_573_578_579_586_596_606_608_609_615_635_646_651_655_699_728_740_778_867_871_875_922_1006_1008_1030_1049_1073_1074_1075_1078_1079_1083_1084_1086_1088_1092_1095_1100
1086.log:Clusters 9,410 have merged into 9_410
Querying genome 1051 on its own leads to the clumping of 84 clusters. Interestingly, the larger the batch it is queried in, the more clusters get clumped: the bad genome alone clumps 84 clusters; within a 50-genome batch, 86 clusters; within the full 2214-genome batch, 94 clusters.
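For completeness, the per-genome search was just the same loop with one genome per query file (again with placeholder names; batch_21.txt stands in for the 1051-1100 batch):

```bash
# Re-assign each genome of the suspect batch on its own to pin down the culprit
while IFS=$'\t' read -r name path; do
    printf '%s\t%s\n' "$name" "$path" > single.txt
    poppunk_assign --db vc_db --query single.txt --output "single_${name}" \
        --threads 30 --qc-filter prune > "single_${name}.log" 2>&1
done < batch_21.txt
grep -H "have merged into" single_*.log
```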
Looking at the QC metrics for this genome, it should indeed not have been included: its length is 2.8 Mbp, and a Kraken search shows it is definitely not V. cholerae:
Total,3064343
Unclassified,88.34
"Staphylococcus aureus",3.40
[...]
"Vibrio cholerae”,0.03
So it is indeed crucial to only include vetted genomes! But this genome should never have passed the distance QC!
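(For the record, the length check is a one-liner of this kind, assuming an uncompressed FASTA and a placeholder file name:)

```bash
# Total assembly length: sum of all non-header line lengths
awk '!/^>/ { len += length($0) } END { print len }' query_1051.fasta
```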
But looking at the log:
Loading previously refined model
Completed model loading
Sketching 1 genomes using 1 thread(s)
Progress (CPU): 1 / 1
Writing sketches to file
Calculating distances using 30 thread(s)
Progress (CPU): 100.0%
Selected type isolate for distance QC is GCA_000016245.1_ASM1624v1
WARNING: Did not find samples to remove:
ELPGI_429
Couldn't find ELPGI_429 in database
Pruned from the database after failing distance QC: ELPGI_429
Network loaded: 1284 samples
Clusters 1,8,12,27,55,76,77,80,95,97,121,127,129,133,136,141,145,168,176,183,201,211,221,223,239,287,290,291,292,296,308,313,325,326,329,336,356,366,370,371,399,403,420,435,561,570,572,573,578,579,586,596,606,608,609,615,635,646,651,655,699,728,740,778,867,871,875,922,1006,1008,1030,1049,1073,1074,1075,1078,1079,1083,1084,1086,1088,1092,1095,1100 have merged into 1_8_12_27_55_76_77_80_95_97_121_127_129_133_136_141_145_168_176_183_201_211_221_223_239_287_290_291_292_296_308_313_325_326_329_336_356_366_370_371_399_403_420_435_561_570_572_573_578_579_586_596_606_608_609_615_635_646_651_655_699_728_740_778_867_871_875_922_1006_1008_1030_1049_1073_1074_1075_1078_1079_1083_1084_1086_1088_1092_1095_1100
So this genome does induce extreme distances, and although it fails the filter and should be removed, the clumping artefact still occurs. Notice the error saying "Couldn't find ELPGI_429 in database". This may hint at a bug where poppunk_assign attempts to get rid of the problematic genome but searches for it in the wrong place!
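A quick way to check whether the flagged query nevertheless made it into the results, assuming the usual <output>_clusters.csv naming of the poppunk_assign output (paths are placeholders):

```bash
# If distance QC really pruned ELPGI_429, it should not appear in the cluster assignments
grep -c "ELPGI_429" single_1051/single_1051_clusters.csv
```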
I hope this gives us a good lead for solving the issue!
Please let me know what you think.
Cheers,
Florent
Hi Florent,
Thanks for the detailed investigation! That's really helpful, and I think I have some ideas about how to better deal with this in future. I'm going to add this to my to-do list for v2.5.0, so I hope that in the next release this won't be an issue.
In more detail, I think there are two possible ways this can happen, which may happen separately or together:
For 1), we need to fix the distance QC as you say, and check for cluster linking and for too many zero distances. For 2), it is a little more difficult, but I expect that similar measures will be able to prune these isolates from the network (while still allowing assignments) without needing a re-fit.
Thanks, John, for the rapid response.
I agree with your analysis suggesting a combination of two problems, including something to do with the boundary.
This echoes some of the findings and questions that my teammates (Avril, Astrid) and I have been gathering by experimenting with PopPUNK to investigate this issue (we all consistently run into it with different and diverse datasets).
As a short summary of our experiments, I can say that:
It seems that the most likely fix for this clumping issue would be to make the distance QC work properly in poppunk_assign, so that really bad-quality or highly divergent strains are not considered when re-estimating the model, where they would otherwise mess up the strain boundary.
So it's great that you have put this on your agenda of changes for v2.5.0; we're looking forward to seeing the effect of the fixes :-) !
We were also thinking: a convenient fix would be to not update the model when assigning new strains, i.e. keep the boundary where it is in the 2-D distance space and only assign strains to existing (or potentially new) strain clusters based on their distance profile.
This approach would have the benefit of allowing independent users to classify strains consistently relative to a reference database that may be publicly available, without these genomes having to be included in an updated reference db.
Do you think it would be possible to implement this option for poppunk_assign? Or to provide a more general fix to this 'cluster clumping' issue?
A note on the test presented in https://github.com/bacpop/PopPUNK/issues/194#issuecomment-1167471623 :
When running poppunk_assign in batches, I was reusing the same database files. Avril and I realised that the .h5 sketch database file was updated every time a query was run with poppunk_assign (the file changes in size, even if not by much, and its timestamp gets updated). So we suspected an incremental effect of changes brought by previous queries on subsequent queries.
With that concern in mind, I re-ran the tests, but this time restoring a fresh copy of the .h5 sketch database file into the reference database folder before each batch, so that every batch run would start from the same state.
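In practice that was just a copy-and-checksum step around each batch, something along these lines (paths are placeholders):

```bash
# Restore a pristine copy of the sketch database before the run;
# the checksum makes any modification by poppunk_assign visible afterwards
cp pristine/vc_db.h5 vc_db/vc_db.h5
md5sum vc_db/vc_db.h5 > before.md5

poppunk_assign --db vc_db --query batch_00.txt --output assign_batch_00 \
    --threads 30 --qc-filter prune > assign_batch_00.log 2>&1

md5sum -c before.md5 || echo "vc_db.h5 was modified by this run"
```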
I can confirm the results of the test still hold, with the same anomalous genomes causing the same mischief.
However, I noticed the results do change slightly! For instance, the number of clusters that clump due to the most anomalous genome #1051 changed from 86 to 84, and one other merger occurred in a query genome batch where none had occurred beforehand.
I don't know how much this should be ascribed to stochastic factors that could make each run unique, or to the difference between using a pristine vs. an already-queried database.
In any case, this reinforces my opinion that it would be nice to have an option in poppunk_assign that ensures the reference database remains exactly as it was before the query.
This should be fixed in v2.5.0 (and see new docs too). Please feel free to reopen if there are still issues
Hi Nick and John,
I'm back with other worries on using PopPUNK, this time not on Strep pneumo. It's been a while since I ran the commands below, so maybe this has all been addressed in version 2.4.0?
Versions
poppunk_assign 2.3.0
poppunk_sketch 1.6.2
Command used and output returned
see an excerpt of the log:
Describe the bug
Not really a bug, but I'm puzzled by the output: upon assigning new strains, all 345 clusters previously defined in the reference database got merged into a single one! Not really helpful for strain classification...
Can you advise on what has gone wrong and how to address it?
Note that the reference database was built with the following commands, with options chosen notably to address the wide variation in accessory genome content among the input set (see previous posts in #135):
Cheers, Florent