bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

All clusters merged into a single one when trying to assign new strains #194

Closed: flass closed this issue 2 years ago

flass commented 2 years ago

Hi Nick and John,

I'm back with more worries about using PopPUNK, this time not on Strep pneumo - but it's been a while since I ran the commands below, so maybe this has all been addressed in version 2.4.0?

Versions: poppunk_assign 2.3.0, poppunk_sketch 1.6.2

Command used and output returned

poppunk_assign --db 950Vc --query combined_Vc_fasta.txt \
--output combined_Vc_poppunk_clusters --threads 8

See an excerpt of the log:

Sketching 2477 genomes using 8 thread(s)
Writing sketches to file
NOTE: ELPGI_16 required densification
NOTE: ELPGI_128 required densification
NOTE: EM-0272 required densification
NOTE: NHCC-148 required densification
Calculating distances using 8 thread(s)
PopPUNK: assign
        (with backend: sketchlib v1.6.2
         sketchlib: /lustre/scratch118/infgen/team216/fl4/miniconda3/envs/poppunk230/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 8 threads
Mode: Assigning clusters of query sequences

Loading DBSCAN model
WARNING: Accessory outlier at a=0.9157807 1:SAMEA1018328.contigs 2:EB-0206
WARNING: Accessory outlier at a=0.92919415 1:SAMEA1018453.contigs 2:EB-0206
[... total 89161 such warnings about outlier pairs ...]
WARNING: Accessory outlier at a=0.94155586 1:SAMEA2445263.contigs 2:NHCC-138
WARNING: Accessory outlier at a=0.7582968 1:SAMN08979208.contigs 2:NHCC-138
Network loaded: 381 samples
Found novel query clusters. Calculating distances between them.
Clusters 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,
95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,13
1,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166
,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,
202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,2
37,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,27
2,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307
,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,
343,344,345 have merged into 1_2_3_4_5_6_7_8_9_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31_32_33_34_35_36_37_38_39_40_4
1_42_43_44_45_46_47_48_49_50_51_52_53_54_55_56_57_58_59_60_61_62_63_64_65_66_67_68_69_70_71_72_73_74_75_76_77_78_79_80_81_82_83_84_85_86_87_8
8_89_90_91_92_93_94_95_96_97_98_99_100_101_102_103_104_105_106_107_108_109_110_111_112_113_114_115_116_117_118_119_120_121_122_123_124_125_12
6_127_128_129_130_131_132_133_134_135_136_137_138_139_140_141_142_143_144_145_146_147_148_149_150_151_152_153_154_155_156_157_158_159_160_161
_162_163_164_165_166_167_168_169_170_171_172_173_174_175_176_177_178_179_180_181_182_183_184_185_186_187_188_189_190_191_192_193_194_195_196_
197_198_199_200_201_202_203_204_205_206_207_208_209_210_211_212_213_214_215_216_217_218_219_220_221_222_223_224_225_226_227_228_229_230_231_2
32_233_234_235_236_237_238_239_240_241_242_243_244_245_246_247_248_249_250_251_252_253_254_255_256_257_258_259_260_261_262_263_264_265_266_26
7_268_269_270_271_272_273_274_275_276_277_278_279_280_281_282_283_284_285_286_287_288_289_290_291_292_293_294_295_296_297_298_299_300_301_302
_303_304_305_306_307_308_309_310_311_312_313_314_315_316_317_318_319_320_321_322_323_324_325_326_327_328_329_330_331_332_333_334_335_336_337_
338_339_340_341_342_343_344_345

Done

Describe the bug
Not really a bug, just that I'm puzzled by the output: upon assigning new strains, all 345 clusters previously defined in the reference database got merged into a single one! Not really helpful for strain classification...

Can you advise on what has gone wrong and how to address it?

Note that the reference database was built with the following commands, with options chosen notably to address the wide variation in accessory genome content among the input set (see previous posts in #135):

poppunk --create-db --r-files 950Vc_genomelist --output 950Vc --threads 8 \
  --min-k 15 --max-k 35 --plot-fit 5 \
  --qc-filter "prune" --length-range "3000000 6000000" --max-a-dist 0.99
poppunk --fit-model "dbscan" --ref-db 950Vc --output 950Vc --threads 8 \
  --qc-filter "prune" --length-range "3000000 6000000" --max-a-dist 0.99 --D 350

Cheers, Florent

johnlees commented 2 years ago

Hi Florent,

This can often be caused by a low-quality sample (or more than one), which our QC for poppunk_assign isn't always good at picking up. Can you try running the new samples through in smaller batches (of say 10-100) to see if this keeps happening, and whether that identifies a problem sample or two?
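Something like this shell loop would do it - just a rough sketch: the query list and database names are the ones from your command above, the batch/output names are placeholders, and you can adjust the batch size. Keeping each run's output in its own log lets you grep for the merge messages afterwards:

split -l 50 -d combined_Vc_fasta.txt query_batch_    # 50 queries per batch
for batch in query_batch_??; do
    poppunk_assign --db 950Vc --query "$batch" \
        --output assign_"$batch" --threads 8 > "$batch".log 2>&1
done
grep "have merged" query_batch_*.log    # flag batches that trigger cluster merges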

flass commented 2 years ago

Hi John, thanks for the suggestion, I'll try that.

flass commented 2 years ago

Hi John,

I finally found the time to follow up on this.

Because the dataset used above had its own issues (weird patterns of over-estimated distances), I chose to make and use another reference database, using >4000 V. cholerae genomes of diverse origin (but with a large fraction of very similar ones). For all commands below I used executables from PopPUNK v.2.4.0 installed through a conda environment that has, among others, these packages:

poppunk                   2.4.0            py39h8884e85_2    bioconda
pp-sketchlib              1.7.4            py39h2d76373_2    conda-forge

This new database was built with the commands poppunk --create-db then poppunk --fit-model with the following parameters:

--qc-filter prune --retain-failures --D 15 --max-a-dist 0.99 --max-pi-dist 0.35 --length-range 3000000 6000000 --min-k 13 --max-k 35 --k-step 2 --sketch-size 100000

and then refined with poppunk --fit-model refine. I obtained 1140 strain clusters in the refined database.
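In full, the two steps looked roughly like the commands below (reconstructed from the parameters above - VC2022_genomelist and VC2022 are just placeholder names for the input list and database prefix, the thread count is illustrative, and the exact split of flags between the steps may have differed slightly):

poppunk --create-db --r-files VC2022_genomelist --output VC2022 --threads 8 \
  --min-k 13 --max-k 35 --k-step 2 --sketch-size 100000 \
  --qc-filter prune --retain-failures --length-range 3000000 6000000 \
  --max-a-dist 0.99 --max-pi-dist 0.35
poppunk --fit-model dbscan --ref-db VC2022 --output VC2022 --threads 8 --D 15
poppunk --fit-model refine --ref-db VC2022 --output VC2022 --threads 8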

You can see that in this database the distances are completely fine, with a nice a ~ pi linear relationship. [Attached plots: poppunkVC2022_4132genomes_distanceDistribution, poppunkVC2022_4132genomes_dbscan, poppunkVC2022_4132genomes-refine_refined_fit]

I then ran the same query dataset described above (2214 genomes) against this new database. The poppunk_assign command was run with default parameters except for --qc-filter 'prune', meaning the distance QC uses --max-a-dist 0.5 --max-pi-dist 0.5.
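In other words, roughly this (VC2022, the output name and thread count are placeholders, not my literal paths):

poppunk_assign --db VC2022 --query combined_Vc_fasta.txt \
  --output VC2022_assign --threads 8 --qc-filter prune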

This again led to a clumping artefact - though not a complete one like I used to experience with my old database:

Clusters 1,2,4,6,8,9,12,27,55,76,77,80,95,97,121,127,129,133,136,141,145,168,176,183,201,211,221,223,239,287,290,291,292,296,308,313,325,326,329,336,356,366,370,371,399,403,410,420,435,561,570,572,573,578,579,586,596,606,608,609,615,635,646,651,655,699,728,740,747,778,867,871,873,875,922,982,1006,1008,1030,1049,1073,1074,1075,1077,1078,1079,1083,1084,1086,1088,1092,1095,1096,1100 have merged into 1_2_4_6_8_9_12_27_55_76_77_80_95_97_121_127_129_133_136_141_145_168_176_183_201_211_221_223_239_287_290_291_292_296_308_313_325_326_329_336_356_366_370_371_399_403_410_420_435_561_570_572_573_578_579_586_596_606_608_609_615_635_646_651_655_699_728_740_747_778_867_871_873_875_922_982_1006_1008_1030_1049_1073_1074_1075_1077_1078_1079_1083_1084_1086_1088_1092_1095_1096_1100

That’s still 94 clusters (out of 1140) merging into one, so I doubt this is normal behaviour.

This merger brings together strains from the distinct V. cholerae lineages 7PET (Clusters 1 & 2) and Classical (Cluster 6), indicating it's not just "as it should be".

So there is still a good reason to complain!

Then I did as you recommended and ran the queries in batches of 50.

Indeed, it seems that only a few genomes are to blame for the clumping reported above:

1051-1100.log:Clusters 1,8,12,27,55,76,77,80,95,97,121,127,129,133,136,141,145,168,176,183,201,211,221,223,239,287,290,291,292,296,308,313,325,326,329,336,356,366,370,371,399,403,420,435,561,570,572,573,578,579,586,596,606,608,609,615,635,646,651,655,699,728,740,778,867,871,875,922,1006,1008,1030,1049,1073,1074,1075,1078,1079,1083,1084,1086,1088,1092,1095,1100 have merged into 1_8_12_27_55_76_77_80_95_97_121_127_129_133_136_141_145_168_176_183_201_211_221_223_239_287_290_291_292_296_308_313_325_326_329_336_356_366_370_371_399_403_420_435_561_570_572_573_578_579_586_596_606_608_609_615_635_646_651_655_699_728_740_778_867_871_875_922_1006_1008_1030_1049_1073_1074_1075_1078_1079_1083_1084_1086_1088_1092_1095_1100
1051-1100.log:Clusters 9,410 have merged into 9_410
1101-1150.log:Clusters 9,410 have merged into 9_410
1151-1200.log:Clusters 9,410 have merged into 9_410
1401-1450.log:Clusters 9,410 have merged into 9_410
251-300.log:Clusters 1,1077 have merged into 1_1077

Batch 1051-1100 alone is responsible for the clumping of 86 clusters (interestingly, the merged set does not include cluster 2 or 6!). The other mergers, 9_410 and 1_1077, are recurring, suggesting these clusters are meant to be merged, as they probably represent closely related strains.

So I ran a more fine-grained search for the culprit genome(s) in batch 1051-1100: it turns out it can all be blamed on a single query genome!

1051.log:Clusters 1,8,12,27,55,76,77,80,95,97,121,127,129,133,136,141,145,168,176,183,201,211,221,223,239,287,290,291,292,296,308,313,325,326,329,336,356,366,370,371,399,403,420,435,561,570,572,573,578,579,586,596,606,608,609,615,635,646,651,655,699,728,740,778,867,871,875,922,1006,1008,1030,1049,1073,1074,1075,1078,1079,1083,1084,1086,1088,1092,1095,1100 have merged into 1_8_12_27_55_76_77_80_95_97_121_127_129_133_136_141_145_168_176_183_201_211_221_223_239_287_290_291_292_296_308_313_325_326_329_336_356_366_370_371_399_403_420_435_561_570_572_573_578_579_586_596_606_608_609_615_635_646_651_655_699_728_740_778_867_871_875_922_1006_1008_1030_1049_1073_1074_1075_1078_1079_1083_1084_1086_1088_1092_1095_1100
1086.log:Clusters 9,410 have merged into 9_410

Querying genome 1051 alone leads to the clumping of 84 clusters. Interestingly, when done in bulk, the query leads to more clusters being clumped: the bad genome alone -> 84 clusters clumped; within a 50-genome batch -> 86 clusters; within the 2214-genome batch -> 94 clusters.

Looking at the QC metrics for this genome, it indeed should not have been included: its length is 2.8 Mbp and the Kraken search shows it's definitely not V. cholerae:

Total,3064343
Unclassified,88.34
"Staphylococcus aureus",3.40
[...]
"Vibrio cholerae”,0.03

So it is indeed crucial to only include vetted genomes! But this genome should never have passed the distance QC!
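For what it's worth, a crude pre-filter like the sketch below would have caught it before assignment - assuming the query list holds one path to an uncompressed FASTA per line (adjust the read if yours is "name<TAB>path"), and using the same 3-6 Mbp range as the database QC:

# Keep only query assemblies whose total sequence length is within 3-6 Mbp;
# report the rest on stderr. query_list.txt is a placeholder name.
while read -r fasta; do
  len=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -c)
  if [ "$len" -ge 3000000 ] && [ "$len" -le 6000000 ]; then
    echo "$fasta"
  else
    echo "SKIP ($len bp): $fasta" >&2
  fi
done < query_list.txt > query_list.length_ok.txt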

But looking at the log:

Loading previously refined model
Completed model loading
Sketching 1 genomes using 1 thread(s)
Progress (CPU): 1 / 1
Writing sketches to file
Calculating distances using 30 thread(s)
Progress (CPU): 100.0%
Selected type isolate for distance QC is GCA_000016245.1_ASM1624v1
WARNING: Did not find samples to remove:
ELPGI_429
Couldn't find ELPGI_429 in database
Pruned from the database after failing distance QC: ELPGI_429
Network loaded: 1284 samples
Clusters 1,8,12,27,55,76,77,80,95,97,121,127,129,133,136,141,145,168,176,183,201,211,221,223,239,287,290,291,292,296,308,313,325,326,329,336,356,366,370,371,399,403,420,435,561,570,572,573,578,579,586,596,606,608,609,615,635,646,651,655,699,728,740,778,867,871,875,922,1006,1008,1030,1049,1073,1074,1075,1078,1079,1083,1084,1086,1088,1092,1095,1100 have merged into 1_8_12_27_55_76_77_80_95_97_121_127_129_133_136_141_145_168_176_183_201_211_221_223_239_287_290_291_292_296_308_313_325_326_329_336_356_366_370_371_399_403_420_435_561_570_572_573_578_579_586_596_606_608_609_615_635_646_651_655_699_728_740_778_867_871_875_922_1006_1008_1030_1049_1073_1074_1075_1078_1079_1083_1084_1086_1088_1092_1095_1100

So this genome does induce extreme distances, and although it does not pass the filter and should be removed, it still leads to the clumping artefact. Notice the error saying "Couldn't find ELPGI_429 in database". This may hint at a bug where poppunk_assign attempts to get rid of the problematic genome but searches for it in the wrong place!

I hope this gives us a good lead for solving the issue!

Please let me know what you think.

Cheers,

Florent

johnlees commented 2 years ago

Hi Florent,

Thanks for the detailed investigation! That's really helpful, and I think I have some ideas about how to better deal with this in future. I'm going to add this to my todo list for v2.5.0, so I hope that in the next release this won't be an issue.

In more detail, I think there are two possible ways this can happen, which may occur separately or together:

  1. A single/a few 'bad' genome(s) with artificial zero/close-to-zero distances to many other samples, which links much of the network together.
  2. A boundary which is actually a bit lax, and when more isolates are added, they have distances just below the boundary between a few clusters, and there are enough of these to link more and more clusters together.

For 1) we need to fix the distance QC as you say, checking for cluster linking and for too many zero distances. For 2) it is a little more difficult, but I expect that similar measures will be able to prune these isolates from the network (while still allowing assignments) without needing a re-fit.

flass commented 2 years ago

Thanks John for the rapid response.

I agree with your analysis suggesting a combination of two problems, including something to do with the boundary.

This echoes some of the findings and questions that I and others in my team (Avril, Astrid) have been gathering by experimenting with PopPUNK to investigate this issue (as we all consistently run into it with different and diverse datasets).

As a short summary of our experiments, I can say that:

It seems that the most likely fix for this clumping issue would be to make the distance QC work properly in poppunk_assign, so that really bad-quality/highly divergent strains are not considered when re-estimating the model, where they would mess up the strain boundary. So it's great you have put that on your agenda of changes for v2.5.0; we're looking forward to seeing the effect of the fixes :-) !

We were also thinking: a convenient fix would be to not update the model when assigning new strains, i.e. keep the boundary where it is in the 2-D distance space, and only assign strains to existing (or potentially new) strain clusters based on their distance profiles.

This approach would have the benefit of allowing independent users to classify strains consistently relative to a reference database that may be available publicly, without these genomes having to be included in the updated reference db.

Do you think it would be possible to implement this option for poppunk_assign? Or to provide a more general fix to this 'cluster clumping' issue?

flass commented 2 years ago

A note on the test presented in https://github.com/bacpop/PopPUNK/issues/194#issuecomment-1167471623:

When running poppunk_assign in batches, I was reusing the same database files. Avril and I realised that the .h5 sketch database file was updated every time a query was run with poppunk_assign (the file changes in size, albeit not by much, and its timestamp gets updated). So we suspected an incremental effect of changes introduced by previous queries on subsequent queries.

With that concern in mind, I re-ran the tests, but this time bringing a fresh copy of the .h5 sketch database file into the reference database folder so that every batch run would start from the same state.
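Roughly, each batch run was preceded by restoring the untouched copy, something like this (pristine_copy/ and the VC2022 prefix are placeholder names, assuming the sketch file sits at <db>/<db>.h5 as usual):

# Put back an unmodified sketch database before every batch, so no batch
# sees changes left behind by a previous poppunk_assign run.
for batch in query_batch_??; do
  cp pristine_copy/VC2022.h5 VC2022/VC2022.h5
  poppunk_assign --db VC2022 --query "$batch" \
    --output assign_"$batch" --threads 8 > "$batch".log 2>&1
done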

I can confirm the results of the test still hold, with the same anomalous genomes causing the same mischief.

However, I noticed the results do change slightly! For instance, the number of clusters that clump due to the most anomalous genome (#1051) changed from 86 to 84, and one other merger occurred on a query genome batch where it had not before.

I don't know how much of this should be ascribed to possible stochastic variables that could make each run unique, versus to the difference between using a pristine and an already-queried database.

In any case, this reinforces my opinion that it would be nice to have an option in poppunk_assign that ensures the reference database remains as it was before the query.

johnlees commented 2 years ago

This should be fixed in v2.5.0 (and see the new docs too). Please feel free to reopen if there are still issues.