bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

Can't fit the model for Salmonella Enteritidis #53

Closed apredeus closed 4 years ago

apredeus commented 4 years ago

Hello John,

thank you for a very interesting tool. I'm trying to fit a model for about 3000 genomes of a specific Salmonella serovar (Enteritidis). The problem seems to be that its pangenome is not very diverse, and the model fits poorly. The default settings (K=2) produced a fit with a score of 0.0604. Refinement failed with no error message, and DBSCAN also just fails. Fitting a model with K=3 generates the following parameters; however, it identifies 82 clusters, which seems excessive?

Fit summary:
    Avg. entropy of assignment    0.0060
    Number of components used     3

Scaled component means:
    [0.06682449 0.18523007]
    [0.0311445 0.06276067]
    [0.59918419 0.36938257]

Warning: trying to create very large network
Network summary:
    Components      82
    Density         0.4531
    Transitivity    0.8935
    Score           0.4887

I'd appreciate any advice on how to get it to work. Thank you! I've attached the accessory/core genome plot.

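[image] https://user-images.githubusercontent.com/7825825/67123455-442cbc80-f1e8-11e9-932b-5d5ab0114f76.png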

nickjcroucher commented 4 years ago

Hi Alex,

Based on the distribution, it might be worth trying a higher K with the 2D Gaussian (e.g. K=10), although DBSCAN would probably be better – is there any error message when DBSCAN fails? Have you tried altering the DBSCAN parameters? Looking at the distribution, it may also be worth refining the model in the core-only mode (--indiv-refine).

Nick.


apredeus commented 4 years ago

Hello Nick,

thank you for your reply. All I get (aside from some deprecation warnings) is the message

Failed to find distinct clusters in this dataset

Are there any other options to use on the model fit stage? I'm getting the same error for the selection of Typhimuriums I'm processing.

nickjcroucher commented 4 years ago

Hi Alex,

You can try altering the constraints on the DBSCAN fit (judging from the distribution, increasing --min-cluster-prop might be helpful):

Model fit options:
  --K K                 Maximum number of mixture components [default = 2]
  --dbscan              Use DBSCAN rather than mixture model
  --D D                 Maximum number of clusters in DBSCAN fitting [default = 100]
  --min-cluster-prop MIN_CLUSTER_PROP
                        Minimum proportion of points in a cluster in DBSCAN fitting [default = 0.0001]

Perhaps the best approach would be to add in other Salmonella genomes – even a small number of representatives of other serotypes should serve as effective outgroups in the analysis, and shouldn’t increase the runtime much.


apredeus commented 4 years ago

I've tried adding 20 Typhimurium genomes, and DBSCAN still fails. I think the isolates are too close to each other, and that's probably what's causing it. My sample fits seem to confirm it: most have a very high proportion of matches. It would be good to try a larger k-mer size, but it seems that anything above 29 is not supported by Mash?

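[image] https://user-images.githubusercontent.com/7825825/67235368-dc23e380-f43e-11e9-8ce5-be31bf137826.png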

nickjcroucher commented 4 years ago

You can increase the sketch size to get a more precise measurement of divergence between closely related isolates, perhaps by as much as 10-fold?
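To illustrate why a larger sketch helps with very similar genomes, here is a rough sketch based on the standard Mash distance estimate, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index estimated from the sketch. It computes the smallest non-zero distance a sketch can resolve, i.e. when exactly one hash differs. The sketch sizes (10,000 and 100,000) are illustrative assumptions, and PopPUNK's own core and accessory distances come from a fit across several k-mer lengths, so this only shows the general trend.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Standard Mash distance estimate for a Jaccard index and k-mer length."""
    if jaccard <= 0:
        return 1.0  # no shared hashes: treat as maximally distant
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


def smallest_nonzero_distance(sketch_size: int, k: int) -> float:
    """Distance estimate when exactly one of `sketch_size` hashes differs."""
    jaccard = (sketch_size - 1) / sketch_size
    return mash_distance(jaccard, k)


# Assumed, illustrative sketch sizes; k=29 is the k-mer length from the thread.
for sketch_size in (10_000, 100_000):
    d = smallest_nonzero_distance(sketch_size, k=29)
    print(f"sketch size {sketch_size}: smallest resolvable distance ~ {d:.2e}")
```

Under these assumptions, a 10-fold larger sketch resolves distances roughly 10 times smaller, which is what matters when most isolate pairs share nearly all of their k-mers.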



johnlees commented 4 years ago

Looking at the original plot I think this should be possible with a GMM or refined fit, but possibly with some tweaking. I would certainly try stepping up through values of --K to start with.
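One way to step through values of --K is to generate one fit command per value and compare the network scores each run reports. The following is only a minimal sketch: the --K flag is taken from the help text in this thread, while --fit-model, --distances, --ref-db and --output are assumed from the PopPUNK 1.x command line and should be checked against your installed version. The database name `enteritidis_db` is hypothetical.

```python
def fit_commands(db="enteritidis_db", k_values=range(2, 11)):
    """Build one PopPUNK model-fit command per value of K (not executed here)."""
    commands = []
    for k in k_values:
        commands.append([
            "poppunk", "--fit-model",           # assumed PopPUNK 1.x mode flag
            "--distances", f"{db}/{db}.dists",  # assumed distances file layout
            "--ref-db", db,                     # hypothetical database name
            "--output", f"{db}_K{k}",           # separate output per K
            "--K", str(k),                      # flag shown in the help text above
        ])
    return commands


# Print the commands so they can be reviewed before running them in a shell.
for cmd in fit_commands():
    print(" ".join(cmd))
```

Writing each fit to its own output directory keeps the plots and network summaries from different values of K side by side for comparison.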

johnlees commented 4 years ago

Going to close this, but do reopen if more problems arise.