bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0
91 stars 18 forks source link

some issues with conda version #45

Closed evezeyl closed 5 years ago

evezeyl commented 5 years ago

Hi, I have done some testing to see if we can use poppunk at my work-place: I have some concerns that might be bugs, and some suggestions for improvement

1) I get an error message when assigning queries: the original database is not updated

`Updating reference database to our_samples_query2
Traceback (most recent call last):
  File "/home/evezeyl/anaconda3/envs/poppunk/bin/poppunk", line 10, in <module>
    sys.exit(main())
  File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/__main__.py", line 381, in main
    args.rapidnj, args.perplexity)
  File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/__main__.py", line 488, in assign_query
    threads, mash, True) # overwrite old db
  File "/home/evezeyl/anaconda3/envs/poppunk/lib/python3.7/site-packages/PopPUNK/mash.py", line 282, in constructDatabase
    assert(genome_length > 0)`
UnboundLocalError: local variable 'genome_length' referenced before assignment

The command I used: poppunk --assign-query --ref-db all_in_one3 --distances all_in_one3/all_in_one3.dist --q-files query_list.txt --output our_samples_query2 --update-db

2) I might have problems to reproduce the DSCAN clustering as puplished (I used the listeria test data set as I want to work with listeria)

I will have to look how to use the git-hub version if I want to go further to be able to reuse your functions

Generaly it would be nice to replot those graphs after refitting (good for people that are not yet totally proefficient with all bioinformatic tools...)

3) I looked at the files containing means (x,y) possitions for the blobs in DPGMM - it seemed to me some blobs were not represented (and the scalling was strange: some huge difference in distances estimated...) but I will have to look at that further if we decide to use poppunk) - I used a dataset for listeria downloaded from pasteur institute (diversity dataset - public isolates) The graph is attached. The means reported are bellow. What Am I missing? 2DGMM_fit_DPGMM_fit means : [[0.72720722 0.73200029] [0.09182613 0.21442411] [0.16055147 0.37754502] [0.82941529 0.94804593] [0.02714905 0.08603958] [0.60009867 0.8491825 ]]

Best regards Eve

nickjcroucher commented 5 years ago

Hi Eve,

  1. Hopefully this bug is now fixed with commit: 458ecc1116e6740c62173ef671486b4144bfd4d8

  2. We can take a look at this.

  3. These means are on a normalised [0-1] scaling on both axes, hence there isn't a correspondence with the plotted data. Would it be helpful to print these means relative to the original parameter values?

Thanks,

Nick.

johnlees commented 5 years ago

2 - Could you provide some more information on the command you used, and the output? I am unclear whether you are using just DBSCAN or then following that up with fit refinement?

3 - That plot looks like it has correctly captured the within-strain component, so I think it should work for making the database. You can probably get a slightly better fit at the larger distances by increasing the number of components with --K, but the main thing is that the component nearest the origin is correct (which it appears to be).

As Nick notes, the means are scaled when plotting, with the scaling factor stored in the 'scale' item saved with the model

evezeyl commented 5 years ago

Hei, Thanks for your efforts :) and your fast answer.

for 3) might be useful to print the normalized values on sdout (for those like me requiring manual start) - maybe just add "normalized values" to the legend. (would then remind us about it..)

All the best. Eve

On Thu, Apr 11, 2019, 14:56 nickjcroucher notifications@github.com wrote:

Hi Eve,

1.

Hopefully this bug is now fixed with commit: 458ecc1 https://github.com/johnlees/PopPUNK/commit/458ecc1116e6740c62173ef671486b4144bfd4d8 2.

We can take a look at this. 3.

These means are on a normalised [0-1] scaling on both axes, hence there isn't a correspondence with the plotted data. Would it be helpful to print these means relative to the original parameter values?

Thanks,

Nick.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/johnlees/PopPUNK/issues/45#issuecomment-482102853, or mute the thread https://github.com/notifications/unsubscribe-auth/Arc4jYwKj9DAdGyndPLOVvLKAHNrVSn7ks5vfzEUgaJpZM4cpgJ3 .

evezeyl commented 5 years ago

2 - Could you provide some more information on the command you used, and the output? I am unclear whether you are using just DBSCAN or then following that up with fit refinement?

The command used for the graph:

#database:
poppunk --create-db --r-files references.txt --output distances_db --sketch-size 100000 --plot-fit 5 --min-k 13  
#model fit
poppunk --fit-model --distances distances_db/distances_db.dists --ref-db distances_db \
--output 2DGMM_fit --full-db --K 6 --external-clustering previous_clustering.csv

sdout:

> Fit summary:
>   Avg. entropy of assignment  0.0097
>   Number of components used   6
> Network summary:
>   Components  37
>   Density 0.0379
>   Transitivity    1.0000
>   Score   0.9621

Refining with K 6 (no additional option):

poppunk --refine-model --distances distances_db/distances_db.dists --ref-db 2DGMM_fit  --full-db \
--output 2DGMM_refine 

sdout:

Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol =  1e-05 )
Network summary:
    Components  78
    Density 0.0091
    Transitivity    1.0000
    Score   0.9909

image

I think the boundary is way too much on the narrow side of the intra-lineage blob - so I adjust with manual start: sdout:

The returned value satisfies the termination criteria
(using xtol =  1e-05 )
Network summary:
    Components  36
    Density 0.0450
    Transitivity    0.9434
    Score   0.9010

image

Manual start was defined as such:

 mean0 0.014,0.09
 mean1 0.09182613,0.21442411
 start_point 0.4

I had to play with several start_point to do the trick, but this one appears to work, and there the decision boundary appear to be around where it should be. (But I do not understand why the position of the initial boundary is not the same on the 2 graphs - because I used the same model for refining - this is why I separated folders: to be able to adjust the refinint, restarting always from the same fitted model.)

3 - That plot looks like it has correctly captured the within-strain component, so I think it should work for making the database. You can probably get a slightly better fit at the larger distances by increasing the number of components with --K, but the main thing is that the component nearest the origin is correct (which it appears to be).

If I try with --K 7 Fit summary: Avg. entropy of assignment 0.0071 Number of components used 7 Network summary: Components 38 Density 0.0368 Transitivity 1.0000 Score 0.9632 image

which I do not think is better, because of this strange blob, but it is true the intra lineage blob is better defined then (and the statistics improvement appear minimal)

As Nick notes, the means are scaled when plotting, with the scaling factor stored in the 'scale' item saved with the model

4) Just to illustrate the dbscan option (same data)

poppunk --fit-model --distances distances_db/distances_db.dists \
--ref-db distances_db --output DBSCAN_fit --full-db --dbscan

sdout:

Fit summary:
    Number of clusters  7
    Number of datapoints    5356
    Number of assignments   14261637
Network summary:
    Components  38
    Density 0.0366
    Transitivity    0.9963
    Score   0.9598

But here on the image - its like all the identified cluster are in the within-lineages blob (only color here) and the majority of points are recognized as outliers

image

for the refine: poppunk --refine-model --distances distances_db/distances_db.dists --ref-db DBSCAN_fit --full-db --output DBSCAN_refine

sdout

Network summary:
    Components  4
    Density 0.4529
    Transitivity    1.0000
    Score   0.5471

image

which clearly shows that the boundary is not at the right place .. So I refine with manual start: image

Which is nearly there...I think

Here is why I suggest the following improvements: (keeping in mind that it has to be usable with people learning or with low experience in informatics)

johnlees commented 5 years ago

With Listeria monocytogenes, which has very well separated components, I would think a fit using the DPGMM or HDBSCAN would work best. The refine fit mode, as it currently works, won't get the right angle to draw the boundary, which should be almost vertical. Your refined fits from HDBSCAN and using manual start look ok, and similar to what we got using the Lm dataset from the paper.

Your --K 7 fit looks good too, all the components seem to be well identified. The 'strange blob' you are referring to contains a single point (if you look closely). This could be a poor quality genome to be removed. But actually this won't be a problem, as this component will have a very low weight compared to the others, and so won't have anything assigned to it unless it's right on top of that point. I appreciate this might be misleading from the way we plot this. As we just plot the means and covariances of the components it looks like the components with large covariances will have lots of thins assigned to them (in this case the covariance is large because the single point doesn't contain enough information to shrink it down from the prior). We don't currently represent the weights of the components in the plot, e.g. by changing their opacity, but if we did that strange blob would pretty much disappear.

The plot from DBSCAN is strange however. The summary output suggests it has worked, but this doesn't correspond to the plot at all. Indeed, I have never seen everything classified as noise. This makes me wonder whether we have introduced a bug in our plotting code.

Would you be able to attach your database (i.e. distances_db/distances_db.dists.npy and distances_db/distances_db.dists.pkl) here so I can try and replicate this?

With regards to your suggestions:

Thanks for your testing and input!

evezeyl commented 5 years ago

Here are the distances. Thanks a lot for your explanations (that was really helpful)
distances.zip

johnlees commented 5 years ago

Thanks for the distances. I had indeed introduced a bug into the DBSCAN plotting, which is now fixed. A fit to your distance data now looks like this, which looks good to me:

dbscan_db_dbscan

I have also added in your suggestion to print the means to stderr.

Hope that all works now. Let us know of any further issues!