Updates to DBSCAN fitting

nickjcroucher commented 4 months ago

Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:

indistinct clustering criterion; this was very strict (separation between within and between strain cluster on both axes required), rejecting some sensible fits; now relaxed (separation only required on one axis) - let me know if you think the stricter option should still be available though a flag, or if you're happy with change across the board
slow fitting to large datasets; implemented GPU version of DBSCAN, which is fast; the problem is then assigning all distances, which is slow, because the model takes up a lot of GPU memory, and copying over batches of distances into the variable amount of remaining GPU memory (customisable with the new --assign-subsample option) negates the speed up of the initial fit
slow assignment of distances to model fit; this is inefficient, as we typically don't use the assignments of points to the initial model fit, and it takes ages on a large dataset. Instead I have added a --no-assign flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all points

If you approve these changes conceptually, then I'll add tests and docs.

johnlees commented 4 months ago

Will take a look before end of today!

nickjcroucher commented 4 months ago

Thanks, but it's not urgent, I just wanted to fix my broken access issues and get this onto the right repo before my machine melts!

johnlees commented 4 months ago

Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:

Sounds like a good thing to improve, I've typically been using refined boundaries as everything else seems to struggle. Assigning to DBSCAN models is slow -- apparently CUDA was going to add something to make this much faster?

* indistinct clustering criterion; this was very strict (separation between within and between strain cluster on both axes required), rejecting some sensible fits; now relaxed (separation only required on one axis) - let me know if you think the stricter option should still be available though a flag, or if you're happy with change across the board

No, sounds like a good change! I think that's caused frustrations before (and heaven knows we don't need more options)

* slow fitting to large datasets; implemented GPU version of DBSCAN, which is fast; the problem is then assigning all distances, which is slow, because the model takes up a lot of GPU memory, and copying over batches of distances into the variable amount of remaining GPU memory (customisable with the new `--assign-subsample` option) negates the speed up of the initial fit

Interesting, will look at this part

* slow assignment of distances to model fit; this is inefficient, as we typically don't use the assignments of points to the initial model fit, and it takes ages on a large dataset. Instead I have added a `--no-assign` flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all points

Would it be appropriate to make this the default or perhaps remove the former behaviour as an option altogether? Aware we have a very long CLI

If you approve these changes conceptually, then I'll add tests and docs.

Great, will have a quick look through the code now

johnlees commented 4 months ago

Looks like no issues with tests & mandrake here?

nickjcroucher commented 4 months ago

Thanks for the comments! Nope, no mandrake issues here, seems like it is a local installation issue. I will adjust the CLI and cascade the changes through the code.

johnlees commented 4 months ago

One final thing – maybe want to bump the version if not already past current release

nickjcroucher commented 4 months ago

One final thing – maybe want to bump the version if not already past current release

Glad someone was paying attention, done in 2920b00.

bacpop / PopPUNK

Updates to DBSCAN fitting #302