Closed nickjcroucher closed 4 months ago
Will take a look before end of today!
Thanks, but it's not urgent, I just wanted to fix my broken access issues and get this onto the right repo before my machine melts!
Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:
Sounds like a good thing to improve, I've typically been using refined boundaries as everything else seems to struggle. Assigning to DBSCAN models is slow -- apparently CUDA was going to add something to make this much faster?
* indistinct clustering criterion; this was very strict (separation between within and between strain cluster on both axes required), rejecting some sensible fits; now relaxed (separation only required on one axis) - let me know if you think the stricter option should still be available though a flag, or if you're happy with change across the board
No, sounds like a good change! I think that's caused frustrations before (and heaven knows we don't need more options)
* slow fitting to large datasets; implemented GPU version of DBSCAN, which is fast; the problem is then assigning all distances, which is slow, because the model takes up a lot of GPU memory, and copying over batches of distances into the variable amount of remaining GPU memory (customisable with the new `--assign-subsample` option) negates the speed up of the initial fit
Interesting, will look at this part
* slow assignment of distances to model fit; this is inefficient, as we typically don't use the assignments of points to the initial model fit, and it takes ages on a large dataset. Instead I have added a `--no-assign` flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all points
Would it be appropriate to make this the default or perhaps remove the former behaviour as an option altogether? Aware we have a very long CLI
If you approve these changes conceptually, then I'll add tests and docs.
Great, will have a quick look through the code now
Looks like no issues with tests & mandrake here?
Thanks for the comments! Nope, no mandrake issues here, seems like it is a local installation issue. I will adjust the CLI and cascade the changes through the code.
One final thing – maybe want to bump the version if not already past current release
One final thing – maybe want to bump the version if not already past current release
Glad someone was paying attention, done in 2920b00.
Motivated by trying to fit a DBSCAN model to a large dataset. Problems were:
--assign-subsample
option) negates the speed up of the initial fit--no-assign
flag, which skips the assignment, labels the model appropriately, and allows a refined model fit that then assigns all pointsIf you approve these changes conceptually, then I'll add tests and docs.