angelolab / ark-analysis

Integrated pipeline for multiplexed image analysis

https://ark-analysis.readthedocs.io/en/latest/

MIT License

Potential bug in neighborhood assignments #968

Closed ngreenwald closed 1 year ago

ngreenwald commented 1 year ago

Describe the bug

I'm running into some weird behavior with the neighborhood analysis script. Specifically, it seems like cells with very similar neighborhoods are being assigned to different clusters.

For example, in the upper right-hand corner, all of the blue cancer cells seem to have almost exactly the same neighbors.

However, they are assigned to different neighborhoods in the output.

I'm not sure if this is related to #967. It could be that the visualization isn't working correctly. However, the heatmap of the clusters roughly lines up with the visual, so I think that's less likely. Not sure exactly what's going on. I think a good first step once #967 is resolved will be to re-run on some previous data and confirm that we still get qualitatively the same clustering results, making sure to re-generate the neighbor_counts rather than using the previously extracted ones.

camisowers commented 1 year ago

Looks like the color assignment is accurate based on the output k-means clustering results. Seems like it could be an issue with the calculation of either the neighbor matrices or the distance matrices, both of which have been adjusted in the last 4 months. Which previous data should I test out?

ngreenwald commented 1 year ago

You could either rerun it on some data where you know what it should look like, such as Erin’s or the example dataset, or you could take the same data and run it with a commit from a couple of months ago, before the refactoring.

I’m not 100% convinced there’s a problem, but I think there might be. So just some initial validation to figure out whether there’s an obvious issue or not.
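One way to run this kind of validation is to compare the cluster labels from two runs with a permutation-invariant score, since k-means cluster IDs are arbitrary and can be shuffled between runs even when the partitions agree. The sketch below uses scikit-learn's adjusted_rand_score on synthetic data. The matrix, the cluster count, and the helper name compare_runs are illustrative assumptions, not part of ark-analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def compare_runs(labels_a, labels_b):
    """Hypothetical helper: agreement between two clusterings.

    Returns the adjusted Rand index, which is 1.0 when the two
    partitions are identical up to a renaming of the cluster IDs.
    """
    return adjusted_rand_score(labels_a, labels_b)


# Synthetic stand-in for a cells x cell-type neighbor count matrix:
# four well-separated groups of 100 cells each (assumed data).
rng = np.random.default_rng(0)
centers = 20.0 * np.eye(4, 6)
counts = np.vstack([rng.normal(c, 1.0, size=(100, 6)) for c in centers])

# Two independent runs, standing in for "old commit" vs "main".
labels_ref = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(counts)
labels_new = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(counts)

ari_between_runs = compare_runs(labels_ref, labels_new)

# Renaming the cluster IDs leaves the score unchanged.
ari_relabelled = compare_runs(labels_ref, (labels_ref + 1) % 4)

print(ari_between_runs, ari_relabelled)
```

A score close to 1 would suggest the two runs produce qualitatively the same neighborhoods, while a much lower score would point to a real change in the clustering rather than a plotting or color-mapping issue.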

camisowers commented 1 year ago

I verified that the generated neighbor matrices have not changed, but it looks like there was an issue with the KMeans function call itself. Scikit-learn 1.2 changed the default n_init param from 10 to 'auto', which then caused the algorithm to run only once. On the left is the clustering using a commit from October and the right is using main.

I was able to get the same results as before by adding n_init=10 to the KMeans() call.
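A minimal sketch of that kind of change is shown below, assuming a generic cells x cell-type count matrix rather than the actual ark-analysis code. With n_init='auto' and the default k-means++ initialization, scikit-learn runs a single initialization, so passing n_init=10 explicitly keeps the best-of-ten behavior regardless of the installed version's default. The function name cluster_neighborhoods, the synthetic data, and the seed are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_neighborhoods(counts, k, seed=42):
    """Hypothetical wrapper: k-means with an explicit number of restarts.

    n_init is set explicitly so the result does not depend on the
    scikit-learn default, and random_state makes the run repeatable.
    """
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(counts)
    return labels, model.inertia_


# Synthetic stand-in for a neighbor count matrix (assumed data):
# 500 cells, counts of 8 neighboring cell types per cell.
rng = np.random.default_rng(0)
neighbor_counts = rng.poisson(lam=3.0, size=(500, 8)).astype(float)

labels, inertia = cluster_neighborhoods(neighbor_counts, k=5)
print(np.bincount(labels), inertia)
```

Fixing both n_init and random_state makes it easier to tell whether a difference between two runs comes from the data or from the clustering settings.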

I can open a quick PR now to fix this.

ngreenwald commented 1 year ago

Sounds good, thanks! Then we can redo the TONIC clustering and see if things still look weird or if this was the issue.