WilsonGregory closed this pull request 2 years ago.
I think these fixes are a good way to abstract away the benchmarks. I approve. I have a few questions related to new functionality above, but nothing blocking. Edit: Questions Resolved!
I was curious whether we have run some of these benchmarks already, because I wanted to give a small warning related to pytorch(-lightning) versions. I remember that the pytorch versions need to be somewhat recent to get the best results from MarkerMap. I can't remember exactly why, but I think it was related to some default initialization. Our methods performed on par with or slightly better than LassoNet on older versions and had a decent advantage on the more recent versions. The results I had uploaded to Github before were run with the more recent versions. I am not sure if I updated the arxiv paper with the latest results.
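In case it helps when rerunning, here is a rough sketch of recording the exact versions next to each benchmark result, so version effects are easy to spot later; `record_versions` is just a made-up helper name, not something in the repo:

```python
# Sketch: stash the library versions alongside each benchmark result so that
# version-dependent differences are easy to trace later.
# (record_versions is a hypothetical helper, not part of MarkerMap.)
import json
import torch
import pytorch_lightning as pl

def record_versions(results, path="benchmark_run.json"):
    results["torch_version"] = torch.__version__
    results["pytorch_lightning_version"] = pl.__version__
    with open(path, "w") as f:
        json.dump(results, f, indent=2)

record_versions({"dataset": "zeisel", "k": 50, "misclassification_rate": None})
```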
Ooh, thanks for the heads up. I have been using the latest pytorch-lightning, 1.5.10. I have had some trouble getting results that look similar to the paper. Here are some benchmarks that I ran:
Both of these were with the Zeisel data set, and not a huge number of trials (<7). I need to run on Paul and see if it is similar.
Also, this one had Smash RF doing much better than in the paper, but again it is only on 3 runs.
This is the plot that was in the paper that I was comparing against:
I haven't tuned any of the hyperparameters; I have just been using what was in the Zeisel notebook, I believe. So using the wrong hyperparameters could also be a factor.
I haven't changed the hyperparameters much.
Another thing might be the pytorch lightning version, because of how it sets the learning rate automatically. If we really need to, we can load the saved model weights and train/test splits and evaluate them on Colab to confirm the previous results; I don't know how useful that would be. Being able to replicate results on all machines is essential, though.
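To rule out the automatic learning rate, we could also compare a run where the LR is pinned explicitly against the tuned one. A rough sketch, assuming our LightningModule reads `self.lr` in `configure_optimizers` (the helper names and the 1e-3 value are just illustrative):

```python
# Sketch: compare Lightning's tuned learning rate against a fixed one.
# `model` is a LightningModule whose configure_optimizers reads self.lr,
# and `dm` is the LightningDataModule for the dataset being benchmarked.
import pytorch_lightning as pl

def fit_with_fixed_lr(model, dm, lr=1e-3, max_epochs=100):
    """Pin the learning rate so every machine/version trains identically."""
    model.lr = lr
    trainer = pl.Trainer(max_epochs=max_epochs)
    trainer.fit(model, datamodule=dm)
    return trainer

def fit_with_tuned_lr(model, dm, max_epochs=100):
    """Let Lightning 1.5's tuner pick the LR, which can vary across versions."""
    trainer = pl.Trainer(max_epochs=max_epochs, auto_lr_find=True)
    trainer.tune(model, datamodule=dm)  # writes the suggested LR into model.lr
    trainer.fit(model, datamodule=dm)
    return trainer
```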
Zeisel was always the dataset where LassoNet often actually beat us, so I am not surprised at those results. If MarkerMap doesn't outperform on the MouseBrain and Cite-Seq datasets, then I will be a lot more alarmed.
I might try to run supervised MarkerMap on Zeisel overnight on my computer and see what results I get for k = 50 markers.
I used to get lower misclassification values before, so let me see whether the failure to replicate is a machine/version issue.
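For comparing across machines, it probably also helps to keep the downstream evaluation identical. Roughly, this is the kind of number I have in mind; `X`, `y`, and `markers` stand in for the Zeisel expression matrix, the labels, and the k = 50 selected gene indices, and the kNN classifier here is only illustrative, not necessarily what the benchmark code uses:

```python
# Sketch of a downstream evaluation: train a simple classifier on only the
# selected markers and report the misclassification rate on a held-out split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def misclassification_rate(X, y, markers, seed=0):
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, markers], y, test_size=0.2, stratify=y, random_state=seed
    )
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    return 1.0 - clf.score(X_test, y_test)
```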
Oh okay, that is good to hear. I will run it on some of the other data sets.
I think I have good news.
I just made a new branch off main from right before you did the merge of this pull request, and ran just supervised MarkerMap on Zeisel.
I got the lower misclassification rates reported before.
Here is a screenshot.
Everything else is commented out because I didn't run those models locally. I think the issue might be versions.
I ran this with the following torch and lightning versions:
I am quite pleased that I was able to replicate the better Zeisel results for supervised MarkerMap, which means maybe we can get them replicated by you too. Not gonna lie, I was kinda panicking that my previous results were a fluke! I think the discrepancy is coming from pytorch lightning. It is a bit unfortunate if the quality of our results depends on software versions, but that isn't uncommon.
I actually never ran anything on this machine before, so that makes me even less panicked because it is a fresh environment!
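To at least take run-to-run randomness out of the cross-machine comparison, we could fix the seeds and ask Lightning for deterministic ops; a rough sketch for the 1.5.x line we are on (flag names can differ in other versions):

```python
# Sketch: make runs as repeatable as possible, so any remaining differences
# between machines point at library versions rather than randomness.
import pytorch_lightning as pl

pl.seed_everything(42, workers=True)  # seeds python, numpy, and torch
trainer = pl.Trainer(
    max_epochs=100,
    deterministic=True,    # prefer deterministic torch ops where available
    auto_lr_find=False,    # keep the learning rate fixed across versions
)
```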
Details on the branch: it is called test_out_jit, but I am sure the results above would be the same if I ran it off the latest main, since we didn't change any functionality. You can run a diff between the latest main and the test_out_jit branch for peace of mind.
Also, since you are reading this: not sure if you saw this paper, but it helped me out when I was learning about the Gumbel-max trick!
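In a nutshell, the trick is that adding i.i.d. Gumbel noise to the logits and taking the argmax gives an exact sample from softmax(logits), which is what the relaxed (Gumbel-softmax / concrete) selection builds on. A tiny torch check, just for illustration:

```python
# Tiny check of the Gumbel-max trick: argmax(logits + Gumbel noise) is an
# exact sample from Categorical(softmax(logits)).
import torch

torch.manual_seed(0)
logits = torch.tensor([1.0, 0.5, -1.0, 2.0])

def gumbel_max_sample(logits):
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    return torch.argmax(logits + gumbel)

counts = torch.zeros(len(logits))
for _ in range(20000):
    counts[gumbel_max_sample(logits)] += 1

print(counts / counts.sum())           # empirical frequencies
print(torch.softmax(logits, dim=0))    # should match closely
```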
@beelze-b I am seeing results similar to yours for this; this was over 10 runs on Zeisel:
And here are the pytorch versions:
I think these results are similar enough that the difference is within the randomness of the splits. I am going to rerun the benchmark models to see how it compares to LassoNet and Smash, but I think for k=50 I will be able to reproduce your results.
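One way to make the "within the randomness of the splits" comparison concrete is to report the mean and standard deviation over repeated splits; a rough sketch, reusing the hypothetical misclassification_rate helper from the earlier sketch:

```python
# Sketch: aggregate the misclassification rate over repeated random splits so
# machine/version differences can be judged against split-to-split noise.
# Assumes the misclassification_rate sketch from above is in scope.
import numpy as np

def benchmark_over_splits(X, y, markers, n_runs=10):
    rates = [misclassification_rate(X, y, markers, seed=s) for s in range(n_runs)]
    return float(np.mean(rates)), float(np.std(rates))
```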
Changes
Tests
Doc Changes
Future TODOs: