Computational-Morphogenomics-Group / MarkerMap

Marker selection, supervised and unsupervised
MIT License

Benchmark Wrappers #12

Closed · WilsonGregory closed this 2 years ago

WilsonGregory commented 2 years ago

Changes

Tests

Doc Changes

Future TODOs:

beelze-b commented 2 years ago

I think these fixes are a good way to abstract away the benchmarks. I approve. I have a few questions related to new functionality above, but nothing blocking. Edit: Questions Resolved!

I was curious whether we have run some of these benchmarks already, because I wanted to give a small warning about pytorch(-lightning) versions. I remember that the pytorch versions need to be somewhat recent to get the best results from MarkerMap. I can't remember exactly why, but I think it was related to some default initialization. Our methods performed on par with or slightly better than LassoNet on older versions and had a decent advantage on the more recent ones. The results I had uploaded to GitHub before were run with the more recent versions. I am not sure if I updated the arXiv paper with the latest results.
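If it helps when comparing runs across machines, something as simple as the snippet below (plain Python, nothing MarkerMap-specific) records the exact versions a benchmark was run with:

```python
# Record the exact framework versions alongside benchmark results so runs on
# different machines can be compared later.
import torch
import pytorch_lightning as pl

print("torch:", torch.__version__)
print("pytorch-lightning:", pl.__version__)
```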

WilsonGregory commented 2 years ago

> I was curious whether we have run some of these benchmarks already, because I wanted to give a small warning about pytorch(-lightning) versions. I remember that the pytorch versions need to be somewhat recent to get the best results from MarkerMap. I can't remember exactly why, but I think it was related to some default initialization. Our methods performed on par with or slightly better than LassoNet on older versions and had a decent advantage on the more recent ones. The results I had uploaded to GitHub before were run with the more recent versions. I am not sure if I updated the arXiv paper with the latest results.

Ooh, thanks for the heads up. I have been using the latest pytorch-lightning, 1.5.10. I have had some trouble getting results that look similar to the paper. Here are some benchmarks that I ran:

[two screenshots: benchmark results]

Both of these were with the Zeisel data set, and not a huge number of trials (<7). I need to run on Paul and see if it is similar.

Also, this one had Smash RF doing much better than in the paper, but again it is only on 3 runs.

[screenshot: benchmark results with Smash RF]

This is the plot that was in the paper that I was comparing against:

[screenshot: plot from the paper]

I haven't tuned any of the hyperparameters; I believe I have just been using what was in the Zeisel notebook. So not using the correct hyperparameters could also be a factor.

beelze-b commented 2 years ago

I haven't changed the hyperparameters much.

Another factor might be the pytorch-lightning version, because of how it sets the learning rate automatically. If we really need to, we can load and evaluate the saved model weights and train/test splits on Colab to confirm the previous results; I don't know how useful that would be. We do need to be able to replicate the results on all machines, though.
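If the automatic learning-rate behaviour is the culprit, one option is to pin the learning rate explicitly instead of relying on the tuner. A minimal sketch with a hypothetical LightningModule (generic pytorch-lightning, not MarkerMap's actual training code):

```python
# Sketch: fix the learning rate by hand so it no longer depends on
# pytorch-lightning's auto_lr_find behaviour, which can change across versions.
import torch
import pytorch_lightning as pl

class TinyClassifier(pl.LightningModule):  # hypothetical stand-in model
    def __init__(self, in_dim=50, n_classes=7, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = torch.nn.Linear(in_dim, n_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # Hard-coding the lr removes one source of cross-version variation.
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# trainer = pl.Trainer(max_epochs=10, auto_lr_find=False)  # keep the tuner off
# trainer.fit(TinyClassifier(), train_dataloader)
```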

Zeisel was always the dataset where LassoNet often beat us, so I am not surprised at those results. If MarkerMap doesn't come out ahead on the MouseBrain and Cite-Seq datasets, then I will be a lot more alarmed.

beelze-b commented 2 years ago

I might try running supervised MarkerMap on Zeisel overnight on my computer and see what results I get for k = 50 markers.

I used to get lower misclassification values before, so let me see whether the irreproducibility is a machine/version issue.

WilsonGregory commented 2 years ago

> Zeisel was always the dataset where LassoNet often beat us, so I am not surprised at those results. If MarkerMap doesn't come out ahead on the MouseBrain and Cite-Seq datasets, then I will be a lot more alarmed.

Oh okay, that is good to hear. I will run it on some of the other data sets.

beelze-b commented 2 years ago

I think I have good news.

I just made a new branch off main right before you merged this pull request and ran just supervised MarkerMap on Zeisel.

I got the lower misclassification rates reported before.

Here is a screenshot.

[screenshot: supervised MarkerMap misclassification results on Zeisel]

Everything else is commented out because I didn't run those models locally. I think the issue might be versions.

I ran this with the following torch and lightning versions.

[screenshot: installed torch and pytorch-lightning versions]

I am quite pleased that I was able to replicate the better Zeisel results for supervised MarkerMap, which means maybe we can get them replicated on your end too. Not gonna lie, I was kind of panicking that my previous results were a fluke! I think the discrepancy is coming from pytorch-lightning. It is a bit unfortunate if the quality of our results really does depend on software versions, but that isn't uncommon.

I actually never ran anything on this machine before so that makes me even less panicked because it is fresh!

Details on branch:

The branch is called test_out_jit, but I am sure the results above would be the same if I ran it off the latest main since we didn't change any functionality. You can run a diff between the latest main and the test_out_jit branch for peace of mind.

beelze-b commented 2 years ago

Also, since you are reading this: not sure if you saw this paper, but it helped me out when I was learning about the Gumbel-max trick!

https://arxiv.org/abs/2110.01515
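As a toy illustration of the idea (generic PyTorch, not how MarkerMap implements it), the trick amounts to adding Gumbel noise to the logits and taking either an argmax or a temperature-controlled softmax:

```python
# Gumbel-max trick and its softmax relaxation, on a 3-way toy distribution.
import torch

logits = torch.tensor([1.0, 2.0, 0.5])

# Gumbel-max: argmax of (logits + Gumbel noise) is distributed as softmax(logits).
gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
hard_sample = torch.argmax(logits + gumbel)

# Gumbel-softmax: replace the argmax with a softmax at temperature tau so the
# sample stays differentiable (this is what makes a selection layer trainable).
tau = 0.5
soft_sample = torch.softmax((logits + gumbel) / tau, dim=-1)

# PyTorch also ships this as torch.nn.functional.gumbel_softmax(logits, tau=tau).
print(hard_sample, soft_sample)
```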

WilsonGregory commented 2 years ago

@beelze-b I am seeing results similar to yours for this; this was over 10 runs on Zeisel:

[screenshot: results over 10 runs on Zeisel]

And here are the versions of pytorch:

[screenshot: installed pytorch versions]

I think these results are similar enough that the difference is within the randomness of the splits. I am going to rerun the benchmark models to see how MarkerMap compares to LassoNet and Smash, but I think for k=50 I will be able to reproduce your results.
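To sanity-check whether the remaining differences exceed split-to-split noise, I could average the misclassification rate over several random splits, something like the sketch below (generic scikit-learn with a stand-in classifier, not our actual benchmark code):

```python
# Sketch: mean/std of misclassification rate over several random train/test splits.
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in for any benchmarked model
from sklearn.model_selection import train_test_split

def misclass_rate_over_splits(X, y, n_runs=10):
    rates = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
        rates.append(1.0 - model.score(X_te, y_te))  # 1 - accuracy
    return np.mean(rates), np.std(rates)
```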