ersilia-os / chempfn

Ensemble-based, size-agnostic wrapper for the TabPFN classifier
GNU General Public License v3.0

Test why predict is getting stuck #20

Open DhanshreeA opened 1 year ago

DhanshreeA commented 1 year ago

Copying over conversation from Slack, issue raised by @GemmaTuron:

At this moment, when I try to run the code with a small dataset (<2000 molecules) the training takes only a few seconds, but the prediction (clf.predict) gets stuck...

Investigate why the module gets stuck during predict, and reproduce the TDC results generated earlier (refer to the notebook).

DhanshreeA commented 1 year ago

Hi @GemmaTuron, I ran the experiments with the TDC CYP2C9 Veith benchmark dataset again and the system works. Here are a couple of observations that also address your concern about predict getting stuck:

  1. The previous version of the notebook contains experiments with different flavors of feature subsamplers. The logic has since changed to use ensembles of feature subsamplers instead of expecting the user to configure one. This is implemented as a nested loop, roughly:
for data_sample in data_ensembles:
    feat_ensembles = get_feat_ensembles(data_sample)
    for feat_sample in feat_ensembles:
        x, y = feat_sample
        tabpfn.fit(x, y)
        tabpfn.predict(x_test)

This currently leads to longer run times, especially with a high value of max_iters, which is the EnsembleTabPFN input that configures the number of data ensembles.

One possibility that @miquelduranfrigola and I had considered is to use some heuristic or a priori information to avoid using all feature ensembles, or to incorporate some sort of early stopping strategy. I'll get to that soon.

  2. I ran the model with three different values of max_iters: 100 (default), 50, and 10.

For now, you can use it by cloning the repo and doing a pip install directly from it. I'll update it on PyPI soon.
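To make the cost of the nested loop in point 1 concrete, here is a minimal, self-contained sketch. It is not the library's actual code: `StubClassifier` stands in for TabPFN, and the row sampling, the fixed-width feature chunks, and the majority vote are all assumptions made for illustration. The point it shows is that the number of fit/predict rounds grows as max_iters times the number of feature ensembles.

```python
import random


class StubClassifier:
    """Hypothetical stand-in for TabPFN: predicts the majority training label."""

    def fit(self, x, y):
        self.majority = max(set(y), key=y.count)
        return self

    def predict(self, x_test):
        return [self.majority] * len(x_test)


def get_data_ensembles(X, y, max_iters, sample_size, rng):
    """Yield max_iters random row subsamples (assumed sampling scheme)."""
    for _ in range(max_iters):
        idx = rng.sample(range(len(X)), min(sample_size, len(X)))
        yield [X[i] for i in idx], [y[i] for i in idx]


def get_feat_ensembles(n_features, chunk):
    """Yield column index ranges of at most `chunk` features (assumed scheme)."""
    for start in range(0, n_features, chunk):
        yield range(start, min(start + chunk, n_features))


def ensemble_predict(X, y, X_test, max_iters=100, sample_size=1000,
                     feat_chunk=100, seed=0):
    rng = random.Random(seed)
    votes = [0] * len(X_test)
    n_models = 0
    for xs, ys in get_data_ensembles(X, y, max_iters, sample_size, rng):
        for cols in get_feat_ensembles(len(X[0]), feat_chunk):
            clf = StubClassifier()
            clf.fit([[row[c] for c in cols] for row in xs], ys)
            preds = clf.predict([[row[c] for c in cols] for row in X_test])
            votes = [v + p for v, p in zip(votes, preds)]
            n_models += 1  # one fit/predict round per (data, feature) pair
    # Majority vote over all rounds (binary labels 0/1 assumed).
    return [1 if 2 * v >= n_models else 0 for v in votes], n_models


# 20 rows with 250 features give 3 feature chunks of width 100;
# with max_iters=10 that is 10 * 3 = 30 fit/predict rounds.
X = [[(i * j) % 7 for j in range(250)] for i in range(20)]
y = [i % 2 for i in range(20)]
preds, n_models = ensemble_predict(X, y, X[:5], max_iters=10)
```

Under these assumptions, lowering max_iters from 100 to 10 cuts the number of rounds tenfold, which is the lever discussed above.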

DhanshreeA commented 1 year ago

I'll use this issue to track any problems you run into with the latest updates to the library. If you are able to reproduce the results, I'll close this issue. @GemmaTuron

DhanshreeA commented 1 year ago

Hi @GemmaTuron, did you get a chance to test this?

GemmaTuron commented 1 year ago

Hi @DhanshreeA !

Sorry for the delayed response. I've followed your suggestion and installed the package directly from the repo, using a dataset of 10k and the current default parameters. I am also not sure which function I need to pass the max_iters parameter to. Can you add this to the README?

I am simply aiming at running this:

from ensemble_tabpfn import EnsembleTabPFN
clf = EnsembleTabPFN()
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

where X is a list of SMILES, each with a single associated activity label.
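To report fit and predict times separately, a small timing helper like the one below may be useful. This is only a sketch: `LazyStub` is a hypothetical stand-in that stores the training data in fit and defers the work to predict, so the block runs without the library installed. With the real package one would pass an EnsembleTabPFN instance instead. Whether max_iters is accepted as a constructor keyword is an assumption here and should be checked against the repo.

```python
import time


def time_fit_predict(clf, X_train, y_train, X_test):
    """Time clf.fit and clf.predict separately, in seconds."""
    t0 = time.perf_counter()
    clf.fit(X_train, y_train)
    t1 = time.perf_counter()
    preds = clf.predict(X_test)
    t2 = time.perf_counter()
    return {"fit_s": t1 - t0, "predict_s": t2 - t1, "n_preds": len(preds)}


class LazyStub:
    """Hypothetical stand-in: fit only stores data, predict does the work."""

    def fit(self, X, y):
        self._y = list(y)
        return self

    def predict(self, X):
        # The majority label is computed at predict time, not in fit.
        majority = max(set(self._y), key=self._y.count)
        return [majority for _ in X]


timings = time_fit_predict(LazyStub(), ["CCO", "c1ccccc1", "CCN"], [0, 1, 1], ["CCC"])
print(timings)

# With the real library (assumed usage, not verified here):
# clf = EnsembleTabPFN(max_iters=10)
# print(time_fit_predict(clf, X_train, y_train, X_test))
```

A fit that takes about a second while predict takes much longer would be consistent with a model that does most of its computation at prediction time, though that should be confirmed for this library.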

GemmaTuron commented 1 year ago

I mean, I can leave it running if you think that will help, but I am concerned that fitting is taking only 1 second.

GemmaTuron commented 4 months ago

@DhanshreeA I still get inconsistencies in the times needed to train and predict with ChemPFN.

Can you give me an update on what the expected times are, for example with a training set of 10K and a prediction set of 1K?