check existing kaggle models

kaichop commented 3 weeks ago

Some people have shared their code using XGBoost and random forest for the prediction. We can borrow their code for both data processing and prediction, so that we do not need to start from scratch. Compile the information in this issue.

This is a pinned example: https://www.kaggle.com/code/andrewdblevins/leash-tutorial-ecfps-and-random-forest. We can reproduce it to learn how to process the data with Parquet and how to use 42 as the random seed, so that the training/testing split stays consistent across different models in the future.
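To illustrate the point about fixing the seed, here is a minimal sketch using only the Python standard library. The helper name `seeded_split` and the 20% test fraction are assumptions made for illustration, not taken from the tutorial; scikit-learn's `train_test_split` accepts a `random_state` argument that serves the same purpose.

```python
import random


def seeded_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of `items` with a fixed seed and split it into train/test.

    `seeded_split` and `test_fraction` are hypothetical, for illustration only.
    """
    rng = random.Random(seed)  # private generator, global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled items form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train_a, test_a = seeded_split(range(10))
train_b, test_b = seeded_split(range(10))
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```

Because every model sees the same held-out rows, differences in validation scores can be attributed to the models rather than to the split.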

wangwpi commented 3 weeks ago

I'm going to check the pinned example (random forest) and the BERT fine-tuning model on Kaggle.

wangwpi commented 3 weeks ago

The pinned example (random forest) has been reproduced; the public score for that model is 0.263. The Jupyter notebook has been uploaded to models/Leash_Tutorial_test.ipynb in this repository.

kaichop commented 3 weeks ago

Is the score affected by the "LIMIT 30000" in the SQL code?


wangwpi commented 3 weeks ago

Yes, you are right: in the tutorial the model was trained on only 30000+30000 samples. I will try training on the whole training dataset and see how it performs.
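As a rough sketch of what the limit controls, the query can be built with the per-class limit as a parameter. The function name `build_sampling_query`, the file name `train.parquet`, and the `binds` label column are assumptions made for illustration; `parquet_scan` is a DuckDB table function. Passing `None` drops the `LIMIT` clause so that every row is used.

```python
def build_sampling_query(parquet_path, per_class_limit=None):
    """Return SQL that draws rows of each class from a Parquet file.

    Hypothetical helper: assumes a binary label column named `binds`.
    per_class_limit=None removes the LIMIT clause, i.e. selects every row.
    """
    if per_class_limit is None:
        limit = ""
    else:
        # Random order plus LIMIT gives a random sample of that size per class.
        limit = f" ORDER BY random() LIMIT {int(per_class_limit)}"
    parts = [
        f"(SELECT * FROM parquet_scan('{parquet_path}') WHERE binds = {label}{limit})"
        for label in (0, 1)
    ]
    return " UNION ALL ".join(parts)


print(build_sampling_query("train.parquet", 30000))

# Assuming duckdb is installed, the query could be run like this:
# import duckdb
# df = duckdb.connect().query(build_sampling_query("train.parquet", 30000)).df()
```

Note that equal per-class limits also balance the two classes; removing the limit means the model sees the dataset's natural class ratio, which may need separate handling.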


wangwpi commented 2 weeks ago

I have uploaded my notebook for BERT fine-tuning (using 60000 samples) and a current neural network model that uses all of the split data (230M training, 56M validation). The Morgan fingerprints for all of the split data are generated in chunks (500K each) and saved as NumPy array files.
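The chunking described above might look roughly like the following sketch. The helper names and the file naming scheme are hypothetical, and the fingerprint computation itself (for example with RDKit) and the write step (for example with `numpy.save`) are left as comments, since the notebook's actual code is not shown here.

```python
def chunk_ranges(n_rows, chunk_size=500_000):
    """Yield (chunk_index, start, stop) tuples covering range(n_rows) in order."""
    for index, start in enumerate(range(0, n_rows, chunk_size)):
        # The last chunk may be shorter than chunk_size.
        yield index, start, min(start + chunk_size, n_rows)


def chunk_filename(prefix, index):
    """Hypothetical naming scheme: zero-padded so files sort in chunk order."""
    return f"{prefix}_{index:05d}.npy"


for index, start, stop in chunk_ranges(1_200_000):
    # In a real run, compute fingerprints for rows[start:stop] here and
    # write the resulting array to chunk_filename("train_fp", index).
    print(chunk_filename("train_fp", index), start, stop)
```

With about 230M training rows and 500K rows per chunk, this scheme would produce about 460 files, each small enough to load into memory independently during training.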

WGLab / Project_Belka

check existing kaggle models #4