Open · kaichop opened this issue 3 weeks ago
I'm going to check the pinned example (random forest) and the BERT fine-tuning model on Kaggle.
The pinned example (random forest) has been reproduced, and its public score is 0.263. The Jupyter notebook has been uploaded as models/Leash_Tutorial_test.ipynb in this repository.
Is it affected by the "LIMIT 30000" in the SQL code?
Yes, you are right: in the tutorial the model was trained on only 30000+30000 samples. I will try training on the whole training dataset and see how the performance changes.
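The effect of that LIMIT clause can be illustrated with a minimal, self-contained sketch. This uses an in-memory sqlite3 toy table with hypothetical column names (`id`, `binds`) standing in for the competition's parquet-backed training data; the point is only how `LIMIT` caps the rows a model ever sees:

```python
import sqlite3

# In-memory toy database standing in for the competition's training table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (id INTEGER, binds INTEGER)")
conn.executemany(
    "INSERT INTO train VALUES (?, ?)",
    [(i, i % 2) for i in range(100_000)],
)

# With LIMIT, only the first 30000 matching rows reach the model ...
limited = conn.execute(
    "SELECT * FROM train WHERE binds = 0 LIMIT 30000"
).fetchall()

# ... whereas dropping LIMIT returns every matching row.
full = conn.execute("SELECT * FROM train WHERE binds = 0").fetchall()

print(len(limited), len(full))  # → 30000 50000
```

With 230M real training rows, a 30000-row cap per class is a tiny (and not random) subsample, which would explain a weaker score than training on the full data.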
I have uploaded my notebook for BERT fine-tuning (using 60000 samples), along with a current neural network model trained on all split data (230M training, 56M validation). The Morgan fingerprints for all split data are generated in chunks (500K rows per chunk) and saved as numpy array files.
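The chunked fingerprint generation described above can be sketched as follows. This is an assumption-laden illustration: `morgan_fp` is a random-bit placeholder (in practice RDKit's Morgan fingerprint would be used), the bit width and chunk size are shrunk from realistic values (2048 bits, 500K rows) so the example runs instantly, and the file naming is hypothetical:

```python
import numpy as np

N_BITS = 16       # placeholder; real ECFP vectors are typically 1024 or 2048 bits
CHUNK_SIZE = 500  # placeholder for the 500K rows per chunk used in the repo

def morgan_fp(smiles: str) -> np.ndarray:
    """Placeholder fingerprint: deterministic random bits per SMILES string.

    In the actual pipeline this would be an RDKit Morgan fingerprint.
    """
    rng = np.random.default_rng(abs(hash(smiles)) % (2**32))
    return rng.integers(0, 2, size=N_BITS, dtype=np.uint8)

def fingerprint_in_chunks(smiles_list, out_prefix="fp_chunk"):
    """Compute fingerprints chunk by chunk and save each chunk as a .npy file,
    so the full 230M-row matrix never has to fit in memory at once."""
    paths = []
    for start in range(0, len(smiles_list), CHUNK_SIZE):
        chunk = smiles_list[start:start + CHUNK_SIZE]
        fps = np.stack([morgan_fp(s) for s in chunk])
        path = f"{out_prefix}_{start // CHUNK_SIZE:04d}.npy"
        np.save(path, fps)
        paths.append(path)
    return paths

# 1200 toy SMILES strings → chunks of 500, 500, and 200 rows.
paths = fingerprint_in_chunks([f"C{'C' * (i % 5)}O" for i in range(1200)])
print(len(paths))  # → 3
```

Each chunk can then be loaded independently with `np.load` during training, which keeps peak memory bounded by the chunk size rather than the dataset size.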
Some people have shared their code using XGBoost and random forest for the prediction. We can borrow that code both for data processing and for prediction, so that we do not need to start from scratch. Compile the information in this issue.
This is a pinned example https://www.kaggle.com/code/andrewdblevins/leash-tutorial-ecfps-and-random-forest that we can reproduce to learn how to use parquet to process the data, and how to use 42 as the random seed to ensure consistent training/testing splits across different models in the future.
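The fixed-seed idea above can be sketched with the standard library alone (the Kaggle tutorial itself likely passes `random_state=42` to scikit-learn utilities; this stdlib version just demonstrates why a shared seed makes splits comparable across models):

```python
import random

def split_indices(n: int, test_frac: float = 0.2, seed: int = 42):
    """Shuffle row indices with a fixed seed, so every model trained on
    this data sees exactly the same train/test partition."""
    rng = random.Random(seed)  # local RNG; avoids touching global state
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(n * (1 - test_frac))
    return idx[:cut], idx[cut:]

# Two independent calls with the same seed produce identical splits.
train_a, test_a = split_indices(1000)
train_b, test_b = split_indices(1000)
print(train_a == train_b and test_a == test_b)  # → True
```

Because the split depends only on `n` and `seed`, any future model (random forest, XGBoost, BERT) evaluated with seed 42 is scored on the same held-out rows, making public scores directly comparable.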