HUBioDataLab / SELFormer

SELFormer: Molecular Representation Learning via SELFIES Language Models

[Request] Move models, tokenizers, and datasets to huggingface #1

Closed osbm closed 1 year ago

osbm commented 1 year ago

I see that this repo is already using huggingface packages. It would be a lot more accessible for everyone if these models were hosted on huggingface. I believe an official huggingface organization page would be highly beneficial.

tuncadogan commented 1 year ago

Thank you very much for your suggestion. Actually, we intend to do this as soon as possible; we are just struggling to find the time for tasks like this right now.

osbm commented 1 year ago

I understand. Actually, I am interested in this project and have some spare time, so I can help you guys set up what you need on huggingface.

We can:

  1. Set up the HUBioDataLab organization page.
  2. Add @tuncadogan as an admin, and many others as contributors.
  3. Upload the models and datasets.
  4. Even set up some interactive demos using Streamlit or Gradio to showcase what we have.

tuncadogan commented 1 year ago

That would be great! What do you need from us right now?

osbm commented 1 year ago

Thanks for letting me do this. I have actually set up the huggingface account: https://huggingface.co/HUBioDataLab

First of all, I need you to create a huggingface account and send me your username so that I can grant you admin permissions.

I have already uploaded some models from the Google Drive links in the README: modelO, modelC, modelM. We can change the names of these models. We can add "SELFormer-" as a prefix to their names so that they are not confused with models from other projects in this organization.

All these models and datasets on huggingface are git repositories, and their README.md files are treated as model cards. Things we can add to the model pages include usage examples, license information, and citations.
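As a sketch of what generating such a model card could look like, here is a minimal example that renders a README.md with YAML front matter. All field values below (license identifier, tags, description) are illustrative placeholders, not the final metadata for these repos:

```python
# Minimal sketch of generating a Hub model card (README.md).
# The YAML front-matter fields and values below are illustrative
# placeholders, not the final metadata for the SELFormer repos.
CARD_TEMPLATE = """\
---
license: {license_id}
tags:
{tag_lines}
---

# {model_name}

{description}
"""

def build_model_card(model_name, license_id, tags, description):
    """Render a README.md-style model card with YAML front matter."""
    tag_lines = "\n".join(f"- {t}" for t in tags)
    return CARD_TEMPLATE.format(
        model_name=model_name,
        license_id=license_id,
        tag_lines=tag_lines,
        description=description,
    )

card = build_model_card(
    model_name="SELFormer",  # hypothetical final repo name
    license_id="gpl-3.0",    # assumed to match the GitHub license
    tags=["chemistry", "selfies", "fill-mask"],
    description="Molecular representation model pretrained on SELFIES strings.",
)
print(card)
```

Committing a README.md like this to the repo root is enough for the Hub to pick it up as the model card.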

You can even pull these 3 models right now:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Download the SELFIES tokenizer and pretrained masked-LM weights from the Hub
tokenizer = AutoTokenizer.from_pretrained("HUBioDataLab/modelO")
model = AutoModelForMaskedLM.from_pretrained("HUBioDataLab/modelO")
```

tuncadogan commented 1 year ago

Thank you very much!

I've created the account: tuncadogan

Changing the model names would be great, actually, since the current names differ from the naming we used in the article: ModelO should be SELFormer and ModelM should be SELFormer-Lite. We discarded ModelC, so it may be better to delete it (we were going to delete it from the GitHub repo as well).
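Since Hub model repos can be renamed in place, the moves above could be scripted; a minimal sketch, assuming `huggingface_hub`'s `HfApi.move_repo` (the actual call needs a write token, so it is left commented out here):

```python
# Sketch of renaming the Hub repos to match the article's naming.
# The move_repo call (commented out) assumes huggingface_hub and
# requires write access to the HUBioDataLab organization.
RENAMES = {
    "HUBioDataLab/modelO": "HUBioDataLab/SELFormer",
    "HUBioDataLab/modelM": "HUBioDataLab/SELFormer-Lite",
}

def plan_renames(renames):
    """Return sorted (from_id, to_id) pairs describing each repo move."""
    return sorted(renames.items())

for src, dst in plan_renames(RENAMES):
    print(f"{src} -> {dst}")
    # from huggingface_hub import HfApi
    # HfApi().move_repo(from_id=src, to_id=dst)  # needs a write token
```

Existing `from_pretrained("HUBioDataLab/modelO")` calls would need to be updated to the new repo ids after the move.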

For the rest, yes, that would be great. Would you like to take a look at the article, so that we can discuss this further (for example, our fine-tuned models for different molecular property prediction tasks)? We also include some of them in the GitHub repo. Link to article: https://arxiv.org/abs/2304.04662

BibTeX or citation information would be great (again, for the same pre-print article as above).

Regarding the license, can we use the same one as in GitHub?

Maybe we can continue our discussion on email or huggingface platform?

osbm commented 1 year ago

Great! I have added your account to the organization, changed the model names, and deleted modelC. I have also added citations to the SELFormer and SELFormer-Lite models.

Yes, I believe we can use the same license. I added the GPL license to the models, as well as the license text from the README file in this repository.

I have read a good chunk of the paper, and I think I understand how you trained those models. I am currently focused on the FreeSolv model: I want to complete its model card and get it ready for use in one line. After it's done, it will be easier to do the same for the other models, I think.

I also uploaded the datasets to huggingface. We can add a README file to each dataset as well.

OK, I am closing this issue with this comment and will send out an email to you.