OpenBioML / protein-lm-scaling


Instantiate base model class in HF #2

Closed pascalnotin closed 10 months ago

pascalnotin commented 11 months ago

Subclass the GPT2 HF class: https://huggingface.co/docs/transformers/model_doc/gpt2

talkhanz commented 11 months ago

Hey @pascalnotin, this looks interesting. I can work on this! Can you expand on the requirements a little more?

pascalnotin commented 11 months ago

Great, thanks @talkhanz! The idea behind the first few issues posted on the task board is to build a simple end-to-end pipeline that goes from 1) pre-processing raw data, to 2) training a standard autoregressive (AR) transformer on it (this issue), to 3) evaluating on a downstream task.

The requirements for this issue (step 2) would thus be to:

A- Subclass the HF GPT2 class into a new class called, say, "APT" (Autoregressive Protein Transformer).
B- Adapt it to handle amino-acid sequences (e.g., the GPT2Config class, since we will operate with a different vocab).
C- Write a basic training script to train this model on the training data and evaluate on the test data (only loss / perplexity tracking should be good for now).
D- Persist intermediate (e.g., every 5k steps by default) and final checkpoints to disk.
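A minimal sketch of what A and B could look like with the Hugging Face API is below; the class names, amino-acid vocabulary, and hyperparameters are illustrative assumptions rather than the final design:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical amino-acid vocabulary: a few special tokens plus the 20 standard residues.
AA_VOCAB = ["<pad>", "<bos>", "<eos>", "<unk>"] + list("ACDEFGHIKLMNPQRSTVWY")


class APTConfig(GPT2Config):
    """GPT2 configuration adapted to amino-acid sequences (small vocabulary)."""
    model_type = "apt"

    def __init__(self, vocab_size=len(AA_VOCAB), n_positions=1024, **kwargs):
        super().__init__(vocab_size=vocab_size, n_positions=n_positions, **kwargs)


class APT(GPT2LMHeadModel):
    """Autoregressive Protein Transformer: a thin subclass of the HF GPT2 LM head model."""
    config_class = APTConfig


model = APT(APTConfig(n_embd=256, n_layer=6, n_head=8))
```

For C and D, the Hugging Face Trainer already covers loss tracking and periodic checkpointing. A sketch, assuming tokenized `train_dataset` / `test_dataset` objects already exist (with `labels` set to the input ids so the LM loss is computed):

```python
import math
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="apt_checkpoints",   # intermediate checkpoints are written here
    per_device_train_batch_size=8,
    max_steps=100_000,
    logging_steps=500,
    evaluation_strategy="steps",
    eval_steps=5_000,
    save_steps=5_000,               # persist a checkpoint every 5k steps (item D)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,    # placeholder: tokenized training split
    eval_dataset=test_dataset,      # placeholder: tokenized test split
)
trainer.train()
trainer.save_model("apt_final")     # final checkpoint

# Perplexity follows directly from the evaluation cross-entropy loss.
eval_metrics = trainer.evaluate()
print("perplexity:", math.exp(eval_metrics["eval_loss"]))
```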

The issue focusing on #3 will then take the trained model as input and evaluate its performance.

I had done something similar when developing Tranception, which may be helpful here (https://github.com/OATML-Markslab/Tranception/tree/main/tranception).

Let me know if any questions!

talkhanz commented 11 months ago

@pascalnotin thanks for getting in touch and sharing the cool details! I'll make an effort to have an update by our weekly call!

cmvcordova commented 11 months ago

Hey! How's this coming along? I'd love to help out.

talkhanz commented 10 months ago

> Hey! How's this coming along? I'd love to help out.

Yes, I'm in the middle of finishing a Colab notebook as a proof of concept, so a lending hand would be very handy (pun intended :P).

justin-barton commented 10 months ago

@talkhanz I'd be happy to help out as well if you'd like to share the Colab link.

talkhanz commented 10 months ago

Let me tidy it up a bit and I'll share it very soon!

talkhanz commented 10 months ago

Guys, you can find my Colab here.

Since I didn't know what the preprocessed dataset would look like, I tested my pipeline on UniRef50 alongside a random label vector. I'm only considering a very small subset of this dataset for now. I also had to use a tokenizer (the same one as Tranception, loaded from a file); it should be easy to swap in any other tokenizer we want (unless we want to preprocess the dataset in some other notebook). The dataset is being streamed using the datasets library (although there is an option to load from a file and convert the dataframe to the Hugging Face Dataset class), since the whole dataset is around 54 GB.
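For reference, a minimal sketch of that streaming setup; the Hub dataset id, tokenizer file path, and column name below are assumptions:

```python
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

# Stream UniRef50 rather than downloading the full ~54 GB dump.
# "agemagician/uniref50" is one Hub mirror; the exact dataset id is an assumption.
ds = load_dataset("agemagician/uniref50", split="train", streaming=True)
small_subset = ds.take(10_000)  # prototype on a small slice

# Load a tokenizer from a file (e.g. a Tranception-style vocabulary); the path is a placeholder.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

def tokenize(example):
    # The sequence column may be named differently depending on the dataset.
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = small_subset.map(tokenize)
```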

I think having a minimal preprocessed dataset would greatly help here because it will streamline how we organize our modelling notebook. I should've asked this earlier!

Also, the model is identical to the Tranception model for now (although I'm working on a simpler, from-scratch GPT2 model).

@pascalnotin, let me know your thoughts, particularly on whether we should keep the model similar to Tranception or whether a new model would be better.

I also welcome feedback (and questions) from the rest of the people here!

pascalnotin commented 10 months ago

Closing this issue as the features have all been instantiated in code. Thanks all!