Pre-training data - Githubissues

alchemab / antiberta

Public repository describing training and testing of AntiBERTa.

Apache License 2.0

53 stars 13 forks

Pre-training data #6

Closed: brianloyal closed this issue 1 year ago

brianloyal commented 1 year ago

Hey team, great work on this project. I noticed that the pre-training data snippets were removed from the repository back in April. Are they available as a HF dataset? Or available someplace else (besides the git history)?

brianloyal commented 1 year ago

FYI I created this dataset in HuggingFace for now. Please let me know if this causes any issues. Thanks again!

https://huggingface.co/datasets/bloyal/antiberta-pretrain

ideasbyjin commented 1 year ago

Hey Brian, just an FYI that the sample dataset we used to have is really only a snippet of what the model was trained on. The code repo here was more for people to get a flavour of how they could train an antibody language model (of which there are now tons of examples!). Thanks for uploading that to HF.

Junseok0207 commented 6 months ago

@brianloyal Hi, I found the dataset you created incredibly valuable. However, as a beginner in this field, I'm curious about the preprocessing steps you applied to the data sourced from the OAS database. Could you please share more details?