JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.
MIT License

Tutorial for pretraining RoBERTa with custom data #31

iambestfeeddddd closed this issue 1 year ago

iambestfeeddddd commented 1 year ago

Hmm, this may seem a bit excessive, but I'm confused and don't know how to preprocess the data and train a RoBERTa model. Could you provide a basic step-by-step tutorial? I'm also looking to implement a custom tokenizer for training. Do you have any suggestions? Thanks a lot.

JonasGeiping commented 1 year ago

Hi, it's hard to write out all the steps without knowing where you're stuck. Did you have a look at https://github.com/JonasGeiping/cramming#data-handling?

Basically, the steps for a custom dataset are as follows:

  1. Create a new .yaml file in config/data; you can copy one of the existing ones as a starting point.
  2. Fill out the sources list (the first field). You can either build a dataset from the sources already implemented (see config/data/sources) or add new sources yourself. The easiest sources to add are Hugging Face datasets (see https://github.com/JonasGeiping/cramming/blob/main/cramming/config/data/sources/bookcorpus.yaml for an example). A sketch of both files follows after this list.
  3. Fill out the remaining fields, choosing the normalization, tokenizer type, etc.
  4. Run preprocessing by starting a pretraining run. The run checks your base folder and, if the dataset cannot be found there, prepares it according to the config. Depending on your settings and dataset size, this can require large amounts of RAM, which you can limit by setting impl.threads and impl.max_raw_chunk_size to smaller values (see the example launch command below).
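For illustration, here is a minimal sketch of the two files from steps 1 and 2. The file names, the source name `my-hf-source`, and several field names are assumptions modeled loosely on the linked bookcorpus.yaml; check the actual files in config/data and config/data/sources for the exact schema before relying on any of them.

```yaml
# config/data/my-dataset.yaml  (hypothetical; copy a real config and adapt it)
name: my-dataset

defaults:
  - sources:
      - my-hf-source        # must match a file in config/data/sources

# Step 3: normalization and tokenizer choices. The field names below are
# illustrative assumptions; mirror whatever the existing configs use.
normalizer:
  force_lowercase: True
  strip_accents: True
tokenizer: WordPiece        # tokenizer type to be trained on the corpus
vocab_size: 32768
seq_length: 128
```

```yaml
# config/data/sources/my-hf-source.yaml  (hypothetical; modeled on
# bookcorpus.yaml, which wraps a Hugging Face dataset)
my-hf-source:
  provider: huggingface     # pull the raw text from the HF hub
  split: train              # which split of the HF dataset to use
```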
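Step 4 then just means launching pretraining as usual; the dataset is built on first use. A sketch, assuming the Hydra-style overrides shown in the README (the run name and the override values here are placeholders, not recommendations):

```bash
# The first run preprocesses the dataset selected by data=my-dataset.
# Lower impl.threads and impl.max_raw_chunk_size if preprocessing runs out of RAM.
python pretrain.py name=my_run data=my-dataset impl.threads=8 impl.max_raw_chunk_size=1000000
```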
JonasGeiping commented 1 year ago

Hope this helps. Closing this for now.