[Issue by iambestfeeddddd, closed 1 year ago]
Hi, it's hard to write out all the steps without knowing where you're stuck. Did you have a look at https://github.com/JonasGeiping/cramming#data-handling?
Basically the steps are as follows for a custom dataset:

1. Create a new dataset config in `config/data`; you can copy from one of the existing ones.
2. Reference existing data sources (from `config/data/sources`), or add new sources yourself. The easiest sources to add are huggingface datasets (see https://github.com/JonasGeiping/cramming/blob/main/cramming/config/data/sources/bookcorpus.yaml for example; there is also a quick check sketched below this comment).
3. If preprocessing is too demanding for your machine, set `impl.threads` to a smaller number and `impl.max_raw_chunk_size` to a smaller number.

Hope this helps. Closing this for now.
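For step 2, a quick way to confirm that a candidate huggingface dataset loads and yields text before pointing a source config at it (a minimal sketch, not part of the cramming pipeline; the dataset name is only an example):

```python
# Sanity-check a huggingface dataset before wiring it into a source config.
# "bookcorpus" is just an example; substitute your own dataset name.
from datasets import load_dataset

dataset = load_dataset("bookcorpus", split="train", streaming=True)  # streaming avoids a full download
print(next(iter(dataset)))  # peek at one raw record, e.g. {"text": "..."}
```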
Hmm, this may seem a bit excessive, but I'm still confused: I don't know how to preprocess the data and train a RoBERTa model. Could you write a basic step-by-step tutorial? I'm also looking to implement a custom tokenizer for training; do you have any suggestions? Thanks a lot.
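The thread closes without an answer on tokenizers, but here is a minimal sketch of training a RoBERTa-style byte-level BPE tokenizer with the standalone huggingface `tokenizers` library; the corpus file, vocabulary size, and output directory are placeholders, and how the result plugs into cramming's config is not covered here:

```python
# Hedged sketch: train a RoBERTa-style byte-level BPE tokenizer from plain text.
# "corpus.txt", the vocab size, and "my-tokenizer" are placeholders.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],  # one or more plain-text training files
    vocab_size=32768,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("my-tokenizer")  # writes vocab.json and merges.txt
```

If you use `transformers`, the saved directory can typically be loaded back with `RobertaTokenizerFast.from_pretrained("my-tokenizer")`.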