ccdv-ai / convert_checkpoint_to_lsg

Efficient Attention for Long Sequence Processing

Training a converted model #10

Open shensmobile opened 9 months ago

shensmobile commented 9 months ago

I've been going back and forth between RoBERTa and Longformer for classification. My typical use case is quite uneven: most of my documents are around 300 tokens, but occasionally I get massive 1000-token inputs (probably a 70:30 split). Longformer is definitely superior to splitting/chunking/pooling the documents with RoBERTa for long documents, but I often find that RoBERTa is more accurate for shorter documents.

I'm interested in trying out LSG, but I'm curious about the training process. I like that with Longformer I can train on long documents instead of having to truncate or strip out the middle of documents to fit under the 512-token window, so the model learns from the full context of each document. With LSG, am I able to train after the conversion to take advantage of the longer context?

Edit: There is mention of training memory requirements in the main repo, but I just wanted to confirm that I can train the LSG-converted model just like I would the original RoBERTa model (albeit with longer context). If so, are there any best practices? For example, would it be better to start with the base RoBERTa model or with my fine-tuned RoBERTa model that was already trained on the classification task (using truncated inputs)? And what are some optimal local/sparse block and sparsity settings for text classification?

ccdv-ai commented 9 months ago

Hi @shensmobile, you can train the LSG model the same way as the other models.

Two ways to use it:

  1. Fine-tune the base model, then convert it for inference to process longer inputs (no truncation needed)
  2. Convert the base model and fine-tune it on longer sequences directly

The second way is better but requires more resources.

If your model is converted to 4096 tokens, you can fit sequences of up to 4096 tokens without problems. Removing sparse tokens improves training speed and reduces the memory footprint.

In practice, I recommend starting with local attention only if your inputs are not that long, e.g. (block_size=256, sparse_block_size=0) or (block_size=128, sparse_block_size=0). Note that if seq_length <= 2*block_size, full attention is used for that sequence.

For classification, the 'bos' sparsity type works well in practice for very long sequences, e.g. (block_size=128, sparse_block_size=128, sparsity_factor=8, sparsity_type="bos").
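
A minimal sketch of how these settings could be passed to the converter, assuming the sparsity options are accepted as keyword arguments in the same way as block_size ("roberta-base" is only a placeholder path):

from lsg_converter import LSGConverter

converter = LSGConverter(max_sequence_length=4096)

# Local attention only: a good starting point when most inputs are short
model, tokenizer = converter.convert_from_pretrained(
    "roberta-base", block_size=256, sparse_block_size=0
)

# 'bos' sparse attention for classification on very long sequences
model, tokenizer = converter.convert_from_pretrained(
    "roberta-base",
    block_size=128,
    sparse_block_size=128,
    sparsity_factor=8,
    sparsity_type="bos",
)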

shensmobile commented 9 months ago

Hi,

Thanks for the reply! I'll start with those parameters to begin with.

When I convert a model (with the pip package in Python) that's already been fine-tuned on a classification task, I get this error:

Some weights of LSGRobertaForSequenceClassification were not initialized from the model checkpoint at transformers/SapienBilling/SapienRoBERTa/Osler_Base and are newly initialized: ['roberta.embeddings.global_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Do I need to specify not to use global tokens? Or is this ignorable?

Also, when I directly convert my language model and attempt to train immediately, I get this error during the training cycle:

RuntimeError: The expanded size of the tensor (973) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [4, 973].  Tensor sizes: [1, 514]

Do I need to save the model and tokenizer first, then reload so I can specify trust_remote_code=True?

Sorry for these really basic questions, I appreciate your help.

ccdv-ai commented 9 months ago

The first warning is ignorable.

Should work out of the box with this code:


from lsg_converter import LSGConverter
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Model to convert
model_path = "myroberta_model"  # or whatever model
converter = LSGConverter(max_sequence_length=4096)

# Simple conversion (keeps the checkpoint's original architecture)
model, tokenizer = converter.convert_from_pretrained(model_path, block_size=128, sparse_block_size=0)

# If you need to change the architecture, e.g. MaskedLM to SequenceClassification
# (useful if you load a "roberta-base" model)
model, tokenizer = converter.convert_from_pretrained(model_path, block_size=128, sparse_block_size=0, architecture="RobertaForSequenceClassification")

# Some training logic
# <do some training here>

# Save after training
model.save_pretrained("my_lsg_model")
tokenizer.save_pretrained("my_lsg_model")

# Reload (trust_remote_code is required because LSG uses custom modeling code)
model = AutoModelForSequenceClassification.from_pretrained("my_lsg_model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("my_lsg_model", trust_remote_code=True)

If you have a problem, try converting, saving the model, and reloading it to make sure everything is fine. Don't forget to update transformers.

shensmobile commented 9 months ago

It was transformers. I was on 4.30.2. Updating to the latest version appears to have gotten it working. Now I get to tinker around with values to get it to fit on my GPU and not take half a day.

Thanks for the help! I'm curious to see how this stacks up against Longformer for my data. Not sure if it's valuable but I'll report back if there's significant improvements in real world applications!

shensmobile commented 8 months ago

@ccdv-ai early results have been good! I've run into some issues with memory limits using (block_size=256, sparse_block_size=128, sparsity_factor=8, sparsity_type="bos_pooling"). Given that most of my documents are under 512 tokens, only 20-30% are 512-1024, and maybe 1-2% are 1024+, does it make sense to use such a large sparse_block_size and sparsity_factor? Would I still be able to achieve good multiclass text classification results with a smaller sparse block size?

ccdv-ai commented 8 months ago

@shensmobile For a given token, the maximum context is equal to 3*block_size + 2*sparse_block_size*sparsity_factor.

It's better to use the same size for blocks and sparse blocks for efficiency reasons. Using local attention only (sparse_block_size=0) works well if your sequences are not too long; block_size=256 already gives each token up to 768 tokens of context.
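
To make that formula concrete, a quick arithmetic check for the configurations discussed in this thread (plain Python, nothing LSG-specific):

# Maximum context per token: 3*block_size + 2*sparse_block_size*sparsity_factor
def max_context(block_size, sparse_block_size, sparsity_factor):
    return 3 * block_size + 2 * sparse_block_size * sparsity_factor

print(max_context(256, 0, 0))    # 768  (local attention only)
print(max_context(128, 128, 8))  # 2432 (the 'bos' settings suggested earlier)
print(max_context(128, 128, 4))  # 1408 (the 128/128/4 setup tried below)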

You can also remove dropout on the attention to reduce memory even more; check your model config to get the name of the parameter.
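
For a RoBERTa-derived checkpoint, the attention dropout parameter is usually named attention_probs_dropout_prob; a small sketch of zeroing it when loading the converted model (confirm the exact name in your own config, as noted above):

from transformers import AutoConfig, AutoModelForSequenceClassification

# Load the converted model's config and disable attention dropout
# (attention_probs_dropout_prob is the usual RoBERTa-style name; check your config.json)
config = AutoConfig.from_pretrained("my_lsg_model", trust_remote_code=True)
config.attention_probs_dropout_prob = 0.0

model = AutoModelForSequenceClassification.from_pretrained(
    "my_lsg_model", config=config, trust_remote_code=True
)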

shensmobile commented 8 months ago

Thanks! I appreciate the advice. I'm going to experiment with 256/0/0 and with 128/128/8, with the added context that matching block sizes is better for efficiency.

I'm currently re-implementing gradient accumulation and the BnB 8-bit Adam optimizer. I had issues implementing gradient checkpointing. Would removing dropout on attention be a better first step than either of the aforementioned options?

Edit: The 8-bit Adam optimizer actually had the opposite impact; it took up more memory instead of saving it. I opted to just use gradient accumulation. Removing dropout on attention did save me some memory, but not enough to increase my batch size. I could increase my gradient accumulation, but training speed did not notably improve. Currently training a 256/0/0 model; I may revisit dropout/gradient accumulation when I try training the 128/128/8 model.

Out of curiosity, if I wanted to try 128/128/4, I should switch to norm or pooling, right? The docs say the best sparsity type is task dependent; would pooling still be the best for text classification?

ccdv-ai commented 8 months ago

You can also try using fp16 instead of fp32. Gradient accumulation is fine. Changing the optimizer can reduce memory; SGD is lighter than Adam but convergence is slower. If you are on a multi-GPU setup, you can try using DeepSpeed stage 1 or 2 to offload gradients and optimizer states. You can also try PEFT, which will significantly reduce the memory usage of the optimizer.
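
If the training loop is the Hugging Face Trainer (an assumption; the thread doesn't say which loop is used), these options map roughly to the following arguments (values are placeholders):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="lsg_classifier",
    fp16=True,                      # fp16 instead of fp32
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # gradient accumulation
    optim="adafactor",              # lighter optimizer than AdamW
    # deepspeed="ds_config_stage2.json",  # optional: DeepSpeed stage 1/2 on multi-GPU setups
)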

The best hyperparameter choice is very task/data specific. If you are short on memory just remove sparse connections.

shensmobile commented 8 months ago

Currently using fp16 and gradient accumulation, and I went back to Adam since 8-bit Adam didn't seem to be saving any memory, but I could try SGD/Adafactor as well. Still barely fitting a batch size of 2 with gradient accumulation of 4 on my 4090.

Removing sparse connections helped, but I do like the idea of additional context being available. I'm testing 256/0/0 and 128/128/4 to see which would be best for my application. Just takes a while to train :)

Edit: Actually, I'm training 128/128/4 now and it takes up less memory than 256/0/0 did.

Edit 2: Huzzah! Training on 128/128/4/Pooling was excellent. My best results yet. I was able to eke out an additional 3% F1 score (91% to 94%) against my test set over both RoBERTa with chunking and Longformer. Thanks for all of the help!