Open ganeshkrishnan1 opened 3 months ago
Do you use it for training or inference?
I used it for training. It looks like the script does use multiple GPUs but runs out of memory because the batch size is too high. I will close this ticket.
I ran this with a lower batch size and I can see the trainer never uses more than 1 GPU.
I used the example provided and also tried Accelerator, but both fail to use more than 1 GPU. Any suggestions?
Hi @ganeshkrishnan1, could you provide the training script?
Here is one example that runs successfully on multiple GPUs:
CUDA_VISIBLE_DEVICES=0,1 WANDB_MODE=disabled torchrun --nproc_per_node=2 --master_port=2345 train_cli.py \
--model_name_or_path mixedbread-ai/mxbai-embed-large-v1 \
--train_name_or_path ./snli_5k.jsonl --save_dir mxbai-snli-ckpts \
--w1 0. --w2 20.0 --w3 1.0 --angle_tau 20.0 --learning_rate 3e-6 --maxlen 64 \
--pooling_strategy cls \
--epochs 1 \
--batch_size 32 \
--logging_steps 100 \
--warmup_steps 200 \
--save_steps 1000 --seed 42 --gradient_accumulation_steps 2 --fp16 1 --torch_dtype 'float32'
train_cli.py is from: https://github.com/SeanLee97/AnglE/blob/main/angle_emb/train_cli.py
data format:
$ head -3 snli_5k.jsonl
{"text": "A person on a horse jumps over a broken down airplane.", "positive": "A person is outdoors, on a horse.", "negative": "A person is at a diner, ordering an omelette."}
{"text": "Children smiling and waving at camera", "positive": "There are children present", "negative": "The kids are frowning"}
{"text": "A boy is jumping on skateboard in the middle of a red bridge.", "positive": "The boy does a skateboarding trick.", "negative": "The boy skates down the sidewalk."}
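A quick way to check a file against the format shown above before launching a long run is sketched below. The helper names `validate_triplet_line` and `write_jsonl` are made up for illustration and are not part of angle_emb; the only assumption taken from the thread is that each line is a JSON object with string fields "text", "positive" and "negative".

```python
import json

# Field names taken from the snli_5k.jsonl sample above.
REQUIRED_KEYS = ("text", "positive", "negative")


def validate_triplet_line(line):
    """Parse one JSONL line and check that it has the three non-empty string fields."""
    record = json.loads(line)
    missing = [k for k in REQUIRED_KEYS if k not in record]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    for k in REQUIRED_KEYS:
        if not isinstance(record[k], str) or not record[k].strip():
            raise ValueError(f"field {k!r} must be a non-empty string")
    return record


def write_jsonl(records, path):
    """Write an iterable of dicts as one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

Running `validate_triplet_line` over every line of the training file should surface malformed rows early instead of partway through tokenization.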
This is my python code:
I experimented with Accelerator, then torch.distributed, and also added to(device).
I will try your method and see if it works with 4 GPUs.
import torch
from torch import optim
from datasets import load_dataset, Dataset, DatasetDict
from sentence_transformers import InputExample, SentenceTransformer, losses, models
from angle_emb import AnglE, AngleDataTokenizer
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
device = accelerator.device
train_json_file = './brand_search_term_con_new.json_9.json' # JSON file for training data
full_dataset = load_dataset('json', data_files=train_json_file, split='train')
desired_test_size = 5000
# Calculate the training set size
train_size = len(full_dataset) - desired_test_size
# Split the dataset into training and evaluation sets
split_datasets = full_dataset.train_test_split(test_size=desired_test_size, train_size=train_size)
dataset_dict = DatasetDict({
    'train': split_datasets['train'],
    'test': split_datasets['test']
})
# 1. load pretrained model
angle = AnglE.from_pretrained('SeanLee97/angle-bert-base-uncased-nli-en-v1', max_length=512, pooling_strategy='cls', device_map='auto').to(device)
# 3. transform data
train_ds = dataset_dict['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=16)
test_ds = dataset_dict['test'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=16)
# angle, train_ds, test_ds = accelerator.prepare(angle, train_ds, test_ds)
angle.to(device)
# 4. fit
angle.fit(
    train_ds=train_ds,
    valid_ds=test_ds,
    output_dir='trainedmodel/aihello-model',
    batch_size=8,
    epochs=2,
    learning_rate=2e-5,
    save_steps=5000,
    eval_steps=5000,
    warmup_steps=100,
    gradient_accumulation_steps=4,
    loss_kwargs={
        'w1': 1.0,
        'w2': 1.0,
        'w3': 1.0,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 1.0
    },
    fp16=True,
    logging_steps=100
)
corrcoef, accuracy = angle.evaluate(test_ds, device=angle.device)
print('corrcoef:', corrcoef)
The shell script worked with multiple GPUs, and I got the checkpoint as well.
The Python code didn't use multiple GPUs, though.
I haven't tried multi-GPU training from Python code directly; I have only used the multi-GPU support provided by the Transformers Trainer.
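One visible difference between the two runs is the launcher: torchrun starts one process per GPU and sets the WORLD_SIZE and LOCAL_RANK environment variables in each of them, while a script started with plain `python` does not get them. The sketch below only makes that launch mode visible from inside a script. The function name `distributed_launch_info` is made up for illustration, and this check does not by itself explain the single-GPU behaviour reported above.

```python
import os


def distributed_launch_info(env=None):
    """Report whether the process appears to have been started by a launcher such as torchrun.

    Reads WORLD_SIZE and LOCAL_RANK, which torchrun sets for every worker process.
    Defaults (1 and -1) are used when the variables are absent.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_rank = int(env.get("LOCAL_RANK", "-1"))
    return {
        "world_size": world_size,
        "local_rank": local_rank,
        "is_distributed": world_size > 1,
    }
```

Printing `distributed_launch_info()` at the top of the training script, once under `python script.py` and once under `torchrun --nproc_per_node=2 script.py`, should show whether more than one worker process was actually started.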
BTW, here are some tips to improve the model:
1) If your dataset is FormatA: {'text1': "", "text2": "", "label": float or int}, it is better to slightly increase the weight w1.
2) If your dataset is FormatB: {'text': "", "positive": "", "negative": ""}, the suggested parameters are w1=0, w2=20, w3=1.0, angle_tau=20.0.
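The two tips can be turned into a small lookup, sketched below. `detect_format` and `suggested_loss_kwargs` are hypothetical helpers, not angle_emb functions. The FormatB numbers are the ones quoted in the tip; no concrete value was given for FormatA, so the sketch returns nothing for it rather than guessing.

```python
# Values quoted in tip 2) for FormatB. Tip 1) gives no numbers for FormatA.
FORMAT_B_SUGGESTED = {"w1": 0.0, "w2": 20.0, "w3": 1.0, "angle_tau": 20.0}


def detect_format(record):
    """Classify one training record by its keys as 'A', 'B' or 'unknown'."""
    keys = set(record)
    if {"text1", "text2", "label"} <= keys:
        return "A"
    if {"text", "positive", "negative"} <= keys:
        return "B"
    return "unknown"


def suggested_loss_kwargs(record):
    """Return the suggested loss weights for FormatB records, else None."""
    if detect_format(record) == "B":
        return dict(FORMAT_B_SUGGESTED)
    return None
```

The returned dictionary uses the same key names as the `loss_kwargs` argument in the Python script earlier in this thread, so it could be merged into that argument.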
Thanks for the tip about the w weights. I am using DataFormats.C.
E.g.
{"text": "Cool Spot 11x11 Pop-Up Instant Gazebo Tent with Mosquito Netting Outdoor Canopy Shelter with 121 Square Feet of Shade by COOS BAY (Beige)", "positive": "outdoor tent canopy"}
Should I use the same parameters as for B?
DataFormats.C is okay. However, DataFormats.B is recommended since it can improve performance more significantly.
BTW, here are the tips; we will push them in the next version.
Negatives are very hard to generate from unlabelled text for format B. We have "product title" -> "search term" as a positive pair, but there is no reliable way to generate a negative.
As you mentioned, the performance with format C when training on a sample was not as good as I wanted it to be. I am now running the trainer on our whole dataset of 200M records and will report back on performance (~15 days).
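For illustration only, one naive way to get from {"text", "positive"} rows to {"text", "positive", "negative"} rows is to borrow another row's positive at random. This is a swapped-in baseline, not something proposed in this thread or provided by angle_emb. The function name `add_random_negatives` is made up, and random negatives can be false negatives (for example, two tents sharing a search term), which is exactly the difficulty described above.

```python
import random


def add_random_negatives(pairs, seed=0):
    """Attach a randomly chosen positive from a different row as the 'negative' of each row.

    Assumption: a search term that differs from the row's own positive is treated
    as a negative. That is not guaranteed to be true for real product data.
    """
    rng = random.Random(seed)
    positives = [row["positive"] for row in pairs]
    if len(set(positives)) < 2:
        raise ValueError("need at least two distinct positives to sample negatives")
    out = []
    for row in pairs:
        # Rejection sampling: redraw until the candidate differs from this row's positive.
        while True:
            candidate = positives[rng.randrange(len(positives))]
            if candidate != row["positive"]:
                break
        out.append({**row, "negative": candidate})
    return out
```

The input rows are left unchanged; each output row is a copy with one extra "negative" field.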
For such large datasets, it is better to specify a small learning_rate such as 1e-6 and to specify --fixed_teacher_name_or_path to alleviate information forgetting.
I don't mind catastrophic forgetting. I could even train from scratch with the amount of data we have. The learning rate is currently set to 3e-6. It took 8 hours for the dataset to load, so I think I will let this training run and then re-run with the smaller learning rate you mentioned.
Your models don't seem compatible with KeyBERT (https://github.com/MaartenGr/keyBERT), so that's one more challenge for me.
I found that KeyBERT works with sentence-transformers. Maybe you can add a feature to make it support angle_emb.
I will ask someone from our team to look into it. Right now it's easier for me to use this for generating vectors and to train a separate sentence-transformers model for generating keywords from documents: two different use cases.
BTW, can my team member reach out to you by email to get some help with adding angle_emb support to sentence-transformers?
Sure! thanks!
BTW, I am working on exporting to a sentence-transformers (ST) model so that the AnglE-trained model can be used in ST.
I am running out of memory on a Tesla T4. I have 4 of them, though, and I usually use Accelerator for multi-GPU setups. How can I use them for AnglE semantic similarity?