karan842 opened 6 months ago
Hello!
I've narrowed the issue down to this line:
val_data = data['validation'].select(range(100, 30))
If I print that dataset, I get:
Dataset({
    features: ['sentence1', 'sentence2', 'score'],
    num_rows: 0
})
I would recommend updating your code to:
val_data = data['validation'].select(range(100, 130))
and then training works as expected!
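For reference, range(start, stop) is empty whenever start >= stop, which is why the printed dataset showed num_rows: 0:

>>> len(range(100, 30))
0
>>> len(range(100, 130))
30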
This works on Kaggle, but when I run it locally with:
accelerate launch --multi-gpu --num_processes=4 main.py
it says:
TypeError: SentenceTransformerTrainingArguments.__init__() got an unexpected keyword argument 'eval_strategy'
I tried evaluation_strategy instead and am now getting this error:
trainer.train()
  File "/data/projects/llm-env/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/data/projects/llm-env/lib/python3.10/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/data/projects/llm-env/lib/python3.10/site-packages/transformers/trainer.py", line 2577, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/data/projects/llm-env/lib/python3.10/site-packages/sentence_transformers/trainer.py", line 369, in evaluate
    return super().evaluate(eval_dataset, ignore_keys, metric_key_prefix)
  File "/data/projects/llm-env/lib/python3.10/site-packages/transformers/trainer.py", line 3365, in evaluate
    output = eval_loop(
  File "/data/projects/llm-env/lib/python3.10/site-packages/sentence_transformers/trainer.py", line 379, in evaluation_loop
    output = super().evaluation_loop(
  File "/data/projects/llm-env/lib/python3.10/site-packages/transformers/trainer.py", line 3554, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/data/projects/llm-env/lib/python3.10/site-packages/transformers/trainer.py", line 3728, in prediction_step
    return_loss = inputs.get("return_loss", None)
AttributeError: 'NoneType' object has no attribute 'get'
My dependencies are:
Name: accelerate Version: 0.29.2
Name: transformers Version: 4.39.3
Name: sentence-transformers Version: 3.0.0
The evaluation_strategy fix was indeed necessary, because older versions of transformers do not yet support the new name eval_strategy (they renamed it).
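If you need one script to run across both old and new transformers versions, a hedged sketch is to pick the keyword based on the installed version. The cutoff 4.41.0 is my assumption about when the rename landed; check your installed version's docs:

import transformers
from packaging import version
from sentence_transformers import SentenceTransformerTrainingArguments

# Assumption: `eval_strategy` replaced `evaluation_strategy` around transformers 4.41
use_new_name = version.parse(transformers.__version__) >= version.parse("4.41.0")
strategy_kwarg = "eval_strategy" if use_new_name else "evaluation_strategy"

args = SentenceTransformerTrainingArguments(
    output_dir="sbert-output-dir",
    **{strategy_kwarg: "steps"},
)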
On Kaggle, can you please print the dataset that you're feeding to eval_dataset
in the Trainer? I think it's empty.
Got it. I was using a very small data range while running on 4 GPUs. I increased the data size to train_data=200 and val_data=90, and now it works on multi-GPU.
Oh, it is indeed possible that using fewer samples than the number of GPUs causes this issue, even if the number of evaluation samples isn't strictly 0.
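A cheap sanity check, as a sketch reusing the data variable and the --num_processes=4 launch from above, is to assert that the eval split has at least one sample per process before training:

num_processes = 4  # must match --num_processes in the accelerate launch command
val_data = data["validation"].select(range(100, 130))
assert len(val_data) >= num_processes, (
    f"eval_dataset has only {len(val_data)} rows for {num_processes} GPUs; "
    "each GPU needs a non-empty shard"
)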
What Evaluator can we use for (anchor, positive)-type data?
My task is embedding similarity, but the embedding similarity evaluator needs label scores.
You could use InformationRetrievalEvaluator:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Assign an id to each anchor (query) and each positive (document)
queries = dict(enumerate(dataset["anchor"]))
corpus = dict(enumerate(dataset["positive"]))
# For query i, the only relevant document is the positive at the same index
relevant_docs = {idx: {idx} for idx in range(len(dataset))}

ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="...",
)
Here the "positive" is the relevant document and all other texts are seen as not relevant.
You can get roughly the same behaviour with the TranslationEvaluator:

from sentence_transformers.evaluation import TranslationEvaluator

evaluator = TranslationEvaluator(
    source_sentences=dataset["anchor"],
    target_sentences=dataset["positive"],
    name="...",
)
This computes what percentage of the time the anchor at index $i$ is most similar to the positive at index $i$ out of all positive texts. The bigger the dataset, the harder both of these tasks get.
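For intuition, here is a rough sketch of one direction of that metric, assuming model and dataset are defined as above (TranslationEvaluator itself also checks the reverse direction):

import numpy as np
from sentence_transformers import util

anchor_emb = model.encode(dataset["anchor"])
positive_emb = model.encode(dataset["positive"])

# cosine similarity of every anchor against every positive
sims = util.cos_sim(anchor_emb, positive_emb)

# fraction of anchors whose most similar positive sits at the same index
accuracy = (sims.argmax(dim=1).numpy() == np.arange(len(anchor_emb))).mean()
print(accuracy)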
Cool! Thanks.
Lastly, what is the best way to store the model locally: save_safetensors or save_only_model?
I just need the model locally, which I will load into another script for testing.
model.save_pretrained("local_path")
is the recommended way.
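For the testing script, loading the saved directory back is a one-liner ("local_path" here is whatever directory you saved to):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("local_path")
embeddings = model.encode(["example sentence"])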
Do we need to specify a DataLoader and collate_fn explicitly for lazy loading? And are these arguments doing the same thing?
per_device_train_batch_size=64, per_device_eval_batch_size=64,
I am training on huge data, where the number of records in train and validation is in the hundreds of millions. I have 4 GPUs with 15 GB each. I am new to training on large data in production. How can I run this on such a machine configuration?
Final Code:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
    models,
)

def main():
    # split_data and data_dir are defined elsewhere in my script
    data_files = split_data(data_dir)
    # print(data_files)
    data = load_dataset("parquet", data_dir=data_dir, data_files=data_files)
    train_data, val_data = data['train'], data['val']
    print(train_data, "\n\n")
    print(val_data, "\n\n")

    word_embedding_model = models.Transformer('sentence-transformers/all-MiniLM-L6-v2')
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

    loss = losses.MultipleNegativesRankingLoss(model)

    training_args = SentenceTransformerTrainingArguments(
        output_dir='sbert-output-dir',
        num_train_epochs=1000,
        seed=33,
        per_device_train_batch_size=64,
        per_device_eval_batch_size=64,
        learning_rate=2e-5,
        warmup_ratio=0.2,
        save_only_model=True,
        fp16=True,
        evaluation_strategy="steps",
        eval_steps=50,
        save_total_limit=50,
        load_best_model_at_end=True,
        metric_for_best_model='spearman_cosine',
        greater_is_better=True,
    )

    trainer = SentenceTransformerTrainer(
        model=model,
        args=training_args,
        train_dataset=train_data,
        eval_dataset=val_data,
        loss=loss,
    )
    trainer.train()

    model.save_pretrained('sbert-model')

if __name__ == '__main__':
    main()
I am trying to run the below code from @tomaarsen's HuggingFace blog on Sentence Transformers v3.
Code:
This code was working on the 3.0.0 pre-release version, but not on either sentence-transformers[train] or sentence-transformers[dev].
Error:
Thanks, Karan Shingde