Open hugocool opened 9 months ago
Hello,
Thank you for submitting the issue. I should be able to take a look over the weekend.
I'm a bit puzzled because we simply call self.get_eval_dataloader(dataset) which is managed by HuggingFace. I'll know more this weekend.
I know! I dug into the code, and that's why I did the args set_dataloader. But apparently it's getting ignored, so I don't know how to troubleshoot this. Maybe there is some environment variable playing a role here, I don't know.
Hugo Evers (in reply to Frédéric Branchaud-Charron, 20 Oct 2023 at 15:56 +0200)
I was thinking about it and maybe it's because of the stacking we perform.
For our HF implementation, we always perform MC-Dropout in a single pass, meaning that batch_size=2 results in a batch of 2 * ITERATIONS inputs being fed to the model. You said that a manual batch_size of 8 is your maximum, so 2 * 30 = 60 is too much.
Our ModelWrapper implementation has a flag, replicate_in_memory, which avoids stacking, but we don't have it for HF.
It is fairly trivial to add this feature so I'll do that.
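To make the arithmetic above concrete, here is a minimal sketch of how stacking changes the number of inputs in one forward pass. The helper `inputs_per_forward_pass` is hypothetical and is not part of Baal; it only mirrors the reasoning in this comment, assuming that stacking replicates every input once per MC-Dropout iteration.

```python
def inputs_per_forward_pass(batch_size: int, iterations: int,
                            replicate_in_memory: bool = True) -> int:
    """Hypothetical helper (not a Baal API): number of inputs the model
    sees in a single forward pass during MC-Dropout prediction."""
    if replicate_in_memory:
        # Stacking: each input is copied once per iteration, all in one pass.
        return batch_size * iterations
    # No stacking: the same batch is passed through the model
    # `iterations` times, one pass at a time.
    return batch_size


print(inputs_per_forward_pass(2, 30))                             # 60
print(inputs_per_forward_pass(2, 30, replicate_in_memory=False))  # 2
```

Without stacking, memory use per pass stays at the configured batch size, at the cost of running one forward pass per iteration.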
I just tested the progress bar problem and it seems to work. :thinking:
from datasets import load_dataset
from transformers import pipeline, TrainingArguments, DataCollatorWithPadding
from baal.transformers_trainer_wrapper import BaalTransformersTrainer

TEXT_COL = 'sentence'
ds = load_dataset('sst2')['test'].remove_columns('label')
pipe = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')
tokenizer = pipe.tokenizer
model = pipe.model


def preprocess_function(examples):
    return tokenizer(examples[TEXT_COL], truncation=True)


tokenized_ds = ds.map(preprocess_function, batched=True)
training_args = TrainingArguments(
    output_dir='/tmp',
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = BaalTransformersTrainer(model=model, args=training_args, tokenizer=tokenizer,
                                  data_collator=data_collator)
print("Total examples", len(tokenized_ds))
print(
    f"Dataloader length={len(trainer.get_eval_dataloader(tokenized_ds))}, "
    f"batch_size={training_args.per_device_eval_batch_size}")
trainer.predict_on_dataset(tokenized_ds, iterations=10)
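As a rough cross-check on the dataloader length printed by the script above, the expected number of batches can be computed by hand. This is only a sketch: `expected_num_batches` is a hypothetical helper, and it assumes no dropped last batch and an effective batch size equal to the per-device batch size times the number of devices (the `n_devices` parameter is an assumption for illustration; whether multiple devices play a role in this issue is not established in this thread).

```python
import math


def expected_num_batches(n_examples: int, per_device_batch_size: int,
                         n_devices: int = 1) -> int:
    """Hypothetical helper: batches needed to cover `n_examples`,
    assuming drop_last=False."""
    # Assumption: the effective batch size scales with the device count.
    effective_batch_size = per_device_batch_size * max(1, n_devices)
    return math.ceil(n_examples / effective_batch_size)


# With the 1e6 examples from the report:
print(expected_num_batches(1_000_000, 2))   # 500000
print(expected_num_batches(1_000_000, 16))  # 62500
```

If the progress bar total matches the second number rather than the first, the dataloader is not using the batch size that was configured.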
I just opened #281 which should allow you to run your experiment.
If you can install Baal from source from this branch, you could update your code with:
trainer = BaalTransformersTrainer(
    model=model,
    replicate_in_memory=False,
    args=args,
)
and that should fix it.
In any case, I should be able to get the PR merged this week and will release a minor version along with it :)
I'm sorry for any miscommunication. What I meant by manually setting the batch_size to 8 is the following:
predictions = np.empty((0, model.num_labels, iterations))
for chunk in df_chunker(tokenized_X, batch_size=2):
    dataset = Dataset.from_pandas(chunk)
    _predictions: NDArray[
        (batch_size, model.num_labels, iterations), np.float32
    ] = trainer.predict_on_dataset(dataset, iterations=iterations)
    predictions = np.concatenate((predictions, _predictions), axis=0)
where
from typing import Generator

import pandas as pd


def df_chunker(
    df: pd.DataFrame, batch_size: int = 1000
) -> Generator[pd.DataFrame, None, None]:
    """
    Splits a pandas DataFrame into smaller chunks of a specified batch size.

    Args:
        df (pandas.DataFrame): The DataFrame to be split.
        batch_size (int): The number of rows in each chunk.

    Yields:
        pandas.DataFrame: A chunk of the original DataFrame with the specified number of rows.
    """
    for i in range(0, len(df), batch_size):
        yield df.iloc[i : i + batch_size]
So the iterations are still 30 and my max batch_size is 8, so the number of inputs it's loading into the model is 8 * 30. I'm basically forcing the predict function to take only 8 inputs at a time. However, when I don't force the chunking, it seems to predict in much larger batches, which causes memory overflows. What is so weird about this bug is that it might not even be BAAL related; it might just seem so because of the progress bar.
Anyway, I'll install BAAL from #281 and see whether that removes the need for my forced chunking solution. Thanks!
Hi @hugocool, I wanted to check whether the issue above was resolved by the fix from #281?
Describe the bug
When setting the batch_size to 2 in BAAL, it appears to be using a batch_size of 16 instead. This is causing a CUDA out of memory error. Despite setting per_device_eval_batch_size and train_batch_size to 2 in TrainingArguments, the predict_on_dataset function seems to be using a batch_size of 16. I am letting BAAL sort 1e6 (1 million) examples. When I run the predict_on_dataset function, I see the following in the logs:
meaning it is using a batch_size of 16 instead of the specified 2. A batch size of 8 would also work (if I manually downsample the input dataframe to 8 inputs).
To Reproduce
which gives:
Expected behavior
The predict_on_dataset function should respect the batch_size specified in TrainingArguments and not cause a CUDA out of memory error.
Version (please complete the following information):
version : 1.9.1
description : Library to enable Bayesian active learning in your research or labeling work.
dependencies
Additional context
I am running this on AWS Batch on a p3 instance.