baal-org / baal

Bayesian active learning library for research and industrial use cases.
https://baal.readthedocs.io
Apache License 2.0
854 stars 84 forks

Baal seems to ignore eval_batch_size causing gpu memory issues #280

Open hugocool opened 9 months ago

hugocool commented 9 months ago

Describe the bug When setting the batch_size to 2 in Baal, it appears to use a batch_size of 16 instead, which causes a CUDA out of memory error. Despite setting per_device_eval_batch_size and train_batch_size to 2 in TrainingArguments, the predict_on_dataset function seems to use a batch_size of 16. I am letting Baal sort 1e6 (1 million) examples, and when I run predict_on_dataset I see the following in the logs:

0%| | 0/62500 [00:00<?, ?it/s] 0%| | 0/62500 [00:01<?, ?it/s]

meaning it is using a batch_size of 16 (1,000,000 examples / 62,500 steps), instead of the specified 2. A batch size of 8 would also work (if I manually downsample the input dataframe to 8 inputs).

To Reproduce

    from datasets import Dataset
    from transformers import TrainingArguments

    from baal.bayesian.dropout import patch_module
    from baal.transformers_trainer_wrapper import BaalTransformersTrainer

    # `model` and `tokenized_X` come from the surrounding pipeline (not shown).
    model = patch_module(model)

    args = TrainingArguments(output_dir="/", per_device_eval_batch_size=2)
    args = args.set_dataloader(
        train_batch_size=2, eval_batch_size=2, auto_find_batch_size=False
    )

    trainer = BaalTransformersTrainer(
        model=model,
        args=args,
    )

    dataset = Dataset.from_pandas(tokenized_X)
    predictions = trainer.predict_on_dataset(dataset, iterations=30)

which gives:

"CUDA out of memory. Tried to allocate 1.41 GiB (GPU 0; 15.78 GiB total capacity; 14.49 GiB already allocated; 397.75 MiB free; 14.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

Expected behavior The predict_on_dataset function should respect the batch_size specified in TrainingArguments and not cause a CUDA out of memory error.

Version (please complete the following information):

dependencies

Additional context I am running this on AWS Batch on a p3 instance.

Dref360 commented 9 months ago

Hello,

Thank you for submitting the issue. I should be able to take a look over the weekend.

I'm a bit puzzled because we simply call self.get_eval_dataloader(dataset) which is managed by HuggingFace. I'll know more this weekend.
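
In the meantime, one quick way to see what batch size the eval dataloader actually yields (a rough sketch, assuming the trainer and dataset from your snippet) is something like:

# Assumes `trainer` and `dataset` from the reproduction snippet above.
dataloader = trainer.get_eval_dataloader(dataset)
batch = next(iter(dataloader))
print(len(dataloader), batch["input_ids"].shape)
# For 1e6 examples, 62500 batches would imply an effective batch size of 16.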

hugocool commented 9 months ago

I know! I dug into the code, and that's why I did the args.set_dataloader call. But apparently it's getting ignored, so I don't know how to troubleshoot this, or maybe there is some environment variable playing a role here.



Dref360 commented 9 months ago

I was thinking about it, and maybe it's because of the stacking we perform. For our HF implementation, we always perform MC-Dropout in a single pass, meaning that batch_size=2 results in a batch of 2 * ITERATIONS rows being fed to the model. You said that a manual batch_size of 8 is your maximum, so 2 * 30 = 60 is too much.

Our ModelWrapper implementation has a flag replicate_in_memory which avoids stacking, but we don't have it for HF yet.

It is fairly trivial to add this feature so I'll do that.
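
For reference, the stacking described above works roughly like this (a simplified sketch of the idea, not Baal's exact code):

import torch

batch_size, iterations = 2, 30
input_ids = torch.zeros(batch_size, 128, dtype=torch.long)  # dummy tokenized batch

# The batch is repeated `iterations` times and sent through the model in a single
# forward pass, so the model actually sees batch_size * iterations = 60 rows at once.
stacked = input_ids.repeat_interleave(iterations, dim=0)
print(stacked.shape)  # torch.Size([60, 128])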

Progress bar stuff

I just tested the progress bar problem and it seems to work. :thinking:

from datasets import load_dataset
from transformers import pipeline, TrainingArguments, DataCollatorWithPadding

from baal.transformers_trainer_wrapper import BaalTransformersTrainer

TEXT_COL = 'sentence'
ds = load_dataset('sst2')['test'].remove_columns('label')
pipe = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')
tokenizer = pipe.tokenizer
model = pipe.model

def preprocess_function(examples):
    return tokenizer(examples[TEXT_COL], truncation=True)

tokenized_ds = ds.map(preprocess_function, batched=True)

training_args = TrainingArguments(
    output_dir='/tmp',
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = BaalTransformersTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
print("Total examples", len(tokenized_ds))
print(
    f"Dataloader length={len(trainer.get_eval_dataloader(tokenized_ds))}, batch_size={training_args.per_device_eval_batch_size}")
trainer.predict_on_dataset(tokenized_ds, iterations=10)
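
If the batch size is respected here, the printed dataloader length should be about len(tokenized_ds) / 2, and the tqdm bar in predict_on_dataset should show that same number of steps.
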
Dref360 commented 9 months ago

I just opened #281 which should allow you to run your experiment.

If you can install Baal from source from this branch, you could update your code with:

trainer = BaalTransformersTrainer(
    model=model,
    replicate_in_memory=False,
    args=args,
)

and that should fix it.
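
With replicate_in_memory=False, each MC-Dropout iteration runs as a separate forward pass of batch_size rows instead of one stacked pass of batch_size * iterations rows, so peak GPU memory stays proportional to the eval batch size, at the cost of more forward passes. This mirrors how the flag behaves in the ModelWrapper implementation.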

In any case, I should be able to get the PR merged this week and will release a minor version along with it :)

hugocool commented 9 months ago

I'm sorry for any miscommunication; what I meant by manually setting the batch_size to 8 is the following:

    import numpy as np

    # `model`, `trainer`, `tokenized_X`, `batch_size` and `iterations` (= 30) are
    # defined earlier in the pipeline; NDArray is a shape annotation (e.g. nptyping)
    # documenting the output shape.
    predictions = np.empty((0, model.num_labels, iterations))

    for chunk in df_chunker(tokenized_X, batch_size=2):
        dataset = Dataset.from_pandas(chunk)
        _predictions: NDArray[
            (batch_size, model.num_labels, iterations), np.float32
        ] = trainer.predict_on_dataset(dataset, iterations=iterations)
        predictions = np.concatenate((predictions, _predictions), axis=0)

where

from typing import Generator

import pandas as pd


def df_chunker(
    df: pd.DataFrame, batch_size: int = 1000
) -> Generator[pd.DataFrame, None, None]:
    """
    Splits a pandas DataFrame into smaller chunks of a specified batch size.

    Args:
        df (pandas.DataFrame): The DataFrame to be split.
        batch_size (int): The number of rows in each chunk.

    Yields:
        pandas.DataFrame: A chunk of the original DataFrame with the specified number of rows.
    """
    for i in range(0, len(df), batch_size):
        yield df.iloc[i : i + batch_size]

So the iterations are still 30, and my max batch_size is 8, so the number of inputs it's loading into the model is 8 * 30. I'm basically forcing the predict function to only take 8 inputs at a time. However, when I don't force-chunk the batch_size, it seems to be predicting in much larger batches, which cause memory overflows. What is so weird about this bug is that it might not even be Baal related; it might just seem so because of the progress bar.

Anyway, I'll install Baal from #281 and see whether that removes the need for my forced chunking solution. Thanks!

parmidaatg commented 2 months ago

Hi @hugocool, wanted to check whether the above issue was resolved with the fix from #281?