QuestionAnsweringInputBase is returning incorrect number of samples in batch

Lightning-Universe / lightning-flash

Your PyTorch AI Factory - Flash enables you to easily configure and run complex AI recipes for over 15 tasks across 7 data domains

https://lightning-flash.readthedocs.io

Apache License 2.0

1.74k stars 212 forks source link

QuestionAnsweringInputBase is returning incorrect number of samples in batch #1166

Closed mfojtak closed 2 years ago

mfojtak commented 2 years ago

🐛 Bug

In case of long QA context the Huggingface tokenizer divides tokenized output in chunks. Which is expected and correct. But load_sample function in QuestionAnsweringInputBase is returning collated sample which results in arbitrary sized batches ignoring batch_size specified. This may result in cuda OOM and other problems.

One sample per chunk is created instead of one sample per squad sample. It looks like the code tries to "utilize" all chunks even if they do not contain answer. Which might be useful but in this case IterableInput should be used. By default only one sample per squad sample should be returned and impossible answers ignored (unless not squad impossible answers).

To Reproduce

Steps to reproduce the behavior:

Code sample

datamodule = QuestionAnsweringData.from_squad_v2(
    train_file="/data/share/cuad/CUAD_v1/CUAD_v1.json",
    batch_size=2, max_source_length=4096, max_target_length=512 #2 samples per batch specified
)

model = QuestionAnsweringTask(backbone="google/bigbird-base-trivia-itc", 
                max_answer_length=512)

trainer = Trainer(max_epochs=3, gpus=1)
trainer.fit(model, datamodule=datamodule) #this crashes because arbitrary sized batches are returned

Here 2 samples per batch is requested. If sample context size > 4096 then multiple chunks are returned.

e.g. if first context size is 5000 and second context size is 3000 then 3 samples will be yielded from QuestionAnsweringInputBase.

Expected behavior

Correct number of samples in batch is returned.

Environment

PyTorch Version (e.g., 1.0): 1.10.2
OS (e.g., Linux): Linux
How you installed PyTorch (conda, pip, source): pip
Build command you used (if compiling from source):
Python version: 3.9
CUDA/cuDNN version: 11.2
GPU models and configuration: Tesla V100
Any other relevant information:

Additional context

Possible solutions: QuestionAnsweringInputBase should be based on IterableInput as number of samples is not known, or completely new iterable version is implemented separately.

Or "classic" Input would remain but one sample per squad sample must be returned.

ethanwharris commented 2 years ago

Hey @mfojtak Thanks for reporting this! Great analysis of the issue, I think having an iterable input for question answering would be a great way forward. If this something you'd like to help out with? There may be some hoops we need to jump through to get this to work with our API but we could help there :smiley:

mfojtak commented 2 years ago

Hi @ethanwharris Thanks for swift answer. I have already implemented the iterable version and tested it. It would be better though if two modes were supported - one with sample per squad and one with sample per chunk. I am also testing if impossible answers created by multiple chunks would benefit for model performance.

There is also some refactoring of base classes required for this to work. It might be better to use Python generator syntax instead of load_data and load_sample approach. It would be more intuitive and authoring of new datamodules easier.

Can I just create a pull request for this?

ethanwharris commented 2 years ago

@mfojtak Yes definitely happy to have a PR with it :smiley: Note that we have recently changed our text tasks to apply the tokenization in the collate function instead of the load_sample, but I think we could change it back for question answering in order to support the iterable approach. Could you open a PR with what you have? Then I can help with updating to our current API :smiley:

mfojtak commented 2 years ago

@ethanwharris let me implement it in the latest and greatest API. For this - could you please give me some clarity on how the new API works. In my understanding:

Input is now responsible only to load samples with no feature transformations (e.g. tokenization). InputTransform should be used instead to transform samples into features understood by the model. There is also collate functionality in InputTransform.

BUT - I noticed you created TransformersCollate callable class. How does this class fit into the picture?

ethanwharris commented 2 years ago

@mfojtak Sure :smiley: The main thing we tried to resolve is that the model should be responsible for the tokenization. It used to be (a few versions ago now) that you had to provide the same backbone argument to the datamodule and the task. We later added a mechanism for the datamodule and task to share state which allowed the model to set a tokenizer state that could then be used by the Input.load_sample. The main issue with the state mechanism was that it connected too many things together and it was unclear how different objects modified it, so we've removed it. We also have a way that models can set a collate function (this makes sense e.g. for object detection where different models may need the data collated in different ways). So the QA task currently works by creating a TextQuestionAnsweringCollate, which handles the tokenization, in the task init here: https://github.com/PyTorchLightning/lightning-flash/blob/4c624823d7992aa454b9f064d2714cae9b4a8893/flash/text/question_answering/model.py#L131

The problem with our current approach is that you wouldn't be able to implement the iterable tokenization, which I think would be a great feature to have. This is the approach I think we could take:

give the Input objects a reference to the trainer (the same way PL gives a trainer reference to the lightning module and datamodule, I think we could attach the trainer in the _*_dataloader hooks of the datamodule)
revert the QA input to apply the tokenization in the load_sample but getting the tokenizer from the model via self.trainer.lightning_module
implement the iterable variant (I guess this will be more or less identical to what you have already)

Hope that helps :smiley: sorry for the long message, let me know if you have any thoughts / comments

mfojtak commented 2 years ago

@ethanwharris Thanks for clarification. I fully understand your points and while implementing iterable version I was facing the exact issues you pointed out.

I agree the "get_state" approach is complicated and self.trainer is better.

The question is where tokenization should be happening. My suggestion is to put feature calculations in separate component that could be either attached to Datamodule or to model itself. It depends on the task at hand where it will be placed.

It looks like InputTransform is a good candidate for this component. But you did not mention it and I can see in the code that it is not used too often. Is Input/Output Transform API planned to be used in the future?

Benefits of attaching transform to the model: Model specific information and components could be accessed directly from the model (e.g. tokenizer). More compact models could be implemented that do full end-to-end job. Compact ONNX and torchscript could be generated (e.g. doing tokenization as well). Resulting in more robust and convenient serving.

The only question is about the Transform API - is it the right one future proof approach? Also - why you haven't implemented TextQuestionAnsweringCollate as Transform?

ethanwharris commented 2 years ago

The InputTransform is primarilly intended as a way for us to provide some default augmentations / transforms and to allow users to override them. It creates a bit of an issue where there are transforms that are required by the model in order to work because they don't stack in that you can only have one InputTransform operating at a time, so you could add a transform that augments text somehow but accidentally turn off tokenization. Usually the collate function cannot be changed without breaking the model, so we allow the model to dictate the collate function in those cases. The InputTransform is owned by the Input so we could allow both to have a reference to the trainer if needed.

I think it could make a lot of sense for the InputTransform to apply the tokenization as that could be something users want to override. It could also then be included in the serving pipeline as you descirbe.

How would this work with the iterable QA input? Can the contexts, questions, etc. be split into fixed sized chunks without also tokenizing them or would we need more of a special case there?

mfojtak commented 2 years ago

The situation is getting a little more complicated. First of all - there is a bug in SQUAD parsing as it doesn't count with more answers per question. But I can fix this easily.

However, it uncovered more design problems, bugs and confusions. The code bellow crashes because train Input class has some kind of logic which adds default_collate automatically. This function is crashing in scenario when samples have collections of different sizes. Input class should not be doing anything with the samples unless not explicitly specified. It should just yield samples as is and apply transform if specified.

datamodule = QuestionAnsweringData.from_squad_v2( #no InputTransofrm or collate specified, not connected with trainer or model
    train_file="/data/share/cuad/CUAD_v1/CUAD_v1.json",
    batch_size=2
)

dl = DataLoader(ds, batch_size=2)
for sample in dl:
    print(sample) #crash here because of default_collate

My plan now is to implement InputTransform for QA and fix SQUAD parsing bug and feature extraction bug.

However, it would be good to start decoupling concepts. Currently everything is stitched together in not so transparent way and it is not easy to understand behavior of the API. Also many things are happening implicitly which hinders implementation and debugging.

ethanwharris commented 2 years ago

The plan sounds good and agree with the decoupling :smiley:

Couple of comments:

in your example the from_squad_v2 has train_transform: INPUT_TRANSFORM_TYPE = InputTransform as a default, if you pass train_transform=None then that should disable the default_collate that is being added
When you create a DataLoader that does the collation right? So the default_collate is probably being added there. If you just want to access the uncollated samples you can do e.g.:
```
datamodule.train_dataset[0]
```

The only difference currently between using the datamodule outside of a trainer etc. is that the model needs to override the collate in order to own the tokenization. One option there is that we could change the default transform for QA to just return the list of samples from the collate.

karthikrangasai commented 2 years ago

Hello all,

The initial version before the data pipeline overhaul was using default_data_collator from the huggingface/transformers library directly to collate the pytorch tensors that were being output.

This has been changed I guess and I also overlooked this while going through the code.

Going through the conversation above, I agree that we could put the tokenization process in the InputTransform class but the question is would this be done :

per_batch or
per_sample or
like HF does (over the entire dataset at once, as it caches the tokenized dataset and reuses it when we train again)

In the previous version of the API, it would tokenize the entire datatset when creating the datamodule for the first time and then further calls to training would use load the cached dataset into the dataloader if I remember correctly.

ethanwharris commented 2 years ago

Hey @karthikrangasai Yes that's a good point, I had forgotten about it but one option is we could add a hook to the InputTransform that would allow for processing the whole data set. This could be quite useful for static preprocessing like tokenization or e.g. creating spectrograms in audio tasks.

karthikrangasai commented 2 years ago

Should we also provide the usecase when users would want to further transform the already processed dataset ? I am not sure if such a case exists, but that would be additional InputTransforms I assume.

ethanwharris commented 2 years ago

I think we could have the following API: Inside the *_dataloader hook of the DataModule:

If the model is available then attach it to the input / transform
Apply the per_dataset_transform (or whatever we call it)

In the per_dataset_transform:

If the model is available then tokenize and prepare the data
If the model isn't available then just do nothing (maybe give a warning?)

In the collate:

Just a flexible collate that will collate the batch or just give the list of samples if it's not possible

Let me know your thoughts :smiley:

mfojtak commented 2 years ago

Hi all, see my comments bellow

Going through the conversation above, I agree that we could put the tokenization process in the InputTransform class but the question is would this be done :

per_batch

Per batch because in general a transform might do batch level optimizations. In fact - transormers tokenizer operates on batches.

n your example the from_squad_v2 has train_transform: INPUT_TRANSFORM_TYPE = InputTransform as a default, if you pass train_transform=None then that should disable the default_collate that is being added

The input transform cannot be turned off even by setting to None.

Basically, the API should follow lightning API in order to be simplified in my opinion.

Dataset should just yield samples. Should have no knowledge about model, loader or trainer. Loader does batching, shuffling etc. The format of the sample in batch should be preserved. Should have no knowledge about model, or trainer. Transform could be attached to dataset or loader to perform additional transformations. Could even do tokenization but in this case tokenizer should be externally instatiated and explicitly passed to the transform. Transform could be attached to model to do tokenization or feature extraction. If attached to model, transform could use model specific info to e.g. instatiate tokenizer.

Question answering is used as an example here. All the above applies in general to all tasks.

This is more less how it is designed in Lightning. Flash in my opinion should implement specific tasks only. There are parts where it feels like flash implements lightning in lightning.

Please share your opinions.

ethanwharris commented 2 years ago

Yes, @mfojtak I generally agree with all of your points.

The Dataset within flash is the Input object.
The InputTransform is owned by the Input (the same way you would normally pass a list of transforms e.g. to a torchvision dataset).
The dataloader is then responsible for sampling and collation (the InputTransform can overide collation, but IMO this could be removed to allow the per sample transforms to live entirely within the _getitem of the Input object).
The dataloader is also responsible for batch-level transforms that are applied after collation.
The datamodule is responsible for on_device transforms.

The above is all generally quite clean except for a few caveats:

collation is not really optional, so it doesn't really belong in the input transform, sometimes the model needs to control collation but that is the only time you would want to change it
there is a mechanism for per_sample transforms to happen on device that is quite convoluted (and a source of a lot of complexity in the code). We don't use it and I think it would be fine to say that on device transforms should be applied per batch.

Removing the above two would simplify the transforms significantly.

Tokenization

As for where the tokenization should happen, I don't think any transforms should be attached to the model. The main reason for this is that intensive transforms (like tokenization) should be executed in parralell within the dataloader workers (that is, in the getitem or injected into the collate function) for performance reasons.

Could even do tokenization but in this case tokenizer should be externally instatiated and explicitly passed to the transform.

Let's try to design the API here. We could have the following:

data_module = QuestionAnsweringData.from_*(
    transform_kwargs={"backbone": ...},
    ...
)

model = QuestionAnsweringTask(backbone=...)

We used to have something like that but didn't like it as you had to specify the backbone in two places. Alternatively we could do more explicit:

data_module = QuestionAnsweringData.from_*(
    transform_kwargs={"backbone": ...},
    train_transform=HuggingFaceQuestionAnsweringTokenizationTransform,
    ...
)

model = QuestionAnsweringTask(backbone=...)

Personally, I think that allowing the transform to have a reference to the trainer (and thus the model) is fine since the datamodule gets this reference already in PL. That would then allow for the following, which is my personal preference:

data_module = QuestionAnsweringData.from_*(
    ...
)

model = QuestionAnsweringTask(backbone=...)

Note that all of the above would allow for the tokenization to be overriden manually by providing a custom input transform to the from_* method.

I propose the following steps forward:

open seperate issues for the much needed simplification / cleaning of the input transform, possibly based on my two suggestions but happy to have other discussions there
decide on the API we want around tokenization and make the change
figure out how this ties in to an iterable dataset version of the QAInput. Specifically, curently we have to tokenize before we can know if one sample becomes many samples. To have an iterable dataset that works we need to know in the iteration how each question / context will be split, but we have to do this somehow without tokenizing it...

mfojtak commented 2 years ago

How about this:

tokenizer = ....
data_module = QuestionAnsweringData.from_*(
    transform=HuggingFaceQuestionAnsweringTokenizationTransform(tokenizer=tokenizer), #no kwargs
    ...
)

model = QuestionAnsweringTask(data_module=..., backbone=)

Be aware that is some (not uncommon) cases the tokenizer backbone might be different from model. E.g. you are adding extra tokens which requires instatiation of your own tokenizer. Or you are using different tokenization library.

or:

data_module = QuestionAnsweringData.from_*(
    ...
)

model = QuestionAnsweringTask(data_module=..., backbone=...)
               __int__(data_module, backbone):
                      if not data_module.transform and not self.input_transform: #or other smart heuristics to adapt any input type
                              tokenizer = Tokenizer(backbone)
                              self.input_transform = HuggingFaceQuestionAnsweringTokenizationTransform(tokenizer=tokenizer)

data_module = QuestionAnsweringData.from_*(
    ...
)
transform = HuggingFaceQuestionAnsweringTokenizationTransform(backbone=...)
model = QuestionAnsweringTask(data_module=..., backbone=..., input_transform=...)

It could get even smarter and model can propose transforms automatically based on data module output type information. Adapters can be registered to adapt arbitrary input into model. This is how many dataflow frameworks work (e.g. ffmpeg). In above code snippets - all objects live independently. No object is setting other object's properties. And it feels more pythonic and OOP. E.g. why separate parameters for class and args if we can do single_param=class(args).

karthikrangasai commented 2 years ago

I would prefer the API to look like:

data_module = QuestionAnsweringData.from_*(
    transform_kwargs={"backbone": ...},
    train_transform=HuggingFaceQuestionAnsweringTokenizationTransform,
    ...
)

model = QuestionAnsweringTask(backbone=...)

This would fit in well with something like a HuggingFace integration which could come for all kinds of operations and tasks.
Currently, the text tasks only use HuggingFace transformers but we shouldn't forget the roots of this domain i.e. RNN and other types of models which have a completely different kind of tokenization approach that might not depend on the internal Model state. (Just to keep the API consistent across all text tasks.) We could have further extensions like in the below format where the DataModule is responsible for loading in the dataset through the loader and then setting up the transforms for how the model would like its inputs, be it tokenization or word embeddings. The Task (aka model) only knows that it wants inputs in a certain format.
```
data_module = QuestionAnsweringData.from_*(
train_transform=GloveEmbeddingTransform,
transform_kwargs={"name": '6B', "dim":50},
...
)
```

rnn_model = RNN()

model = QuestionAnsweringTask(backbone=rnn_model)

ethanwharris commented 2 years ago

Yes, @mfojtak definitely agree with your points there. Personally I would be very happy to see this API for transforms:

data_module = DataModule.from_*(
    transform=MyCustomTransform(image-size=128, ...),
    ...
)

This would involve a slight change in how the transforms work underneath and then removing the *_transform arguments in favour of a single one (you can customize the input transform object for the stage anyway so it doesn't make sense to have multiple arguments) and also removing the transform_kwargs.

I would like to avoid passing a reference to the data module to the model, so with the above changes I think we could have something like this:

data_module = QuestionAnsweringData.from_*(
    transform=HuggingFaceQuestionAnsweringTokenizationTransform(backbone=...),
    ...
)

model = QuestionAnsweringTask(backbone=)

Then as @karthikrangasai suggests we would also be able to have e.g. glove or word2vec embedding and stuff like that as optional alternatives to the HF tokenization.

@mfojtak Are you on the PL slack? It might be easier for us to discus there :smiley:

mfojtak commented 2 years ago

@ethanwharris perfect! I guess we agree :-)

I would like to avoid passing a reference to the data module to the model

Absolutely agree here too. My mistake - It is trainer which links data with model. I am still not used to the "task" semantics which is in fact a model. But the idea remains.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.