Hey @mfojtak Thanks for reporting this! Great analysis of the issue, I think having an iterable input for question answering would be a great way forward. Is this something you'd like to help out with? There may be some hoops we need to jump through to get this to work with our API but we could help there :smiley:
Hi @ethanwharris Thanks for the swift answer. I have already implemented the iterable version and tested it. It would be better, though, if two modes were supported: one sample per SQuAD example and one sample per chunk. I am also testing whether impossible answers created from multiple chunks would benefit model performance.
There is also some refactoring of the base classes required for this to work. It might be better to use Python generator syntax instead of the load_data and load_sample approach. It would be more intuitive and would make authoring new datamodules easier.
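For illustration, a minimal sketch of what a generator-based, one-sample-per-chunk input could look like (the function name and defaults here are hypothetical, not the actual Flash API):

from transformers import AutoTokenizer

def iter_squad_chunks(examples, backbone="distilbert-base-uncased", max_length=384, stride=128):
    # Hypothetical generator: yields one sample per tokenized chunk.
    # A "one sample per SQuAD example" mode could instead keep only the chunk containing the answer.
    tokenizer = AutoTokenizer.from_pretrained(backbone)
    for example in examples:  # each example is e.g. {"question": str, "context": str}
        encoding = tokenizer(
            example["question"],
            example["context"],
            max_length=max_length,
            stride=stride,
            truncation="only_second",
            return_overflowing_tokens=True,
        )
        for i in range(len(encoding["input_ids"])):
            yield {key: values[i] for key, values in encoding.items()}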
Can I just create a pull request for this?
@mfojtak Yes definitely happy to have a PR with it :smiley: Note that we have recently changed our text tasks to apply the tokenization in the collate function instead of the load_sample, but I think we could change it back for question answering in order to support the iterable approach. Could you open a PR with what you have? Then I can help with updating to our current API :smiley:
@ethanwharris let me implement it in the latest and greatest API. For this - could you please give me some clarity on how the new API works. In my understanding:
The Input is now responsible only for loading samples, with no feature transformations (e.g. tokenization). The InputTransform should be used instead to transform samples into features understood by the model. There is also collate functionality in InputTransform.
BUT - I noticed you created TransformersCollate callable class. How does this class fit into the picture?
@mfojtak Sure :smiley: The main thing we tried to resolve is that the model should be responsible for the tokenization. It used to be (a few versions ago now) that you had to provide the same backbone argument to the datamodule and the task. We later added a mechanism for the datamodule and task to share state, which allowed the model to set a tokenizer state that could then be used by the Input.load_sample. The main issue with the state mechanism was that it connected too many things together and it was unclear how different objects modified it, so we've removed it. We also have a way that models can set a collate function (this makes sense e.g. for object detection where different models may need the data collated in different ways). So the QA task currently works by creating a TextQuestionAnsweringCollate, which handles the tokenization, in the task init here: https://github.com/PyTorchLightning/lightning-flash/blob/4c624823d7992aa454b9f064d2714cae9b4a8893/flash/text/question_answering/model.py#L131
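Roughly, the pattern looks like this (a simplified sketch, not the exact Flash implementation; the class names here are made up):

from transformers import AutoTokenizer

class TokenizingCollate:
    # Sketch of a collate callable that owns the tokenizer, so the datamodule
    # never needs to know the backbone.
    def __init__(self, backbone):
        self.tokenizer = AutoTokenizer.from_pretrained(backbone)

    def __call__(self, samples):
        return self.tokenizer(
            [s["question"] for s in samples],
            [s["context"] for s in samples],
            padding=True,
            truncation=True,
            return_tensors="pt",
        )

class SketchQATask:  # stand-in for the real task class
    def __init__(self, backbone="distilbert-base-uncased"):
        # The task creates the collate in its __init__, mirroring the linked code.
        self.collate_fn = TokenizingCollate(backbone)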
The problem with our current approach is that you wouldn't be able to implement the iterable tokenization, which I think would be a great feature to have. This is the approach I think we could take:
- use an iterable input (created inside the *_dataloader hooks of the datamodule)
- apply the tokenization in load_sample, but getting the tokenizer from the model via self.trainer.lightning_module
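A minimal sketch of that idea, assuming the model exposes a tokenizer attribute and that the datamodule attaches a trainer reference (hypothetical names, not the Flash API):

class LazyTokenizingInput:
    # Sketch of an Input whose load_sample tokenizes lazily, fetching the
    # tokenizer from the model via the trainer reference.
    def __init__(self, samples, trainer=None):
        self.samples = samples
        self.trainer = trainer  # assumed to be attached by the datamodule

    def load_sample(self, sample):
        tokenizer = self.trainer.lightning_module.tokenizer  # model owns the tokenizer (assumed attribute)
        return tokenizer(sample["question"], sample["context"], truncation=True)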
Hope that helps :smiley: sorry for the long message, let me know if you have any thoughts / comments
@ethanwharris Thanks for the clarification. I fully understand your points, and while implementing the iterable version I ran into exactly the issues you pointed out.
I agree the "get_state" approach is complicated and self.trainer is better.
The question is where the tokenization should happen. My suggestion is to put the feature calculations in a separate component that could be attached either to the DataModule or to the model itself. Where it is placed depends on the task at hand.
It looks like InputTransform is a good candidate for this component. But you did not mention it, and I can see in the code that it is not used very often. Is the Input/Output Transform API planned to be used going forward?
Benefits of attaching the transform to the model: model-specific information and components (e.g. the tokenizer) could be accessed directly from the model; more compact models that do the full end-to-end job could be implemented; and compact ONNX and TorchScript exports could be generated (e.g. including the tokenization), resulting in more robust and convenient serving.
The only question is about the Transform API: is it the right, future-proof approach? Also, why wasn't TextQuestionAnsweringCollate implemented as a Transform?
The InputTransform is primarily intended as a way for us to provide some default augmentations / transforms and to allow users to override them. It creates a bit of an issue when there are transforms that are required for the model to work, because transforms don't stack: you can only have one InputTransform operating at a time, so you could add a transform that augments the text somehow but accidentally turn off the tokenization. Usually the collate function cannot be changed without breaking the model, so we allow the model to dictate the collate function in those cases. The InputTransform is owned by the Input, so we could allow both to have a reference to the trainer if needed.
I think it could make a lot of sense for the InputTransform to apply the tokenization, as that could be something users want to override. It could also then be included in the serving pipeline as you describe.
How would this work with the iterable QA input? Can the contexts, questions, etc. be split into fixed sized chunks without also tokenizing them or would we need more of a special case there?
The situation is getting a little more complicated. First of all, there is a bug in the SQuAD parsing, as it doesn't account for multiple answers per question. But I can fix that easily.
However, it uncovered more design problems, bugs, and confusions. The code below crashes because the train Input class has some logic which adds default_collate automatically. This function crashes when samples have collections of different sizes. The Input class should not be doing anything with the samples unless explicitly told to; it should just yield samples as-is and apply a transform if one is specified.
datamodule = QuestionAnsweringData.from_squad_v2(  # no InputTransform or collate specified, not connected with a trainer or model
    train_file="/data/share/cuad/CUAD_v1/CUAD_v1.json",
    batch_size=2,
)
dl = DataLoader(datamodule.train_dataset, batch_size=2)
for sample in dl:
    print(sample)  # crashes here because of default_collate
My plan now is to implement an InputTransform for QA and fix the SQuAD parsing bug and the feature extraction bug.
However, it would be good to start decoupling concepts. Currently everything is stitched together in a not-so-transparent way and it is not easy to understand the behavior of the API. Also, many things happen implicitly, which hinders implementation and debugging.
The plan sounds good and agree with the decoupling :smiley:
Couple of comments:
- In your example, from_squad_v2 has train_transform: INPUT_TRANSFORM_TYPE = InputTransform as a default; if you pass train_transform=None then that should disable the default_collate that is being added.
- It's the DataLoader in your snippet that does the collation, right? So the default_collate is probably being added there. If you just want to access the uncollated samples you can do e.g.: datamodule.train_dataset[0]
The only difference currently when using the datamodule outside of a trainer etc. is that the model needs to override the collate in order to own the tokenization. One option there is that we could change the default transform for QA to just return the list of samples from the collate.
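For example, the default QA transform could expose a pass-through collate, something like this (the class name is illustrative, not an existing Flash class):

class PassThroughCollateTransform:
    # Sketch of a transform whose collate just returns the list of samples
    # unchanged, so iterating the DataLoader yields uncollated samples.
    def collate(self):
        return lambda samples: samples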
Hello all,
The initial version, before the data pipeline overhaul, used default_data_collator from the huggingface/transformers library directly to collate the PyTorch tensors that were being output. I guess this has been changed, and I also overlooked it while going through the code.
Going through the conversation above, I agree that we could put the tokenization process in the InputTransform class, but the question is: would this be done per_batch, per_sample, or over the whole dataset? In the previous version of the API, it would tokenize the entire dataset when creating the datamodule for the first time, and further calls to training would load the cached dataset into the dataloader, if I remember correctly.
Hey @karthikrangasai Yes that's a good point, I had forgotten about it but one option is we could add a hook to the InputTransform that would allow for processing the whole data set. This could be quite useful for static preprocessing like tokenization or e.g. creating spectrograms in audio tasks.
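As a rough sketch (the hook name and signature here are hypothetical, not an existing Flash API):

class TokenizeDatasetTransform:
    # Hypothetical InputTransform-style object with a whole-dataset hook that
    # tokenizes everything once, so later epochs can reuse the cached output.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def per_dataset_transform(self, samples):
        encodings = self.tokenizer(
            [s["question"] for s in samples],
            [s["context"] for s in samples],
            truncation=True,
        )
        # Re-split the batch encoding back into one dict per sample.
        return [
            {key: encodings[key][i] for key in encodings}
            for i in range(len(samples))
        ]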
Should we also cover the use case where users want to further transform the already processed dataset? I am not sure if such a case exists, but that would be additional InputTransforms, I assume.
I think we could have the following API:
- Inside the *_dataloader hook of the DataModule: apply the per_dataset_transform (or whatever we call it).
- In the per_dataset_transform: do the static preprocessing, e.g. the tokenization.
- In the collate: batch the already-processed samples (sketched below).
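A rough sketch of that flow, assuming a transform object that exposes a per_dataset_transform method and a collate callable (all names hypothetical, not the current Flash API):

from torch.utils.data import DataLoader

class SketchDataModule:
    # The *_dataloader hook applies the per_dataset_transform once, then the
    # DataLoader's collate only has to batch already-processed samples.
    def __init__(self, dataset, transform, batch_size=2):
        self.dataset = dataset        # list of raw samples
        self.transform = transform    # assumed to provide per_dataset_transform and collate
        self.batch_size = batch_size

    def train_dataloader(self):
        processed = self.transform.per_dataset_transform(self.dataset)
        return DataLoader(processed, batch_size=self.batch_size,
                          collate_fn=self.transform.collate)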
Let me know your thoughts :smiley:
Hi all, see my comments below.
> Going through the conversation above, I agree that we could put the tokenization process in the InputTransform class but the question is would this be done: per_batch

Per batch, because in general a transform might do batch-level optimizations. In fact, the transformers tokenizer operates on batches.
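For example, the Hugging Face tokenizer takes a whole batch in one call, and padding to the longest sequence in the batch is itself a batch-level optimization (the backbone and strings below are arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
batch = tokenizer(
    ["What is Flash?", "Where should tokenization happen?"],               # questions
    ["Flash is a task library ...", "It is discussed in this issue ..."],  # contexts
    padding="longest",
    truncation=True,
    return_tensors="pt",
)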
> In your example the from_squad_v2 has train_transform: INPUT_TRANSFORM_TYPE = InputTransform as a default, if you pass train_transform=None then that should disable the default_collate that is being added

The input transform cannot be turned off, even by setting it to None.
Basically, the API should follow the Lightning API in order to be simplified, in my opinion.
- The Dataset should just yield samples. It should have no knowledge of the model, loader, or trainer.
- The Loader does batching, shuffling, etc. The format of the samples in a batch should be preserved. It should have no knowledge of the model or trainer.
- A Transform could be attached to the dataset or loader to perform additional transformations. It could even do tokenization, but in that case the tokenizer should be externally instantiated and explicitly passed to the transform.
- A Transform could be attached to the model to do tokenization or feature extraction. If attached to the model, the transform could use model-specific info to e.g. instantiate the tokenizer.
Question answering is used as an example here. All the above applies in general to all tasks.
This is more or less how it is designed in Lightning. Flash, in my opinion, should implement specific tasks only. There are parts where it feels like Flash implements Lightning inside Lightning.
Please share your opinions.
Yes, @mfojtak I generally agree with all of your points.
- The Dataset within Flash is the Input object.
- The InputTransform is owned by the Input (the same way you would normally pass a list of transforms e.g. to a torchvision dataset).

The above is all generally quite clean except for a few caveats:
Removing the above two would simplify the transforms significantly.
As for where the tokenization should happen, I don't think any transforms should be attached to the model. The main reason for this is that intensive transforms (like tokenization) should be executed in parallel within the dataloader workers (that is, in the __getitem__ or injected into the collate function) for performance reasons.
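To illustrate why placement matters: with num_workers > 0, the collate_fn (and any tokenization injected into it) runs inside the DataLoader worker processes, in parallel (toy example, the collate below is just a placeholder):

from torch.utils.data import DataLoader

def tokenizing_collate(samples):
    # Placeholder for a tokenizing collate; with num_workers > 0 this
    # function executes inside each worker process.
    return samples

loader = DataLoader(list(range(100)), batch_size=4, num_workers=2,
                    collate_fn=tokenizing_collate)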
> It could even do tokenization, but in that case the tokenizer should be externally instantiated and explicitly passed to the transform.
Let's try to design the API here. We could have the following:
data_module = QuestionAnsweringData.from_*(
    transform_kwargs={"backbone": ...},
    ...
)
model = QuestionAnsweringTask(backbone=...)
We used to have something like that but didn't like it, as you had to specify the backbone in two places. Alternatively we could be more explicit:
data_module = QuestionAnsweringData.from_*(
    transform_kwargs={"backbone": ...},
    train_transform=HuggingFaceQuestionAnsweringTokenizationTransform,
    ...
)
model = QuestionAnsweringTask(backbone=...)
Personally, I think that allowing the transform to have a reference to the trainer (and thus the model) is fine since the datamodule gets this reference already in PL. That would then allow for the following, which is my personal preference:
data_module = QuestionAnsweringData.from_*(
...
)
model = QuestionAnsweringTask(backbone=...)
Note that all of the above would allow the tokenization to be overridden manually by providing a custom input transform to the from_* method.
I propose the following steps forward:
How about this:
tokenizer = ...
data_module = QuestionAnsweringData.from_*(
    transform=HuggingFaceQuestionAnsweringTokenizationTransform(tokenizer=tokenizer),  # no kwargs
    ...
)
model = QuestionAnsweringTask(data_module=..., backbone=...)
Be aware that in some (not uncommon) cases the tokenizer backbone might differ from the model's. E.g. you are adding extra tokens, which requires instantiating your own tokenizer, or you are using a different tokenization library.
or:
data_module = QuestionAnsweringData.from_*(
...
)
model = QuestionAnsweringTask(data_module=..., backbone=...)
def __init__(self, data_module, backbone):
    if not data_module.transform and not self.input_transform:  # or other smart heuristics to adapt any input type
        tokenizer = Tokenizer(backbone)
        self.input_transform = HuggingFaceQuestionAnsweringTokenizationTransform(tokenizer=tokenizer)
or
data_module = QuestionAnsweringData.from_*(
...
)
transform = HuggingFaceQuestionAnsweringTokenizationTransform(backbone=...)
model = QuestionAnsweringTask(data_module=..., backbone=..., input_transform=transform)
It could get even smarter: the model could propose transforms automatically based on the data module's output type information. Adapters could be registered to adapt arbitrary input to the model; this is how many dataflow frameworks work (e.g. ffmpeg). In the above code snippets all objects live independently and no object sets another object's properties. It also feels more Pythonic and object-oriented. E.g. why have separate parameters for the class and its args when we can do single_param=Class(args)?
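As a toy sketch of the adapter idea (the names are made up, nothing like this exists in Flash today):

ADAPTER_REGISTRY = {}

def register_adapter(output_type, transform_factory):
    # Map a datamodule output type (e.g. "raw_text") to a transform factory.
    ADAPTER_REGISTRY[output_type] = transform_factory

def propose_transform(data_module):
    # The task could call this to pick a transform automatically based on
    # the datamodule's declared output type (assumed attribute).
    factory = ADAPTER_REGISTRY.get(getattr(data_module, "output_type", None))
    return factory() if factory is not None else None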
I would prefer the API to look like:
data_module = QuestionAnsweringData.from_*(
    transform_kwargs={"backbone": ...},
    train_transform=HuggingFaceQuestionAnsweringTokenizationTransform,
    ...
)
model = QuestionAnsweringTask(backbone=...)
data_module = QuestionAnsweringData.from_*(
    train_transform=GloveEmbeddingTransform,
    transform_kwargs={"name": "6B", "dim": 50},
    ...
)
rnn_model = RNN()
model = QuestionAnsweringTask(backbone=rnn_model)
Yes, @mfojtak definitely agree with your points there. Personally I would be very happy to see this API for transforms:
data_module = DataModule.from_*(
    transform=MyCustomTransform(image_size=128, ...),
    ...
)
This would involve a slight change in how the transforms work underneath, then removing the *_transform arguments in favour of a single one (you can customize the input transform object for the stage anyway, so it doesn't make sense to have multiple arguments), and also removing the transform_kwargs.
I would like to avoid passing a reference to the data module to the model, so with the above changes I think we could have something like this:
data_module = QuestionAnsweringData.from_*(
    transform=HuggingFaceQuestionAnsweringTokenizationTransform(backbone=...),
    ...
)
model = QuestionAnsweringTask(backbone=...)
Then, as @karthikrangasai suggests, we would also be able to have e.g. GloVe or word2vec embeddings and things like that as optional alternatives to the HF tokenization.
@mfojtak Are you on the PL slack? It might be easier for us to discus there :smiley:
@ethanwharris perfect! I guess we agree :-)
> I would like to avoid passing a reference to the data module to the model
Absolutely agree here too. My mistake: it is the trainer which links the data with the model. I am still not used to the "task" semantics, where the task is in fact a model. But the idea remains.
🐛 Bug
In the case of a long QA context, the Hugging Face tokenizer divides the tokenized output into chunks, which is expected and correct. But the load_sample function in QuestionAnsweringInputBase returns a collated sample, which results in arbitrarily sized batches that ignore the specified batch_size. This may result in CUDA OOM and other problems.
One sample per chunk is created instead of one sample per SQuAD sample. It looks like the code tries to "utilize" all chunks even if they do not contain the answer. That might be useful, but in that case IterableInput should be used. By default only one sample per SQuAD sample should be returned and impossible answers ignored (unless they are SQuAD impossible answers).
To Reproduce
Steps to reproduce the behavior:
Code sample
Here 2 samples per batch are requested. If a sample's context size is > 4096 then multiple chunks are returned: e.g. if the first context size is 5000 and the second context size is 3000, then 3 samples will be yielded from QuestionAnsweringInputBase.
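An illustration of the underlying tokenizer behaviour that produces the extra samples (backbone and sizes chosen arbitrarily):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoding = tokenizer(
    "What is being discussed?",
    "word " * 5000,                  # a long context forces overflow into chunks
    max_length=384,
    stride=128,
    truncation="only_second",
    return_overflowing_tokens=True,
)
print(len(encoding["input_ids"]))    # > 1: one SQuAD sample became several chunks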
Expected behavior
The correct number of samples per batch is returned.
Environment
- How installed (conda, pip, source): pip

Additional context
Possible solutions: QuestionAnsweringInputBase could be based on IterableInput, since the number of samples is not known in advance, or a completely new iterable version could be implemented separately. Alternatively, the "classic" Input could remain, but then one sample per SQuAD sample must be returned.