deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

Any tools or plans for training data resampling and reordering? #798

Closed johann-petrak closed 2 years ago

johann-petrak commented 3 years ago

If I understand things correctly, the data silo preprocesses the training data once, converts it into the required tensors and then uses those tensors in exactly the same way in every epoch.

Now for some experiments and training approaches one may want to deviate from that:

- reorder the training instances between epochs or within an epoch,
- resample instances, e.g. over- or undersample to balance the classes,
- modify or generate instances on the fly (augmentation).

For most of the above, caching the generated tensors somehow would still be useful, and necessary if the data does not fit into memory.

In order to do these things, what would be the best and most compatible way forward?

Timoeller commented 3 years ago

I think there are a couple of interesting thoughts.

reorder

The last time I checked the DataLoader I thought it would shuffle automatically. But now I realize the default value for shuffle is False. Edit: We actually give the DataLoader a RandomSampler for training (see here), so we should be fine.
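
For reference, the general pattern looks roughly like this (a toy sketch, not the actual FARM code):

import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# toy dataset standing in for the tensors produced by the data silo
dataset = TensorDataset(torch.arange(10).unsqueeze(1))

# training loader: the RandomSampler reshuffles the indices every epoch,
# even though DataLoader's own shuffle argument defaults to False
train_loader = DataLoader(dataset, batch_size=4, sampler=RandomSampler(dataset))

for batch in train_loader:
    print(batch)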

resample

This is more of a research-related feature, but it could still be interesting. Do you have papers in mind that show significant improvements with those strategies? For classification at least we have a class_weights parameter, which I believe is a smarter way than resampling - but it was increasingly hard to tune and created problems in other parts of FARM (model conversion, saving + loading, API usage). Any ideas on your end @johann-petrak ?
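
For context, the general idea behind loss-side class weighting looks roughly like this (toy numbers, not FARM's exact class_weights handling):

import torch
import torch.nn as nn

# made-up class counts for an imbalanced 3-class problem
class_counts = torch.tensor([900.0, 80.0, 20.0])

# weight each class by its (normalised) inverse frequency, so rare classes
# contribute more to the loss instead of being oversampled in the data
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 3)            # fake model outputs for a batch of 4
targets = torch.tensor([0, 2, 1, 0])  # fake gold labels
print(loss_fn(logits, targets))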

modify/generate

You mean input augmentation? I think there is a lot out there for this. Do you have a specific method in mind? It would be interesting to get a text augmentation feature into FARM :)

johann-petrak commented 3 years ago

Thanks for the feedback! In general, this is all mostly about things I am planning to try for a classification head I am working on. For those experiments I would like to use the FARM framework as well and, if necessary, add a line of code or two, either just for my own use or to share as a PR if considered useful.

For reordering, I was also thinking about strategies where we can e.g. control the distribution of classes in each batch (class-stratified batches) or other properties over the entire epoch (basically, some callback or class that takes responsibility for the order).
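
A class-stratified batch sampler could look roughly like this untested sketch (the labels argument and the class itself are made up, nothing that exists in FARM):

import random
from collections import defaultdict
from torch.utils.data import Sampler

class StratifiedBatchSampler(Sampler):
    """Spread every class roughly evenly over the epoch so that each batch's
    class proportions approximate the dataset's. Illustrative sketch only;
    labels holds one class label per dataset index."""

    def __init__(self, labels, batch_size):
        self.labels = labels
        self.batch_size = batch_size

    def __iter__(self):
        by_class = defaultdict(list)
        for idx, label in enumerate(self.labels):
            by_class[label].append(idx)
        positioned = []
        for indices in by_class.values():
            random.shuffle(indices)
            # give each index an (almost) evenly spaced position in [0, 1)
            for k, idx in enumerate(indices):
                positioned.append(((k + random.random()) / len(indices), idx))
        order = [idx for _, idx in sorted(positioned)]
        for start in range(0, len(order), self.batch_size):
            yield order[start:start + self.batch_size]

    def __len__(self):
        return (len(self.labels) + self.batch_size - 1) // self.batch_size

# usage: DataLoader(train_dataset, batch_sampler=StratifiedBatchSampler(labels, 32))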

I am not sure if/why class weights are a smarter way than duplicating and properly distributing sample duplicates among batches, since the gradient will look very different in those two approaches, which may be a problem for very imbalanced class distributions. I don't know of a paper though.

For modify/generate: yes, basically input augmentation. For informal texts it may be useful for making a model more robust to lazy orthography, but again, this is mainly something I plan to try.
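
As a toy illustration of the kind of noise I have in mind (a made-up helper, not an existing FARM feature):

import random

def add_typo_noise(text, prob=0.05, seed=None):
    """Randomly swap or drop characters to mimic lazy orthography."""
    rng = random.Random(seed)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < prob and i + 1 < len(chars):   # swap two neighbouring characters
            out.extend([chars[i + 1], chars[i]])
            i += 2
        elif r < 2 * prob:                    # drop a character
            i += 1
        else:                                 # keep the character unchanged
            out.append(chars[i])
            i += 1
    return "".join(out)

print(add_typo_noise("This is an example sentence.", prob=0.1, seed=3))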

I think all of the above is related to how the processor and the data silo work, so I was wondering if somebody has already pondered where or how to fit these things in.

When it comes to priority, stratified batches and instance resampling/duplication are probably highest on my list.

Timoeller commented 3 years ago

Hey Johann,

> I am not sure if/why class weights are a smarter way than duplicating and properly distributing sample duplicates among batches

Agreed, "smart" is really not the right word here - what about "easy and already implemented"? :smile:

Honestly, all those things are not top priority for us, but we are happy about your quality contributions and they will for sure improve FARM. How about we proceed with stratified batches first and work our way along your propositions? I believe we can use https://pytorch.org/docs/stable/data.html#torch.utils.data.WeightedRandomSampler for it, but we would need a small snippet to calculate the weights?
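
Something like this for the weight calculation maybe (an untested sketch; the labels list is made up):

from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

# one class label per training example (made-up data)
labels = ["pos", "neg", "neg", "neg", "pos", "neg"]

class_counts = Counter(labels)
# weight each sample by the inverse frequency of its class,
# so minority-class examples are drawn more often
sample_weights = torch.tensor([1.0 / class_counts[l] for l in labels], dtype=torch.double)

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(labels),
                                replacement=True)
# then: DataLoader(train_dataset, batch_size=..., sampler=sampler)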

johann-petrak commented 3 years ago

Thanks for the pointer. Yes, stratified batches would sure be a good starting point! I will have a look as soon as I get back to this in a week or so.

johann-petrak commented 3 years ago

I think the most flexible way to approach this would be if we could just specify a sampler to use instead of the DistributedSampler/RandomSampler that the data silo currently uses by default. The problem is that the sampler gets initialized in the silo with the dataset, but ideally we want to parametrize this rather than having to write our own silo implementation.

So either we provide to the DataSilo

- an already created sampler instance (which is awkward, because the dataset it needs only exists inside the silo), or
- a sampler class (or factory callable), which the silo then instantiates with its own dataset.

Personally I like the second solution much better.
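
Roughly like this, as an untested sketch (the sampler_factory parameter and the helper function are made up, they do not exist in FARM):

from torch.utils.data import DataLoader, RandomSampler

def make_train_loader(train_dataset, batch_size, sampler_factory=None):
    # the silo instantiates the sampler itself, with the dataset only it knows about
    if sampler_factory is not None:
        sampler = sampler_factory(train_dataset)
    else:
        sampler = RandomSampler(train_dataset)  # current default behaviour
    return DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)

# hypothetical usage: pass e.g. sampler_factory=lambda ds: RandomSampler(ds, replacement=True)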

One other problem I have at the moment is that, for some of the things I would like to do, the sampler would need to know the epoch and the global step it is creating batches for.

The only way to achieve this is for the training loop to set the epoch using sampler_instance.set_epoch(epoch) before fetching the first batch of each epoch, and to set the global step using sampler_instance.set_global_step(global_step) before fetching the batch for that global step. The method set_epoch is already supported and required by the torch DistributedSampler.

So in train.py we could do something like

sampler = train_data_loader.sampler

if hasattr(sampler, "set_epoch"):  # if we have a sampler that supports it, set the epoch before starting the loop for that epoch
    sampler.set_epoch(epoch)
if hasattr(sampler, "set_global_step"):  # set the global step for the first batch before we start the epoch
    sampler.set_global_step(self.global_step)
for step, batch in ...:
    [...]
    if hasattr(sampler, "set_global_step"):  # set the global step for the next iteration
        sampler.set_global_step(self.global_step)
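
And a sampler could then implement those hooks roughly like this (an untested sketch; the class name and the ordering logic are just placeholders):

import random
from torch.utils.data import Sampler

class EpochAwareSampler(Sampler):
    """Sampler exposing the set_epoch/set_global_step hooks discussed above."""

    def __init__(self, data_source):
        self.data_source = data_source
        self.epoch = 0
        self.global_step = 0

    def set_epoch(self, epoch):       # called by the training loop once per epoch
        self.epoch = epoch

    def set_global_step(self, step):  # called by the training loop before each batch
        self.global_step = step

    def __iter__(self):
        indices = list(range(len(self.data_source)))
        # placeholder ordering: a deterministic per-epoch shuffle
        random.Random(self.epoch).shuffle(indices)
        return iter(indices)

    def __len__(self):
        return len(self.data_source)
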
stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.