deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0
1.73k stars, 247 forks

Do devset stratification #819

Closed: johann-petrak closed this issue 3 years ago

johann-petrak commented 3 years ago

With a large class imbalance, a randomly drawn dev set can have a class distribution that differs considerably from that of the full data. This distorts the loss and metric estimates computed on the dev set, which in turn can lead to poor early stopping decisions.
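To give a feeling for the size of the effect, here is a rough sketch that computes the expected number of minority-class instances in a randomly drawn dev set, and its standard deviation, from the hypergeometric distribution. The helper name `minority_count_spread` and the numbers (1000 instances, 50 of them in the minority class, a dev set of 100) are made up for illustration and do not come from FARM.

```python
import math


def minority_count_spread(n_total, n_minority, n_dev):
    """Mean and standard deviation of the minority-class count in a dev set
    drawn uniformly at random without replacement (hypergeometric model).

    Hypothetical helper for illustration only; not part of FARM.
    """
    p = n_minority / n_total
    mean = n_dev * p
    # Hypergeometric variance, including the finite population correction.
    var = n_dev * p * (1 - p) * (n_total - n_dev) / (n_total - 1)
    return mean, math.sqrt(var)


# Illustrative numbers only: 5% minority class, 10% dev split.
mean, std = minority_count_spread(n_total=1000, n_minority=50, n_dev=100)
print(f"expected minority instances in dev set: {mean:.1f} +/- {std:.1f}")
```

With these assumed numbers the dev set contains about 5 minority instances on average with a standard deviation of about 2, so getting only 2 or 3 of them in a given split is not unusual, and any per-class metric computed on so few instances is correspondingly noisy.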

I think we should add a parameter dev_stratification to the text classification processor and create the dev set with stratification if this parameter is True (the default would be False for backwards compatibility). This would also make it unnecessary to implement stratification separately for xval and holdout estimation: instead, we could inherit the dev set stratification setting from the parent silo's processor.
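A minimal sketch of what stratified dev set creation could look like, assuming single-label text classification where each instance has exactly one label. The function name `stratified_dev_split` and its arguments are hypothetical and only illustrate the idea; this is not FARM's actual implementation or API.

```python
import random
from collections import defaultdict


def stratified_dev_split(labels, dev_split=0.1, seed=42):
    """Return (train_indices, dev_indices) so that each class contributes
    roughly the fraction dev_split of its instances to the dev set.

    Hypothetical helper for illustration only; not part of FARM.
    """
    rng = random.Random(seed)

    # Group instance indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, dev_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the dev share per class instead of over the whole data set.
        n_dev = round(len(indices) * dev_split)
        dev_idx.extend(indices[:n_dev])
        train_idx.extend(indices[n_dev:])
    return sorted(train_idx), sorted(dev_idx)


# Illustrative data: 950 negative and 50 positive instances.
labels = ["neg"] * 950 + ["pos"] * 50
train_idx, dev_idx = stratified_dev_split(labels, dev_split=0.1)
n_pos_dev = sum(labels[i] == "pos" for i in dev_idx)
print(len(train_idx), len(dev_idx), n_pos_dev)
```

Because the split is done per class, the dev set in this example always contains the same share of positive instances as the full data, regardless of the random seed. Note that very small classes can end up with zero dev instances after rounding, which a real implementation would need to handle.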

johann-petrak commented 3 years ago

This means I would move the devset stratification approach that I have now implemented for holdout estimation to DataSilo._create_dev_from_train().

If we do this, the PR for this issue should be merged before PRs 818 and 817. Actually, it is probably better to do it all in one PR covering this issue, issue #811 and issue #812.

Please let me know if you are in favour of doing this.

johann-petrak commented 3 years ago

Since I have to get a number of things running in the next weeks, with or without FARM, I wonder if there is a way to plan larger additions together with the FARM development team. I would still like to share the things I have to implement and contribute them to FARM, but if that is difficult or not possible without longer delays, I would prefer to implement my own solution outside of FARM for the specific problems I need to solve (mainly devset stratification, holdout estimation and correct cross validation, batch stratification and instance oversampling, all for the text classification task).

Timoeller commented 3 years ago

Hey Johann, it is totally understandable that you want to get feedback on your work quickly. If you create a single PR, without the old history, I can merge it very quickly (<24 hours). Does that sound OK?

For the future we can discuss alternatives. I will send you an email with some dates for a call.

johann-petrak commented 3 years ago

Thanks. I will now work on issues 811, 812 and this one (819) and provide one PR for all of them (without any annoying additional history). I will close the current PRs for 811 and 812.

Timoeller commented 3 years ago

Fixed by #825.