allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

The DataLoader Needs to Handle Dirty Examples. #5709

Closed Alexixu closed 1 year ago

Alexixu commented 1 year ago

Is your feature request related to a problem? Please describe. I need to discard examples while reading the data, but the data loader stops when the dataset reader returns None or raises an exception. Returning an empty Instance does not work either.

Describe the solution you'd like The data loader should automatically discard an instance when the dataset reader returns None.

Describe alternatives you've considered Define an exception type such as DiscardException to implement this logic gracefully.

AkshitaB commented 1 year ago

@Alexixu Possibly your use case can be solved by handling empty instances in the (custom) DatasetReader you're using. If not, please share more details on which dataset reader you are running this with.

Alexixu commented 1 year ago

@AkshitaB The dataset reader is a custom class that inherits from DatasetReader. Empty instances would be fine if the data loader could handle them, and discarding empty instances is the most direct way to do that, but the default DataLoader implementation has no such logic. In my view, throwing an exception is more suitable for a corrupt example: the DataLoader would catch the exception and discard the example.

dirkgr commented 1 year ago

@Alexixu, you can do this in the DatasetReader if you override how _read() works. You can return something like None from DatasetReader.text_to_instance(), and then do the right thing in _read().

Alexixu commented 1 year ago

@dirkgr I have tried this, but the default DataLoader implementation cannot handle a None object, and it throws an exception: "None type has no index function".

I suggest handling this explicitly by defining a concrete exception and adding try/catch logic to the DataLoader implementation.

dirkgr commented 1 year ago

What I'm saying is, you can change this behavior in your own DatasetReader, where you override the _read() method to throw away the None objects.
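
A minimal sketch of that suggestion, under stated assumptions: the `Instance` and `SkippingReader` classes below are stand-ins so the example runs without AllenNLP installed, and `_read()` takes an iterable of lines rather than a file path. In real code the reader would subclass `DatasetReader` and build real instances; only the skip-on-None pattern is the point here.

```python
from typing import Iterable, Iterator, Optional


class Instance:
    """Stand-in for an AllenNLP Instance (assumption, for illustration only)."""

    def __init__(self, text: str) -> None:
        self.text = text


class SkippingReader:
    """Hypothetical reader; a real one would subclass DatasetReader."""

    def text_to_instance(self, line: str) -> Optional[Instance]:
        line = line.strip()
        # Signal a dirty example by returning None instead of an Instance.
        if not line:
            return None
        return Instance(line)

    def _read(self, lines: Iterable[str]) -> Iterator[Instance]:
        for line in lines:
            instance = self.text_to_instance(line)
            if instance is None:
                # Throw the dirty example away here, so nothing downstream
                # ever sees a None.
                continue
            yield instance


reader = SkippingReader()
texts = [inst.text for inst in reader._read(["good one", "   ", "another"])]
```

Because `_read()` is a generator, the None values never leave the reader, so whatever consumes the instances only ever receives valid ones.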

github-actions[bot] commented 1 year ago

This issue is being closed due to lack of activity. If you think it still needs to be addressed, please comment on this thread 👇

Alexixu commented 1 year ago

@dirkgr I have tried exactly that, by implementing the _read function to return a None object. But the data loader (not the dataset reader), which calls the text_to_instance function, cannot handle a None object.

dirkgr commented 1 year ago

The _read() function should not return None. The _read() function is where you detect None and throw it away (instead of returning it).

Think of it this way: From _read() you have to return an iterable of instances. AllenNLP does not care how you do this. It only cares that _read() returns an iterable of instances. So you can do whatever you want inside of _read(), including skipping instances.
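
The same reasoning also covers the DiscardException idea raised at the top of the thread: the exception can be raised and caught entirely inside the reader. The following is only a sketch under assumptions. `Instance` and `RaisingReader` are stand-ins so it runs without AllenNLP, the tab-separated "text, label" format is invented for illustration, and `_read()` takes an iterable of lines instead of a file path.

```python
from typing import Iterable, Iterator


class DiscardException(Exception):
    """Raised by text_to_instance for an example that should be dropped."""


class Instance:
    """Stand-in for an AllenNLP Instance (assumption, for illustration only)."""

    def __init__(self, text: str, label: str) -> None:
        self.text = text
        self.label = label


class RaisingReader:
    """Hypothetical reader; a real one would subclass DatasetReader."""

    def __init__(self) -> None:
        self.skipped = 0  # count of discarded examples, useful for logging

    def text_to_instance(self, line: str) -> Instance:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not all(fields):
            raise DiscardException(f"malformed line: {line!r}")
        return Instance(text=fields[0], label=fields[1])

    def _read(self, lines: Iterable[str]) -> Iterator[Instance]:
        for line in lines:
            try:
                instance = self.text_to_instance(line)
            except DiscardException:
                self.skipped += 1
                continue  # corrupt example: skip it instead of stopping
            yield instance


reader = RaisingReader()
instances = list(reader._read(["great\tpos", "broken line", "bad\tneg"]))
labels = [inst.label for inst in instances]
```

Catching only `DiscardException` keeps unexpected errors visible, and the `skipped` counter makes it easy to report how much data was dropped.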