allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Limit dataset size #3099

Closed dorcoh closed 5 years ago

dorcoh commented 5 years ago

Hi, first I'd like to say many thanks for this great project. It surely helps language researchers and engineers throughout the world.

I explored the project code and could not find a way to read a subset of the data through DatasetReader. I think this would be useful during development, when we usually don't want to load the entire dataset. It is worth noting that the iterator class has a parameter, DataIterator.instances_per_epoch, which limits the number of samples per epoch; however, as I understand it, it is applied only after the entire dataset has been loaded into memory.

I searched a bit and found that Python generators can be limited with the itertools module. I thought about wrapping the call to the generic method DatasetReader._read() inside DatasetReader.read(). Please let me know what you think.
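To illustrate the itertools idea mentioned above, here is a minimal sketch showing that islice stops a generator early without consuming the rest of it. The read_instances function and the produced list are hypothetical stand-ins for a lazy reader; they are not part of AllenNLP.

```python
from itertools import islice


def read_instances(produced):
    """Hypothetical lazy reader: yields one item at a time and records each item it produces."""
    for i in range(1_000_000):
        produced.append(i)
        yield f"instance-{i}"


produced = []

# Take only the first 3 items; the generator is never advanced past them.
subset = list(islice(read_instances(produced), 3))

print(subset)         # ['instance-0', 'instance-1', 'instance-2']
print(len(produced))  # 3
```

Because only three items are ever produced, the limit applies while reading rather than after the whole dataset has been loaded.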

In any case, it's pretty easy to work around this by manually creating a folder that contains a subset of the data, but I think this way is more convenient.

matt-gardner commented 5 years ago

Yeah, we used to have something like this in an earlier version of this library. These days we just use very small test fixtures when we're testing things out, so we don't typically need this.

My immediate reaction is to suggest writing a SubsampledDatasetReader that just wraps another dataset reader and does something like yield from self._sub_dataset_reader._read()[:k]. I think that would work, and would be just a few lines of code.
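A rough sketch of the wrapping idea, under some assumptions: ToyReader and SubsampledReader are hypothetical plain-Python classes used only for illustration and do not subclass AllenNLP's DatasetReader. Since a generator cannot be sliced with [:k] directly, the sketch uses itertools.islice to get the same effect.

```python
from itertools import islice
from typing import Iterator


class ToyReader:
    """Hypothetical stand-in for a dataset reader; not part of AllenNLP."""

    def _read(self, file_path: str) -> Iterator[str]:
        for i in range(100):
            yield f"{file_path}:{i}"


class SubsampledReader:
    """Wraps another reader and yields at most k of its instances."""

    def __init__(self, sub_reader, k: int) -> None:
        self._sub_reader = sub_reader
        self._k = k

    def _read(self, file_path: str) -> Iterator[str]:
        # Generators do not support [:k]; islice truncates them lazily instead.
        yield from islice(self._sub_reader._read(file_path), self._k)


reader = SubsampledReader(ToyReader(), k=2)
first_two = list(reader._read("train.txt"))
print(first_two)  # ['train.txt:0', 'train.txt:1']
```

Composition keeps the wrapper independent of any particular reader, at the cost of forwarding whatever other methods the wrapped reader exposes.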

dorcoh commented 5 years ago

Sounds good, I think it might also be useful since nowadays it’s common to train on a subset of the data such as 10%, 20%, etc.

DeNeutoy commented 5 years ago

Since the way you might want to subsample is task-dependent, I think we can close this: it is quite easy for people to do themselves, and I'm not sure how broadly useful it is 👍

dorcoh commented 5 years ago

Sure. For completeness, here is my implementation, in my case for SRL:

from allennlp.data.dataset_readers.semantic_role_labeling import SrlReader
from itertools import islice
from overrides import overrides

class SubsampledSrlReader(SrlReader):
    """SrlReader that yields at most num_samples instances."""

    def __init__(self, num_samples, **kwargs):
        super().__init__(**kwargs)
        self.k = num_samples  # maximum number of instances to yield

    @overrides
    def _read(self, datapath):
        # islice stops pulling from the parent generator after self.k instances.
        yield from islice(super()._read(datapath), self.k)