dorcoh closed 5 years ago
Yeah, we used to have something like this in an earlier version of this library. These days we just use very small test fixtures when we're testing things out, so we don't typically need this.
My immediate reaction is to suggest writing a `SubsampledDatasetReader` that just wraps another dataset reader and does something like `yield from self._sub_dataset_reader._read()[:k]`. I think that would work, and would be just a few lines of code.
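A minimal sketch of that wrapper idea, under some assumptions: `ToyReader` and `SubsampledReader` are hypothetical stand-ins rather than AllenNLP classes, and `_read()` is assumed to take a file path and return a generator. A generator cannot be sliced with `[:k]` (that raises a `TypeError`), so the sketch uses `itertools.islice` instead.

```python
from itertools import islice
from typing import Iterator


class ToyReader:
    """Hypothetical stand-in for a real dataset reader (not an AllenNLP class)."""

    def __init__(self, num_instances: int) -> None:
        self.num_instances = num_instances
        self.produced = 0  # counts how many instances were actually generated

    def _read(self, file_path: str) -> Iterator[str]:
        for i in range(self.num_instances):
            self.produced += 1
            yield f"{file_path}#{i}"


class SubsampledReader:
    """Hypothetical wrapper: yields at most k instances from the wrapped reader."""

    def __init__(self, sub_reader: ToyReader, k: int) -> None:
        self._sub_reader = sub_reader
        self._k = k

    def _read(self, file_path: str) -> Iterator[str]:
        # islice stops pulling from the underlying generator after k items,
        # so the rest of the dataset is never read.
        yield from islice(self._sub_reader._read(file_path), self._k)


inner = ToyReader(num_instances=1000)
subset = list(SubsampledReader(inner, k=3)._read("train.txt"))
print(subset)          # ['train.txt#0', 'train.txt#1', 'train.txt#2']
print(inner.produced)  # 3
```

Because the wrapper only composes generators, it should work with any reader whose `_read()` is lazy. A reader that builds a full list before yielding would still pay the full loading cost.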
Sounds good, I think it might also be useful since nowadays it’s common to train on a subset of the data such as 10%, 20%, etc.
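For the percentage case, one possible sketch keeps each instance with a fixed probability. The helper name `subsample_fraction` and its `seed` parameter are assumptions made for illustration, not part of any library. Unlike taking the first k instances, this still iterates over the whole dataset, and the resulting size is only approximately the requested fraction.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def subsample_fraction(instances: Iterable[T], fraction: float, seed: int = 0) -> Iterator[T]:
    """Lazily keep each instance with probability `fraction` (hypothetical helper)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # private RNG so the same seed gives the same subset
    for instance in instances:
        if rng.random() < fraction:
            yield instance


kept = list(subsample_fraction(range(10_000), fraction=0.1, seed=13))
print(len(kept))  # roughly 1000; the exact size depends on the seed
```

Whether this is the right sampling scheme depends on the task: it ignores document boundaries and class balance, for example.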
As the way in which you might want to subsample is task dependent, I think we can close this: it is quite an easy thing for people to do themselves, and I'm not sure how broadly useful it is 👍
Sure. For completeness, here is the implementation I ended up with, in my case for SRL:
from allennlp.data.dataset_readers.semantic_role_labeling import SrlReader
from itertools import islice
from overrides import overrides


class SubsampledSrlReader(SrlReader):
    def __init__(self, num_samples, **kwargs):
        super().__init__(**kwargs)
        self.k = num_samples

    @overrides
    def _read(self, datapath):
        yield from islice(super()._read(datapath), self.k)
Hi, first I'd like to say many thanks for this great project; it surely helps language researchers and engineers around the world.
I explored the project code and could not find a way to read a subset of the data through `DatasetReader`. I think it could be useful during development, when we usually don't want to load the entire dataset. It's worth noting that there's a parameter in the iterator class, `DataIterator.instances_per_epoch`, which limits the number of samples per epoch; however, as I understand it, it is applied only after the entire dataset has been loaded into memory.

I searched a bit and found that Python generators can be limited via the `itertools` module. I thought about wrapping the call to the generic method `DatasetReader._read()` in `DatasetReader.read()`; please let me know what you think.

In any case, it's pretty easy to work around this issue by manually creating a folder that contains a subset of the data, but I think this way is more convenient.
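The proposal of limiting `_read()` from inside `read()` could look roughly like the sketch below. `BaseReader`, `CountingReader`, and the `max_instances` parameter are hypothetical names chosen for this illustration. The sketch does not reproduce the library's actual `DatasetReader`, and it returns plain strings instead of real instances.

```python
from itertools import islice
from typing import Iterator, List, Optional


class BaseReader:
    """Hypothetical base class sketching the proposal; not AllenNLP's DatasetReader."""

    def __init__(self, max_instances: Optional[int] = None) -> None:
        self.max_instances = max_instances  # None means "read everything"

    def _read(self, file_path: str) -> Iterator[str]:
        raise NotImplementedError

    def read(self, file_path: str) -> List[str]:
        # islice(..., None) imposes no limit, so one code path covers both cases.
        return list(islice(self._read(file_path), self.max_instances))


class CountingReader(BaseReader):
    """Toy subclass that pretends every file holds 100 instances."""

    def _read(self, file_path: str) -> Iterator[str]:
        for i in range(100):
            yield f"{file_path}#{i}"


print(len(CountingReader().read("train.txt")))             # 100
print(CountingReader(max_instances=2).read("train.txt"))   # ['train.txt#0', 'train.txt#1']
```

Putting the limit in the base class means every subclass gets it without changes, at the cost of only supporting "first k instances" subsampling.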