Closed by SeanNaren 3 years ago
Hi, I asked for this, so I thought I'd add some thoughts.
When I use DDP on a single machine, each worker/GPU makes its own copy of the dataloader, and thus of the dataset. This is suboptimal for two reasons.
Ideally the dataset/dataloader would be shared via shared memory between processes; I'm not sure how feasible that is in PyTorch.
A way to work around the above is to use an iterable dataset. But in DDP I then need to implement some mechanism to shard or "shuffle" the stream myself, otherwise every dataloader returns the same examples and my batches are just copies of each other across ranks (see the sketch below).
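For what it's worth, here is a minimal sketch of that sharding, assuming the samples come from an in-memory iterable; the class name and the stride-based scheme are mine, not anything built into PyTorch:

```python
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info

class ShardedIterable(IterableDataset):
    """Shard a stream across DDP ranks and dataloader workers so that
    each (rank, worker) pair yields a disjoint slice of the data."""

    def __init__(self, samples):
        self.samples = samples  # any iterable of examples

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        # Every (rank, worker) pair gets a unique shard index, and we
        # stride through the stream so no two processes see the same item.
        shard = rank * num_workers + worker_id
        num_shards = world_size * num_workers
        for i, sample in enumerate(self.samples):
            if i % num_shards == shard:
                yield sample
```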
The thing with batching for NLP is that padding and truncation depend on the longest item in the batch. This makes the separation of concerns between the dataset and the collation function ambiguous: for example, should I tokenize each item separately in the dataset, or do it in the collation function?
If I tokenize in the dataset, I need to truncate/pad myself in the collator, which is wasteful because HF's tokenizers already do it for me correctly and fast. On the other hand, if I tokenize in the collator, I'm re-tokenizing on every batch, which slows down training (though I never measured by how much).
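To make the second option concrete, here is a sketch of a collator that leans on the HF tokenizer for per-batch padding/truncation; the `"text"` field and the model name are assumptions on my part, not anything from this issue:

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # model name is just an example

def collate_fn(batch):
    # Tokenize per batch so padding adapts to the longest item in *this*
    # batch rather than some global maximum length.
    texts = [example["text"] for example in batch]  # assumes dict examples with a "text" field
    return tokenizer(
        texts,
        padding=True,      # pad to the longest sequence in the batch
        truncation=True,   # truncate to the model's max length
        return_tensors="pt",
    )

# loader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn)
```

The upside is correct dynamic padding for free; the downside is exactly the per-batch tokenization cost mentioned above.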
🚀 Feature
In certain cases datasets are extremely large and should either not be tokenized up front (rather done on the fly), or be tokenized in pieces, a subset at a time (see the sketch below).
This may also tie into #22, which will introduce data pipelines, making it easier to work this in.
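As one possible shape for the "on the fly" half, here is a rough sketch of a map-style dataset that keeps raw text and tokenizes per access; all names here are hypothetical:

```python
from torch.utils.data import Dataset
from transformers import AutoTokenizer

class LazyTokenizingDataset(Dataset):
    """Store raw texts and tokenize one example at a time on access,
    so the full corpus is never tokenized up front."""

    def __init__(self, texts, tokenizer_name="bert-base-uncased", max_length=512):
        self.texts = texts
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenization happens lazily, only for the requested example.
        return self.tokenizer(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
        )
```

Per-example outputs are still unpadded, so batching them circles back to the collator question above.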