Lightning-Universe / lightning-transformers

Flexible components pairing 🤗 Transformers with ⚡ PyTorch Lightning
https://lightning-transformers.readthedocs.io
Apache License 2.0

Improve DataModule support for large custom data #170

Closed SeanNaren closed 3 years ago

SeanNaren commented 3 years ago

🚀 Feature

In certain cases datasets are extremely large and either should not be tokenized upfront (rather tokenized on the fly), or should be tokenized in pieces, one subset at a time.

This may also tie into #22 which will introduce data pipelines, making it easier to work this in.
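For reference, a minimal sketch of what tokenize-on-the-fly could look like with a plain `torch.utils.data.Dataset` (the model name and `max_length` are just placeholders); each item is tokenized on access and padding is left to the collator:

```python
# A sketch of lazy tokenization, assuming the raw text already fits in memory
# as a list of strings; only the tokenization is deferred to __getitem__.
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class LazyTokenizedDataset(Dataset):
    def __init__(self, texts, tokenizer_name="bert-base-uncased", max_length=128):
        self.texts = texts
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize a single example on access; padding is deferred to a
        # padding-aware collate_fn so batches stay dynamically sized.
        return self.tokenizer(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
        )
```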

talolard commented 3 years ago

Hi, I asked for this so thought I'd add some thoughts.

Don't Copy on DDP

When I use DDP on a single machine, each worker/GPU makes a copy of the dataloader and thus the dataset. This is suboptimal for two reasons:

  1. If my dataset has a complex/long-running init (like tokenizing a lot of text), the startup time is frustratingly long.
  2. I'm spending lots of memory on all those copies of the dataset.

Ideally the dataset/dataloader would be shared between processes via shared memory; I'm not sure how feasible that is in PyTorch.
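For illustration, a rough sketch of one way to soften this, assuming the 🤗 `datasets` library is the backing store (the dataset name, cache path and model are placeholders): the heavy tokenization runs once in `prepare_data` (rank zero only) and is cached to disk; `setup` then memory-maps the cached Arrow files in every process instead of re-tokenizing or holding a full in-RAM copy per GPU.

```python
# Sketch of a LightningDataModule that tokenizes once and memory-maps the result.
import pytorch_lightning as pl
from datasets import load_dataset, load_from_disk
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding


class CachedTextDataModule(pl.LightningDataModule):
    def __init__(self, cache_dir="tokenized_cache", batch_size=32):
        super().__init__()
        self.cache_dir = cache_dir
        self.batch_size = batch_size

    def prepare_data(self):
        # Runs on a single process: the slow one-off tokenization lives here.
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        ds = load_dataset("imdb", split="train")
        ds = ds.map(
            lambda ex: tokenizer(ex["text"], truncation=True),
            batched=True,
            remove_columns=["text"],
        )
        ds.save_to_disk(self.cache_dir)

    def setup(self, stage=None):
        # Runs on every process: Arrow files are memory-mapped, not copied into RAM.
        self.train_ds = load_from_disk(self.cache_dir)
        self.collator = DataCollatorWithPadding(
            AutoTokenizer.from_pretrained("bert-base-uncased")
        )

    def train_dataloader(self):
        return DataLoader(
            self.train_ds, batch_size=self.batch_size, collate_fn=self.collator
        )
```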

Shuffle Iterable Datasets

A way to work around the above is to use an iterable dataset. But in DDP I need to implement some mechanism to shard and shuffle, otherwise each dataloader returns the same items and my batches are just duplicates of each other.
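A rough sketch of the kind of sharding/shuffling I mean (class and variable names are illustrative): each DDP rank and each DataLoader worker iterates over a disjoint slice of a shuffled index, so no two processes yield the same examples.

```python
# Sketch of an IterableDataset sharded across DDP ranks and DataLoader workers.
import random

import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info


class ShardedTextDataset(IterableDataset):
    def __init__(self, lines, seed=42):
        self.lines = lines
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call from the training loop so the shuffle differs per epoch.
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(len(self.lines)))
        random.Random(self.seed + self.epoch).shuffle(indices)

        # Shard across DDP ranks first...
        if dist.is_available() and dist.is_initialized():
            rank, world_size = dist.get_rank(), dist.get_world_size()
            indices = indices[rank::world_size]

        # ...then across DataLoader workers within each rank.
        worker = get_worker_info()
        if worker is not None:
            indices = indices[worker.id::worker.num_workers]

        for i in indices:
            yield self.lines[i]
```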

Collation Strategies (Batching)

The thing with batching for NLP is that padding and truncation depend on the longest item in the batch. This makes the separation of concerns between the dataset and the collation function ambiguous, e.g. should I tokenize each item separately in the dataset, or do it in the collation function?

If I tokenize in the dataset, I need to truncate/pad myself in the collator, which is wasteful because HF's tokenizers already do it for me correctly and quickly. On the other hand, if I tokenize in the collator I'm tokenizing on every batch, which slows down training (though I never measured by how much).
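For concreteness, a minimal sketch of the "tokenize in the collator" option (the model name, `max_length` and the toy data are placeholders): the dataset yields raw strings and the collate_fn calls the fast tokenizer once per batch, so padding to the longest item in the batch comes for free.

```python
# Sketch: dataset of raw strings, tokenization and dynamic padding per batch.
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def collate(batch_of_texts):
    # One tokenizer call per batch pads to the longest item in that batch.
    return tokenizer(
        batch_of_texts,
        padding="longest",
        truncation=True,
        max_length=256,
        return_tensors="pt",
    )


loader = DataLoader(
    ["some text", "a much longer piece of text"] * 16,
    batch_size=8,
    collate_fn=collate,
)
```

The other route (tokenize per item in the dataset, pad only in the collator) is essentially what `transformers.DataCollatorWithPadding` implements, and pairs with the lazy dataset sketched further up.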

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.