facebookresearch / metaseq

Repo for external large-scale work

Add support for subshards #166

Closed · stephenroller closed this 2 years ago

stephenroller commented 2 years ago

🚀 Feature Request

Fast-forwarding our on-the-fly tokenizer can be very slow when our data shards are very large, taking over an hour in some cases.

One easy solution is to just chop the data into more shards, but that requires manual labor, and since our corpus is now composed of many hundreds of files, it gets annoying. So let's achieve the same effect in the data loader instead.

Sketch

https://github.com/facebookresearch/metaseq/blob/f5442a181c6b54dbcc1b56afc9c27b2092306e49/metaseq/tasks/streaming_language_modeling.py#L271

In practice, setting --data-subshards to 10 or 20 should be enough for us.
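
Roughly, the skipping could look like the sketch below. This is a minimal, hypothetical illustration of the idea, not the actual streaming_language_modeling.py change; the helper name and the 0-indexed epoch counter are assumptions.

```python
import itertools

def iter_subshard(documents, epoch, num_subshards):
    """Yield only this epoch's subshard of the documents.

    Hypothetical helper (name and 0-indexed epochs are assumptions, not
    metaseq code): with num_subshards=10, epoch 0 yields documents
    0, 10, 20, ...; epoch 1 yields 1, 11, 21, ...; so each pass over the
    shard only has to tokenize 1/num_subshards of it.
    """
    offset = epoch % num_subshards
    yield from itertools.islice(documents, offset, None, num_subshards)

# Tiny check with 3 subshards over 9 documents:
docs = [f"doc{i}" for i in range(9)]
assert list(iter_subshard(docs, epoch=0, num_subshards=3)) == ["doc0", "doc3", "doc6"]
assert list(iter_subshard(docs, epoch=1, num_subshards=3)) == ["doc1", "doc4", "doc7"]
```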

suchenzang commented 2 years ago

Wait, this is super confusing logic.

> use the epoch variable in order to skip documents: assuming 10 subshards, then on epoch 0 you'll take documents 0, 10, 20, ...; if epoch is 1, then you want documents 1, 11, 21, ...

epoch right now means shard, so epoch 12 (out of 30) is shard 12 in our dataset (right?). I thought sub-sharding was meant to index into each shard/epoch, and not round-robin across shards? What am I missing here @stephenroller?

stephenroller commented 2 years ago

Assuming 2 subshards (a, b) and 3 shards (0, 1, 2), we can iterate in two ways:

0a 1a 2a 0b 1b 2b (my original proposal)

Or we can iterate

0a 0b 1a 1b 2a 2b

Since it's called subshards, the latter better matches the name and is more intuitive. So I agree, let's change it.
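
To make the difference concrete, here is a small standalone sketch (not metaseq code; the function names and 0-indexed epoch counter are assumptions) that maps an epoch counter to a (shard, subshard) pair under each ordering:

```python
def order_a(epoch, num_shards, num_subshards):
    """Original proposal: 0a 1a 2a 0b 1b 2b.

    Every shard is visited once per subshard before moving to the next
    subshard, i.e. shard is the fast index and subshard is the slow one.
    """
    return epoch % num_shards, epoch // num_shards

def order_b(epoch, num_shards, num_subshards):
    """Agreed-on ordering: 0a 0b 1a 1b 2a 2b.

    Each shard is exhausted (all of its subshards) before moving on to
    the next shard, which matches the intuition behind "subshards".
    """
    return epoch // num_subshards, epoch % num_subshards

# With 3 shards and 2 subshards:
labels = "ab"
print([f"{s}{labels[ss]}" for s, ss in (order_a(e, 3, 2) for e in range(6))])
# ['0a', '1a', '2a', '0b', '1b', '2b']
print([f"{s}{labels[ss]}" for s, ss in (order_b(e, 3, 2) for e in range(6))])
# ['0a', '0b', '1a', '1b', '2a', '2b']
```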