facebookresearch / metaseq

Repo for external large-scale work
MIT License

Make epoch mean epoch again (in streaming language modeling) #198

Open suchenzang opened 2 years ago

suchenzang commented 2 years ago

We currently have the following unfortunate naming: https://github.com/facebookresearch/metaseq/blob/4288451502667dda2be71a0a1a9df5066b583ae8/metaseq/tasks/streaming_language_modeling.py#L271-L290

where our training corpus is chunked up into shards, but each shard gets referenced as an epoch.

We should fix this confusing naming to make it clear that an epoch consists of shards (before repeating / looping over the same dataset again).
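A rough sketch of the intended relationship, assuming a 0-based running count of consumed shards. The names `split_shard_counter`, `shards_consumed`, and `shards_per_epoch` are made up for illustration and are not existing metaseq identifiers:

```python
from typing import Tuple


def split_shard_counter(shards_consumed: int, shards_per_epoch: int) -> Tuple[int, int]:
    """Map a 0-based running shard count to (epoch, shard_index).

    Hypothetical helper: an epoch is one full pass over all shards, so the
    epoch only advances once every shard has been consumed.
    """
    if shards_per_epoch <= 0:
        raise ValueError("shards_per_epoch must be positive")
    if shards_consumed < 0:
        raise ValueError("shards_consumed must be non-negative")
    # divmod gives (full passes completed, position within the current pass).
    epoch, shard_index = divmod(shards_consumed, shards_per_epoch)
    return epoch, shard_index
```

With 4 shards, the counter value 5 would then mean "epoch 1, shard 1" (both 0-based), rather than "epoch 5".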

Relates to https://github.com/facebookresearch/metaseq/pull/189 and https://github.com/facebookresearch/metaseq/issues/166.

Note: be careful with the rng state here. On restart, we want to make sure the shuffled dataset comes back in the same order.
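One way to keep the order reproducible is to derive the shuffle from a fixed seed plus the epoch number, so nothing beyond those two integers and a progress counter has to be checkpointed. This is only a sketch using the standard library's `random.Random`; `shard_order_for_epoch`, `remaining_shards`, `base_seed`, and `shards_done` are hypothetical names, and the actual task may track rng state differently:

```python
import random
from typing import List


def shard_order_for_epoch(num_shards: int, base_seed: int, epoch: int) -> List[int]:
    """Return a shuffled shard order that depends only on (base_seed, epoch)."""
    # A fresh generator per call: the result does not depend on any
    # global rng state, so a restarted job recomputes the same order.
    rng = random.Random(base_seed + epoch)
    order = list(range(num_shards))
    rng.shuffle(order)
    return order


def remaining_shards(
    num_shards: int, base_seed: int, epoch: int, shards_done: int
) -> List[int]:
    """Shards still to be read in this epoch after a restart (hypothetical)."""
    return shard_order_for_epoch(num_shards, base_seed, epoch)[shards_done:]
```

Here the seed is refreshed once per true epoch, and the position within the epoch (`shards_done`) is what a checkpoint would need to store.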

KUNAL1612 commented 2 years ago

Is the intent of this issue just to rename epoch to something more meaningful and intuitive, like shard_index?

suchenzang commented 2 years ago

@KUNAL1612 To some degree - you'll have to look at why it was named epoch in the first place (i.e. how epoch is used outside of this class). My cursory understanding is that we also have some kind of "shuffling" / rng state that is tracked / refreshed per "epoch", which gets hijacked here to be applied across "shards".