[x] Implemented a streaming data-fetching option (--streaming) to avoid downloading the full 1T tokens just to keep ~400M. Streaming is rather slow, however, and I don't think we can do better at the moment.
[x] Added a --seed option for reproducibility.
[x] Added the missing starcoder.txt (previously langs.txt).
[x] Renamed experience_replay.py to replay.py.
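For reviewers, here is a minimal sketch of the idea behind --streaming and --seed. It is not the code in this PR: `sample_from_stream`, `fake_stream`, `token_budget`, and `buffer_size` are hypothetical names, and counting tokens by whitespace splitting is an assumption made only to keep the example self-contained. The point is that documents are pulled lazily, shuffled through a small seeded buffer, and fetching stops as soon as the budget is met, so the full corpus is never downloaded.

```python
import itertools
import random
from typing import Iterable, Iterator, List


def sample_from_stream(
    docs: Iterable[str],
    token_budget: int,
    seed: int,
    buffer_size: int = 1000,
) -> Iterator[str]:
    """Yield documents from a lazy stream until ~token_budget tokens are kept.

    Hypothetical helper: a seeded shuffle buffer gives reproducible,
    approximately shuffled output without materializing the whole stream.
    """
    rng = random.Random(seed)  # private RNG so the seed fully fixes the order
    buffer: List[str] = []
    kept = 0

    for doc in docs:
        buffer.append(doc)
        if len(buffer) < buffer_size:
            continue  # keep filling the buffer before emitting anything
        # Pick a random buffered document, move it to the end, and pop it.
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        picked = buffer.pop()
        yield picked
        kept += len(picked.split())  # assumption: whitespace "tokens"
        if kept >= token_budget:
            return  # budget met: stop pulling from the source

    # The source ran out before the budget was met: drain what is left.
    rng.shuffle(buffer)
    for picked in buffer:
        yield picked
        kept += len(picked.split())
        if kept >= token_budget:
            return


def fake_stream() -> Iterator[str]:
    """Stand-in for a remote dataset: an endless stream of 3-token documents."""
    for i in itertools.count():
        yield f"doc {i} text"


if __name__ == "__main__":
    sample = list(sample_from_stream(fake_stream(), token_budget=30, seed=42, buffer_size=4))
    print(len(sample), "documents kept")
```

Running the sampler twice with the same seed over the same stream yields the same documents in the same order, which is what a --seed flag is meant to guarantee. The buffer size trades shuffle quality against memory and startup time, which may be part of why streaming feels slow.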