danbraunai / simple_stories_train

Trains small LMs. Designed for training on SimpleStories

Wrote simple streaming dataloader #10

Closed: lennart-finke closed this 1 week ago

lennart-finke commented 2 months ago

Description

Added a streaming dataloader to avoid storing large datasets.

Related Issue

Progress towards #8

danbraunai-apollo commented 2 months ago

Does this still work with distributed data parallel? I'm a bit concerned that this setup is more complex than needed. In general I think we'll need to support:

  1. The dataset and tokenizer will be hosted on huggingface.
  2. The pre-tokenized dataset will be hosted on huggingface (so we don't have to tokenize it on the fly every time we train).

I think we can just get away with using huggingface's load_dataset with streaming=True. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work for DDP. Not sure of the easiest way; there are probably standard setups here, maybe using a distributed sampler.

I've added the above to #8 (sorry I should have done this earlier).
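
For concreteness, something like this rough sketch is what I have in mind with load_dataset and streaming=True (the dataset and tokenizer names below are just placeholders, not the actual ones for this project):

```python
# Rough sketch only: stream an untokenized dataset from the Hub and tokenize
# on the fly. The dataset and tokenizer names are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer
stream = load_dataset("some-org/simple-stories", split="train", streaming=True)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

tokenized = stream.map(tokenize, remove_columns=["text"])
# `tokenized` is an IterableDataset; sharding it across DDP ranks still needs
# to be handled separately.
```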

lennart-finke commented 2 months ago

I had distributed data parallel in mind when writing it, and it could work, but I have not been able to test that.

Currently the code assumes that the tokenizer comes from tiktoken and that the streamed dataset is not tokenized.

I'm open to closing the PR if there is an easier solution; I'd leave that decision to you.

lennart-finke commented 2 months ago

After looking into this a bit more, there is also a Hugging Face built-in for this, the function split_dataset_by_node. If that makes sense, I'll adapt the file you mentioned to use it next.
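
For reference, a minimal sketch of how that built-in could be used with a streaming dataset (the dataset name is a placeholder, and rank / world size would normally come from the torch.distributed setup):

```python
# Sketch of using the built-in; the dataset name is a placeholder and rank /
# world size would normally come from the distributed launcher's environment.
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

stream = load_dataset("some-org/simple-stories", split="train", streaming=True)
shard = split_dataset_by_node(stream, rank=rank, world_size=world_size)
# Each rank now iterates over a disjoint subset of the stream.
```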

lennart-finke commented 1 week ago

> Looks good! Noting that I didn't run anything. There are no tests to check for things like the dataloader distributing things correctly and the tokenizer working properly. Adding such tests would be valuable, but also could be time consuming.

Good idea; I added a minimal test for the dataloader, in particular checking that it gives back the right shape and that different DDP ranks give different data.
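
Roughly, it checks something like the following (the create_dataloader name and arguments here are hypothetical stand-ins for the loader's actual interface, so treat this as a sketch rather than the exact test):

```python
# Sketch of such a test; `create_dataloader` and its arguments are
# hypothetical stand-ins for the loader's actual interface.
import torch

# from simple_stories_train.dataloaders import create_dataloader  # hypothetical import

def test_dataloader_shape_and_ddp_sharding():
    loader_rank0 = create_dataloader(rank=0, world_size=2, batch_size=4, seq_len=128)
    loader_rank1 = create_dataloader(rank=1, world_size=2, batch_size=4, seq_len=128)

    batch0 = next(iter(loader_rank0))
    batch1 = next(iter(loader_rank1))

    # Right shape on each rank...
    assert batch0.shape == (4, 128)
    assert batch1.shape == (4, 128)
    # ...and different ranks see different data.
    assert not torch.equal(batch0, batch1)
```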

> At a minimum, you should check that the dataloader does give you different data on each process (when you start doing multiprocessing) and that the tokenizer works as you'd expect.

Right, the tokenizer is still a bit of a mystery to me, and the current behaviour is likely not what we want.

For instance, encoding and decoding this story:

In a magical forest, there was a kind bear who loved to help. One day, he found a sparkling gem among the leaves. Curiously, he touched it and turned into a fluffy bunny! The bunny hopped with joy but soon noticed a sad squirrel up in a tree. The bunny hopped closer and asked, "Why are you sad?" The squirrel replied, "I can't find my acorns. I need them for winter!" The bunny wanted to help, so he put on his bunny ears and searched everywhere in the forest. "Don't worry, I'll find your acorns!" he promised.

gives:

a magical forest , there was a kind bear who loved to help . day , he found a sparkling gem among the leaves . , he touched it and turned into a fluffy bunny ! bunny hopped with joy but soon noticed a sad squirrel up in a tree . bunny hopped closer and asked , " are you sad ? " squirrel replied , " can \' t find my acorn ##s . need them for winter ! " bunny wanted to help , so he put on his bunny ears and searched everywhere in the forest . " \' t worry , \' l ##l find your acorn ##s ! " he promised .

Now that you point it out like that, I should at least understand what's happening there before merging...

lennart-finke commented 1 week ago

Now I understand what was going on: the dataloader needs to convert the text to lowercase for this tokenizer. With that change, it works as expected!

By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?
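
For concreteness, the normalization in question is roughly the following (a sketch; the exact code in the dataloader may differ):

```python
# Sketch of the normalization being discussed; the exact behaviour in the
# dataloader may differ.
import re

def normalize(text: str) -> str:
    text = text.lower()                       # lowercase for the uncased tokenizer
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(normalize("In a magical  forest,\nthere was a kind bear."))
# -> "in a magical forest, there was a kind bear."
```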

Apart from that, I believe it's ready for merging.

danbraunai commented 1 week ago

> By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?

No opinions myself, I haven't been following the tokenizer stuff. @nix-apollo?

nix-apollo commented 1 week ago

> By the way, any thoughts about whether to convert to lowercase and remove whitespace, as is currently done?
>
> No opinions myself, I haven't been following the tokenizer stuff. @nix-apollo?

My opinion is that having the tokenizer lowercase and remove whitespace is worth it for most research use cases. But it also adds complexity and confusion for people, as evidenced here. A tokenizer that doesn't preserve all information through tokenizing and detokenizing could certainly lead to bugs in the user's code, for instance (although at least in this case the information loss is pretty obvious).

On balance I'm currently in favor of lowercasing and some form of whitespace normalization, but that's a weakly held opinion.

lennart-finke commented 1 week ago

I see, thanks! Then I propose we keep the current tokenizer setup for the beta training run and maybe adjust it for the final run if we run into issues.