Support streaming from cloud storage for downloading training data

casper-hansen commented 1 year ago

⚠️ Please check that this feature request hasn't been suggested before.

[X] I searched previous Ideas in Discussions and didn't find any similar feature requests.
[X] I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Streaming data straight from cloud storage while a model trains is a great feature to have, because streaming is usually cheaper than renting large clusters and downloading a large dataset onto them up front. This becomes especially important when running multi-node.

The idea is that you store your data in S3 or other cloud storage and stream batches of examples directly as you train. This enables deterministic training and makes it easier to recover from a hardware failure.
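To make the determinism and recovery point concrete, here is a toy sketch. It assumes sample indices stand in for examples fetched from cloud storage, and the function names are hypothetical. It is not how MosaicML's streaming library implements this; it only shows why an order derived from (seed, epoch) lets a restarted job resume at a known batch.

```python
import random
from typing import Iterator, List, Tuple


def sample_order(num_samples: int, seed: int, epoch: int) -> List[int]:
    """Return a shuffled sample order that depends only on (seed, epoch)."""
    order = list(range(num_samples))
    # Derive one integer seed per epoch so every node computes the same order.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def stream_batches(
    num_samples: int,
    batch_size: int,
    seed: int,
    epoch: int,
    start_batch: int = 0,
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (batch_index, sample_indices), skipping batches before start_batch.

    In a real setup the indices would be mapped to objects in cloud storage
    and fetched on demand; here they are just integers (an assumption).
    """
    order = sample_order(num_samples, seed, epoch)
    for batch_idx, start in enumerate(range(0, num_samples, batch_size)):
        if batch_idx < start_batch:
            continue  # already consumed before the failure
        yield batch_idx, order[start:start + batch_size]


# A fresh run and a run resumed at batch 2 see identical batches from there on.
full_run = list(stream_batches(10, 3, seed=0, epoch=0))
resumed_run = list(stream_batches(10, 3, seed=0, epoch=0, start_batch=2))
```

Only the seed, epoch, and last completed batch index need to be checkpointed for the resumed run to continue where the failed one stopped.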

✔️ Solution

Integrate with MosaicML's streaming library. https://github.com/mosaicml/streaming

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this feature has not been requested yet.
[X] I have provided enough information for the maintainers to understand and evaluate this request.

NanoCode012 commented 1 year ago

I just created PR #765 for loading datasets from cloud storage (S3, GCS). This is not the same as streaming, since it downloads the entire dataset, but I wanted to share it in case it fits anyone's use case.

Streaming is definitely on our TODO radar as well!

casper-hansen commented 1 year ago

The new PR covers a good use case as well. We still need streaming so that data can be streamed in during training.

fmv1992 commented 7 months ago

I'm working on this.

I propose the following addition:

    # loading from s3 or gcs
    # s3 creds will be loaded from the system default and gcs only supports public access
  - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.

→

    # loading from s3 or gcs
    # s3 creds will be loaded from the system default and gcs only supports public access
  - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
    streaming: true

So the change is a new key, streaming, on the dataset entry next to path, with a default value of false. If set to true, loading shall use StreamingDataset.
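A minimal sketch of how that key could be read, assuming the dataset entry has already been parsed into a dict. The helper names uses_streaming and pick_loader are hypothetical and not part of axolotl.

```python
from typing import Any, Dict


def uses_streaming(dataset_cfg: Dict[str, Any]) -> bool:
    """Read the proposed `streaming` key, defaulting to False when absent."""
    return bool(dataset_cfg.get("streaming", False))


def pick_loader(dataset_cfg: Dict[str, Any]) -> str:
    """Return which loading path would handle this dataset entry.

    "streaming" stands for the StreamingDataset route and "download" for
    the existing behaviour of fetching the whole dataset first.
    """
    return "streaming" if uses_streaming(dataset_cfg) else "download"


# Entries as they might look after the YAML config is parsed (hypothetical values).
plain_entry = {"path": "s3://path_to_ds"}
streaming_entry = {"path": "s3://path_to_ds", "streaming": True}
```

With a YAML 1.1 parser such as PyYAML, both yes/no and true/false load as booleans, so either spelling in the config ends up as True or False here.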

What do you think?

NanoCode012 commented 7 months ago

@fmv1992, hey, thanks for the comment. This sounds great. Would you be able to let us know what’s the drawback to using mosaic’s streaming dataset method?

fmv1992 commented 7 months ago

@NanoCode012 ,

> Would you be able to let us know what’s the drawback to using mosaic’s streaming dataset method?

The only drawback I can see is the introduction of a new optional dependency (with everything that comes with it). Other than that their implementation looks pretty solid.

The alternative would be for us to implement this ourselves, and this can get quite big depending on the features you want. At the very least one needs to download a few files in parallel (or use a "view" of a file that supports this partial reading feature), manage their rotation after use, and keep track of this rotation. All in all using a 3rd party library seems the most efficient solution here.
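For a sense of scale, here is a rough sketch of the minimum a home-grown version would need: a few downloads in flight, shards handed over in order, and bookkeeping of what has been consumed. The fetch callable is a hypothetical stand-in for a cloud download, and none of this reflects how MosaicML's library is built.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Tuple


def iter_shards(
    shard_names: List[str],
    fetch: Callable[[str], bytes],
    prefetch: int = 2,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, data) in order while fetching up to `prefetch` shards ahead."""
    if prefetch < 1:
        raise ValueError("prefetch must be at least 1")
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        pending: Dict[int, Future] = {}  # shard index -> in-flight download
        next_to_submit = 0
        for current in range(len(shard_names)):
            # Top up the window so at most `prefetch` shards are held at once.
            while (
                next_to_submit < len(shard_names)
                and next_to_submit < current + prefetch
            ):
                name = shard_names[next_to_submit]
                pending[next_to_submit] = pool.submit(fetch, name)
                next_to_submit += 1
            # Rotation: once a shard is handed over, drop our reference to it.
            data = pending.pop(current).result()
            yield shard_names[current], data


def fake_fetch(name: str) -> bytes:
    """Stand-in for a real download; returns the shard name as bytes."""
    return name.encode()


shards = list(iter_shards(["a", "b", "c", "d"], fake_fetch, prefetch=2))
```

A real implementation would also need on-disk caching, retries, shuffling, and multi-worker coordination, which is where the scope grows quickly.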

I'm eager to hear your thoughts and I'm open to alternatives as well.

ehartford commented 7 months ago

What about pretraining

fmv1992 commented 7 months ago

> What about pretraining

I like the idea, but this is my first contribution to this repo, so I would feel more comfortable keeping this change as small in size and scope as possible. I think supporting pretraining later will be easy once we have agreed on the details and this PR is merged.

NanoCode012 commented 7 months ago

Mosaic's streaming sounds like a solid option. The dependency should be fine, as the base packages it requires do not seem to clash with our current packages.

Do you want to outline the changes you plan to make first, so we can run through them, or would you prefer to open a PR directly?

fmv1992 commented 7 months ago

@NanoCode012 , this is a sketch of what I'm doing:

[Image: git diff of the proposed change]

Any criticism is welcome.

If I move the import inside load_streaming_dataset, users who never enable streaming will not hit import errors from the optional package. I've seen the alternative pattern of:

has_mosaic_streaming_support = False
try:
    from streaming import StreamingDataset
    has_mosaic_streaming_support = True
except ImportError:
    pass

NanoCode012 commented 7 months ago

I think the section you're editing is for pretraining_dataset. If that was your intention, ignore the next part; otherwise, you would need to edit the current cloud section at https://github.com/OpenAccess-AI-Collective/axolotl/blob/5ed29393e34cf57b24a20ac1bafa3a94272ac3f5/src/axolotl/utils/data/sft.py#L221-L250 and https://github.com/OpenAccess-AI-Collective/axolotl/blob/5ed29393e34cf57b24a20ac1bafa3a94272ac3f5/src/axolotl/utils/data/sft.py#L318-L333 .

One part I recall about Mosaic streaming was that it required conversion to its DatasetFormat: https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/dataset_format.html#introduction

How would you deal with this? Or does this expect that the user has already converted it?

Imo, the current dataset cloud implementation is quite bare. I'm open to replacing it with this streaming method if it's cleaner.

Re: import. I don't think we have a specific preference. Maybe just a simple utility, check_streaming_installed. See this function: check_mamba_ssm_installed
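Such a utility could look roughly like the sketch below. It assumes a find_spec-based check; check_optional_installed is a hypothetical helper, and the body is not copied from check_mamba_ssm_installed.

```python
import importlib.util


def check_optional_installed(module_name: str, install_hint: str) -> None:
    """Raise ImportError with a helpful message if a top-level module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required for this feature. {install_hint}"
        )


def check_streaming_installed() -> None:
    """Check for the optional `streaming` module before using StreamingDataset."""
    check_optional_installed(
        "streaming",
        "Install it with `pip install mosaicml-streaming`.",
    )
```

Calling check_streaming_installed() at the top of the streaming code path gives users a clear error without importing the package at module load time.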

fmv1992 commented 7 months ago

I thought it was better to add the PR directly: https://github.com/OpenAccess-AI-Collective/axolotl/pull/1525 . Let me know what you think.

(I suggest we move further discussion to that PR).

axolotl-ai-cloud / axolotl