axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Support streaming from cloud storage for downloading training data #585

Open casper-hansen opened 11 months ago

casper-hansen commented 11 months ago


🔖 Feature description

Streaming data straight from cloud storage while a model is training would be a great feature, because streaming is cheaper than renting a large cluster and waiting for a large dataset to download. This becomes especially important when running multi-node.

The idea is that you store your data in S3 or other cloud storage and stream batches of examples directly as you train. This enables deterministic training and makes it easier to recover from a hardware failure.

✔️ Solution

Integrate with MosaicML's streaming library. https://github.com/mosaicml/streaming

❓ Alternatives

No response

📝 Additional Context

No response


NanoCode012 commented 10 months ago

I just created PR #765 for loading datasets from cloud storage (S3, GCS). This is not the same as streaming, since it downloads the entire dataset, but I wanted to share it in case it fits anyone's use case.

Streaming is definitely on our TODO radar as well!

casper-hansen commented 10 months ago

The new PR covers a good use case as well. We just need streaming enabled to stream in the data.

fmv1992 commented 4 months ago

I'm working on this.

I propose the following addition:

    # loading from s3 or gcs
    # s3 creds will be loaded from the system default and gcs only supports public access
    - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.

→

    # loading from s3 or gcs
    # s3 creds will be loaded from the system default and gcs only supports public access
    - path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
      streaming: true

So, a new key (streaming) next to path in each datasets entry, with a default value of false. If set to true, the loader shall use StreamingDataset.

What do you think?
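For illustration only, here is a minimal sketch of how such a flag could be read from one datasets entry and turned into a loading decision. The function names (wants_streaming, pick_loader), the strategy labels, and the set of URL schemes are all assumptions made for this example, not axolotl's actual code.

```python
from typing import Any, Dict


def wants_streaming(dataset_cfg: Dict[str, Any]) -> bool:
    """Return True only when the entry explicitly enables streaming."""
    # Hypothetical key from the proposal above; a missing key means "off".
    return bool(dataset_cfg.get("streaming", False))


def pick_loader(dataset_cfg: Dict[str, Any]) -> str:
    """Name the (hypothetical) loading strategy for one datasets entry."""
    path = str(dataset_cfg["path"])
    # Assumed set of cloud URL schemes for this sketch.
    is_cloud = path.startswith(("s3://", "gs://", "gcs://"))
    if is_cloud and wants_streaming(dataset_cfg):
        return "stream"  # would hand off to a streaming dataset
    if is_cloud:
        return "download"  # fetch the whole dataset before training
    return "local"  # the streaming flag is ignored for local paths here


entries = [
    {"path": "s3://path_to_ds"},
    {"path": "s3://path_to_ds", "streaming": True},
]
strategies = [pick_loader(entry) for entry in entries]
```

Defaulting to "off" keeps every existing config working unchanged, which is the main point of making the key optional.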

NanoCode012 commented 4 months ago

@fmv1992, hey, thanks for the comment. This sounds great. Would you be able to let us know what's the drawback to using mosaic's streaming dataset method?

fmv1992 commented 4 months ago

@NanoCode012,

> Would you be able to let us know what's the drawback to using mosaic's streaming dataset method?

The only drawback I can see is the introduction of a new optional dependency (with everything that comes with it). Other than that their implementation looks pretty solid.

The alternative would be for us to implement this ourselves, and this can get quite big depending on the features you want. At the very least one needs to download a few files in parallel (or use a "view" of a file that supports this partial reading feature), manage their rotation after use, and keep track of this rotation. All in all using a 3rd party library seems the most efficient solution here.

I'm eager to hear your thoughts and I'm open to alternatives as well.
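To give a feel for the moving parts of a hand-rolled version, here is a rough sketch using only the standard library. It prefetches a few shards in parallel, reads them in order, and deletes each one after use. The "remote" side is simulated with a local directory, and the names fetch and stream_lines are hypothetical; a real implementation would also need retries, shuffling, and resumable state.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterator, List


def fetch(remote: Path, cache: Path) -> Path:
    """Copy one shard into the cache (stand-in for an S3/GCS download)."""
    local = cache / remote.name
    shutil.copyfile(remote, local)
    return local


def stream_lines(shards: List[Path], cache: Path, prefetch: int = 2) -> Iterator[str]:
    """Yield lines shard by shard, keeping at most `prefetch` downloads pending."""
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        pending = [pool.submit(fetch, shard, cache) for shard in shards[:prefetch]]
        next_idx = len(pending)
        while pending:
            local = pending.pop(0).result()  # wait for the oldest download
            if next_idx < len(shards):
                # Keep the pipeline full while the current shard is read.
                pending.append(pool.submit(fetch, shards[next_idx], cache))
                next_idx += 1
            with local.open() as handle:
                for line in handle:
                    yield line.rstrip("\n")
            local.unlink()  # rotate: drop the shard once it has been consumed


with tempfile.TemporaryDirectory() as tmp:
    remote_dir = Path(tmp, "remote")
    cache_dir = Path(tmp, "cache")
    remote_dir.mkdir()
    cache_dir.mkdir()
    shards = []
    for i in range(4):
        shard = remote_dir / f"shard-{i}.txt"
        shard.write_text(f"example {i}a\nexample {i}b\n")
        shards.append(shard)
    streamed = list(stream_lines(shards, cache_dir))
    leftover = list(cache_dir.iterdir())
```

Even this toy version has to track which shard is next and when a cached file can be deleted, which is the bookkeeping a third-party library would take over.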

ehartford commented 4 months ago

What about pretraining

fmv1992 commented 4 months ago

> What about pretraining

I like the idea, but this is my first contribution to this repo. I would feel more comfortable keeping this as small in size and scope as possible. I think supporting pretraining later will be easy once we have agreed on the details and this PR is merged.

NanoCode012 commented 4 months ago

Mosaic's streaming sounds like a solid option. The dependency should be fine, as the base packages it requires do not seem to clash with the current packages.

Do you want to outline the changes you plan to make first, so we can run through them, or would you prefer making a PR directly instead?

fmv1992 commented 4 months ago

@NanoCode012 , this is a sketch of what I'm doing:

[Image: screenshot of the git diff]

Any criticism is welcome.

If I move the import inside load_streaming_dataset, I avoid import errors when the optional package is not installed. I've seen the alternative pattern of:

has_mosaic_streaming_support = False
try:
    from streaming import StreamingDataset
    has_mosaic_streaming_support = True
except ImportError:
    pass

NanoCode012 commented 4 months ago

I think the section you're editing is for pretraining_dataset. If that was your intention, ignore the next part; otherwise, you would need to edit the current cloud sections instead: https://github.com/OpenAccess-AI-Collective/axolotl/blob/5ed29393e34cf57b24a20ac1bafa3a94272ac3f5/src/axolotl/utils/data/sft.py#L221-L250 and https://github.com/OpenAccess-AI-Collective/axolotl/blob/5ed29393e34cf57b24a20ac1bafa3a94272ac3f5/src/axolotl/utils/data/sft.py#L318-L333


One thing I recall about Mosaic streaming is that it requires converting the data to its own dataset format: https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/dataset_format.html#introduction

How would you deal with this? Or does this expect that the user has already converted it?
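For context, the conversion step could look roughly like the sketch below, which assumes the user runs it once ahead of training. It relies on the MDSWriter class from the mosaicml-streaming package; the wrapper name write_mds and the single "text" column in the usage comment are assumptions for this example, and the import is kept inside the function so the package stays optional.

```python
from typing import Dict, Iterable


def write_mds(samples: Iterable[Dict[str, str]], out_dir: str, columns: Dict[str, str]) -> int:
    """Write samples as MDS shards under out_dir and return how many were written."""
    try:
        # Optional dependency, installed with: pip install mosaicml-streaming
        from streaming import MDSWriter
    except ImportError as exc:
        raise ImportError("mosaicml-streaming is required to write MDS shards") from exc

    count = 0
    # `columns` maps each field name to an MDS encoding, e.g. {"text": "str"}.
    with MDSWriter(out=out_dir, columns=columns) as writer:
        for sample in samples:
            writer.write(sample)
            count += 1
    return count


# Hypothetical usage with a one-column text dataset:
# write_mds([{"text": "hello"}, {"text": "world"}], "./my_ds_mds", {"text": "str"})
```

Whether axolotl should run a step like this automatically or expect an already converted dataset is exactly the open question here.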

Imo, the current cloud dataset implementation is quite bare. I'm open to replacing it with this streaming method if that's cleaner.


Re: import. I don't think we have a specific preference. Maybe just a simple utility, check_streaming_installed. See this function: check_mamba_ssm_installed
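A utility along those lines might look like the following sketch. It only checks whether the module can be found and raises a readable error if not. The module_name parameter and the wording of the message are assumptions added for illustration, and this is not a copy of the existing check_mamba_ssm_installed helper.

```python
import importlib.util


def check_streaming_installed(module_name: str = "streaming") -> None:
    """Raise ImportError with an install hint when the optional module is missing."""
    # find_spec returns None for a top-level module that cannot be located,
    # without actually importing it.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Streaming datasets require the '{module_name}' module. "
            "Install it with `pip install mosaicml-streaming`."
        )
```

Calling this at the top of the streaming code path gives users an actionable message instead of a bare ModuleNotFoundError from deep inside the loader.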

fmv1992 commented 4 months ago

I thought it was better to add the PR directly: https://github.com/OpenAccess-AI-Collective/axolotl/pull/1525 . Let me know what you think.

(I suggest we move further discussion to that PR).