Open casper-hansen opened 1 year ago
I just created PR #765 for loading datasets from cloud storage (S3, GCS). This is not the same as streaming, as it downloads the entire dataset, but I wanted to share it in case it fits anyone's use case.
Streaming is definitely on our TODO radar as well!
The new PR is a good use case as well. We just need streaming enabled to stream in the data.
I'm working on this.
I propose the following addition:
# loading from s3 or gcs
# s3 creds will be loaded from the system default and gcs only supports public access
- path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
↓
# loading from s3 or gcs
# s3 creds will be loaded from the system default and gcs only supports public access
- path: s3://path_to_ds # Accepts folder with arrow/parquet or file path like above. Supports s3, gcs.
  streaming: true
So a new key next to datasets.path (streaming), with a default value of no. If set to yes, it will use StreamingDataset.
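As a rough illustration of the proposed key, the sketch below shows how a per-dataset `streaming` flag that defaults to off could select a loading path. The names `wants_streaming` and `pick_loader` are hypothetical and are not axolotl functions; accepting the spellings yes/no as strings is an assumption, for the case where the config parser leaves them unconverted.

```python
# Hypothetical helpers for illustration only; not part of axolotl.
_TRUE = {"yes", "true", "on", "1"}
_FALSE = {"no", "false", "off", "0"}


def wants_streaming(ds_cfg: dict) -> bool:
    """Return the value of the proposed `streaming` key, defaulting to False."""
    value = ds_cfg.get("streaming", False)
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"invalid value for 'streaming': {value!r}")


def pick_loader(ds_cfg: dict) -> str:
    """Name the loading path a dataset entry would take."""
    return "streaming" if wants_streaming(ds_cfg) else "download"
```

With this shape, an existing entry that has no `streaming` key keeps the current download behaviour.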
What do you think?
@fmv1992, hey, thanks for the comment. This sounds great. Would you be able to let us know what's the drawback to using Mosaic's streaming dataset method?
@NanoCode012 ,
Would you be able to let us know what's the drawback to using Mosaic's streaming dataset method?
The only drawback I can see is the introduction of a new optional dependency (with everything that comes with it). Other than that their implementation looks pretty solid.
The alternative would be for us to implement this ourselves, and this can get quite big depending on the features you want. At the very least one needs to download a few files in parallel (or use a "view" of a file that supports this partial reading feature), manage their rotation after use, and keep track of this rotation. All in all using a 3rd party library seems the most efficient solution here.
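To give a sense of what the do-it-yourself route involves, here is a minimal sketch of the parallel download and rotation described above. It assumes a caller-supplied `fetch(url)` function that downloads one shard and returns its local path; `iter_shards` and `fetch` are hypothetical names, not part of axolotl or of Mosaic's library.

```python
import itertools
import os
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def iter_shards(shard_urls, fetch, prefetch=2):
    """Yield local shard paths in order, keeping up to `prefetch` downloads
    in flight and deleting each shard once the consumer is done with it."""
    urls = iter(shard_urls)
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        # Start the first `prefetch` downloads in parallel.
        pending = deque(
            pool.submit(fetch, url) for url in itertools.islice(urls, prefetch)
        )
        while pending:
            path = pending.popleft().result()
            # Refill the window with the next shard, if there is one.
            for url in itertools.islice(urls, 1):
                pending.append(pool.submit(fetch, url))
            try:
                yield path
            finally:
                # Rotation: drop the shard after it has been consumed.
                if os.path.exists(path):
                    os.remove(path)
```

Even this small version leaves out shuffling, resuming after a failure, multi-node coordination, and cleanup of prefetched shards on early exit, which is where the size of a home-grown implementation would grow.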
I'm eager to hear your thoughts and I'm open to alternatives as well.
What about pretraining?
What about pretraining?
I like the idea, but this is my first contribution to this repo. I would feel more comfortable keeping this as small in size and scope as possible. I think supporting pretraining later will be easy once this PR is merged and we have agreed on the details.
Mosaic's streaming sounds like a solid option. The dependency should be fine, as the base packages it requires do not seem to clash with the current packages.
Do you want to outline the changes you plan to make first, so we can run through them, or would you prefer making a PR directly instead?
@NanoCode012 , this is a sketch of what I'm doing:
Any criticism is welcome.
If I move the import inside load_streaming_dataset, I prevent any import errors from the optional package. I've seen the alternative pattern of:
has_mosaic_streaming_support = False
try:
    from streaming import StreamingDataset
    has_mosaic_streaming_support = True
except ImportError:
    pass
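For comparison, the first option (importing inside the function) could look roughly like the sketch below. The parameters of `load_streaming_dataset` are assumptions, since the thread does not show its signature; `remote`, `local`, `batch_size`, and `shuffle` are keyword arguments of Mosaic's `StreamingDataset`, and `mosaicml-streaming` is the package that provides the `streaming` module.

```python
def load_streaming_dataset(remote: str, local: str, batch_size=None, shuffle=False):
    """Build a StreamingDataset, importing the optional package lazily.

    The signature is an assumption made for this sketch.
    """
    try:
        # Imported here so that runs without streaming never need the package.
        from streaming import StreamingDataset
    except ImportError as exc:
        raise ImportError(
            "streaming support requires the optional 'mosaicml-streaming' package"
        ) from exc
    return StreamingDataset(
        remote=remote, local=local, batch_size=batch_size, shuffle=shuffle
    )
```

The trade-off is that a missing package is only reported when a streaming dataset is first requested, whereas the module-level flag makes availability visible at import time.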
I think the section you're editing is for pretraining_dataset. If that was your intention, ignore the next part; otherwise, you would need to edit the current cloud section: https://github.com/OpenAccess-AI-Collective/axolotl/blob/5ed29393e34cf57b24a20ac1bafa3a94272ac3f5/src/axolotl/utils/data/sft.py#L221-L250 and https://github.com/OpenAccess-AI-Collective/axolotl/blob/5ed29393e34cf57b24a20ac1bafa3a94272ac3f5/src/axolotl/utils/data/sft.py#L318-L333.
One part I recall about Mosaic streaming was that it required conversion to its DatasetFormat: https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/dataset_format.html#introduction
How would you deal with this? Or does this expect that the user has already converted it?
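If the conversion were left to the user, it could be a short one-off script. The sketch below uses `MDSWriter` from Mosaic's `streaming` package with a single text column; the helper name `convert_to_mds` and the `text` column name are assumptions for illustration, not anything decided in this thread.

```python
def convert_to_mds(samples, out_dir: str, text_column: str = "text") -> int:
    """Write an iterable of dicts to `out_dir` as MDS shards.

    Returns the number of samples written. Hypothetical helper.
    """
    try:
        from streaming import MDSWriter  # optional dependency
    except ImportError as exc:
        raise ImportError("conversion requires 'mosaicml-streaming'") from exc

    count = 0
    # Each column maps to an MDS encoding; plain text uses "str".
    with MDSWriter(out=out_dir, columns={text_column: "str"}) as writer:
        for sample in samples:
            writer.write({text_column: sample[text_column]})
            count += 1
    return count
```

The resulting directory, with its shard files and `index.json`, is what would then be uploaded to S3 or GCS and referenced from the config.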
Imo, the current dataset cloud implementation is quite bare. I'm open to replacing it with this streaming method if it's cleaner.
Re: import. I don't think we have a specific preference. Maybe just a simple utility, check_streaming_installed. See this function: check_mamba_ssm_installed
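A minimal sketch of such a utility, assuming it only needs to confirm that the module can be found and to raise a helpful error otherwise. The body of `check_mamba_ssm_installed` is not reproduced here; the `find_spec` approach and the generic helper `check_optional_package` are assumptions.

```python
import importlib.util


def check_optional_package(module: str, pip_name: str) -> None:
    """Raise ImportError with an install hint if `module` cannot be found."""
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"{module} is not installed. Install it with `pip install {pip_name}`."
        )


def check_streaming_installed() -> None:
    """Guard to call before taking the streaming code path."""
    check_optional_package("streaming", "mosaicml-streaming")
```

Calling `check_streaming_installed()` at config validation time would surface a missing dependency before any data loading starts.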
I thought it was better to add the PR directly: https://github.com/OpenAccess-AI-Collective/axolotl/pull/1525 . Let me know what you think.
(I suggest we move further discussion to that PR).
Please check that this feature request hasn't been suggested before.
Feature description
Streaming data straight from cloud storage as the training of a model is ongoing is a great feature to have because it will inevitably be cheaper to stream than to rent large clusters and download a large dataset. Especially when running multi-node, this becomes important.
The idea is that you can store your data in S3 or other cloud storage and stream batches of examples directly as you train. This enables deterministic training and makes it easier to recover from a hardware failure.
Solution
Integrate with MosaicML's streaming library. https://github.com/mosaicml/streaming
Alternatives
No response
Additional Context
No response
Acknowledgements