US: Integrate HuggingFace's Datasets library with Oracle Cloud Storage

sdake commented 11 months ago

As data scientists or an AI practitioners, we need to use and share datasets consistently without a network connection to HuggingFace Hub, so that we can build new capabilities.

Acceptance Criteria:

[ ] ONLY *.jsonl supported.
[ ] Our reference dataset is Realnews. I have the dataset in our Lab storage. Contact @sdake.
[ ] implementation of command line tool wisdomctl.
[ ] Artificial Wisdom™ Learning Library SHALL offer a SendToSource() and GetFromSource().
[ ] GetFromSource() reads a dataset from Oracle Cloud Object Storage
[ ] SendToSource() sends a dartaset to Oracle Cloud Storage.
[ ] Manual testing that verifies the dataset can be both read and written.
[ ] DOD version 0 met.

References:

[ ] Oracle Cloud Storage fsspec integration.
[ ] Huggingface Datasets Python Library.
[ ] Arrow and Huggingface.

Whereas:

Huggingface has a strong preference the Apache Arrow format when using their Datasets library.
Huggingface provides an integration point for cloud providers using fsspec.
Oracle provides an fsspec implementation of their Oracle Cloud Storage.

Therefore: The shortest path to an implementation that can be evaluated involves the listed components for their listed responsibilities.

[ ] Future iterations will divide the command line tool into a REST API and a command line tool that accesses the REST API, assuming the evaluation of the dataset is reliable and performant.
[ ] Future iterations will add a ReplicateHuggingFaceDataset() API.

MostAwesomeDude commented 10 months ago

I think I could put together a short Python script to serve as a prototype wisdomctl. Just to be clear, you want JSON Lines locally and Arrow in Cloud Storage? Or JSON Lines both locally and in Cloud Storage?

sdake commented 10 months ago

@MostAwesomeDude Lets make .jsonl (a row-based data format) work well, initially.

Here is where I started:

obtained the 130 gibi dataset in a compressed .jsonl.
Implemented shell script to select the dataset type (train, validate, or test), and write to a massive file (so, 1 train file, 1 test file)
Implemented shell script to split both the train and validate datasets into 32 independent shards with the same number of documents.

After having a baseline understanding by experiencing workflows related to datasets, I then read the Datasets source code and found:

A split generator
A shard manager.

To obtain the realnews dataset, I filled out a google form, and then a link was sent to me containing a compressed .jsonl.

All that to say, I was learning the key requirements of managing one dataset.

Arrow (versus Parquet), are structural. I don't hold a strong opinion of what the lower layers of the dataset are implemented against, atleast not yet.

My primary goal was to use this dataset. The next goal will be to use the dataset effeciently.

Over the next few days, I plan to convert my tooling to work with datasets instead of raw .jsonl. files. To do this, I plan to build a dataset as described here:

https://huggingface.co/docs/datasets/about_dataset_load#building-a-dataset

Once this dataset is built, the dataset, whether it be parquet, or arrow, will be stored in Oracle. cloud.

Efficiently accessing it is what this user story is about. As in, its not good to download 120 GB file, and then create a bunch of files for split, versus train, versus all of the shards.

All this stated, I had found a dataset library written in rust about 6 weeks ago. I can't relocate it at this time, but I'd really like our dataset lbrary written in rust (becaue Python's garbage collection on string dataa is super ineffeciedint).

Again though, lets not cross ideas in different user stories. The purpose of this user story is to determine if Datasets can be streamed from OCI storage easily using fsspec.

That is to say,

sdake commented 10 months ago

@MostAwesomeDude take a look at this reference. When I wrote the user story, I found this, but misplaced it with my 300 chrome windows.

https://huggingface.co/docs/datasets/filesystems#oracle-cloud-object-storage

thanks -steve

sdake commented 10 months ago

And also this. https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=AbstractFileSystem#built-in-implementations.

Can you edit the user story and add these two references? If you would rather I do so, please let me know. Football is a 100 yard field.

sdake commented 10 months ago

Robert mentioned dask as an option. https://huggingface.co/docs/datasets/filesystems#dask

sdake commented 10 months ago

THis may also be valuable: https://github.com/huggingface/datasets/blob/de6391d732ea0471ee5bdfb91b8cecc4503da96b/src/datasets/search.py#L553-L567

sdake commented 10 months ago

@MostAwesomeDude fwiw, we will be adopting arrow and feather. Here is the filesystem interface:

https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow

sdake commented 10 months ago

@MostAwesomeDude would you consider integrating this into your work?

from pathlib import Path
from pyarrow import feather
from pyarrow import json

def convert_jsonl_to_feather(jsonl_path: Path, feather_path: Path):
    table = json.read_json(jsonl_path)
    feather.write_feather(table, feather_path, compression='zstd')

convert_jsonl_to_feather('00_val_realnews.jsonl', '00_val_realnews.feather')

The purpose of this is to read a .jsonl file (or .parquet) and jam it into oracle coud as feather:?

This is ultimately what is needed. If you want to merge your PR, and then iterate iwth a new PR to achieve this objective, WFM!

MostAwesomeDude commented 10 months ago

Yeah, I'll do that tomorrow.

artificialwisdomai / origin

US: Integrate HuggingFace's Datasets library with Oracle Cloud Storage #108