Closed sdake closed 10 months ago
I think I could put together a short Python script to serve as a prototype `wisdomctl`. Just to be clear, do you want JSON Lines locally and Arrow in Cloud Storage? Or JSON Lines both locally and in Cloud Storage?
@MostAwesomeDude Let's make `.jsonl` (a row-based data format) work well, initially.
Here is where I started: `.jsonl`. After gaining a baseline understanding by working through dataset workflows, I read the Datasets source code and found:

To obtain the realnews dataset, I filled out a Google form, and a link was then sent to me containing a compressed `.jsonl` file.
All that to say, I was learning the key requirements of managing one dataset.
The Arrow (versus Parquet) questions are structural. I don't hold a strong opinion on what the lower layers of the dataset are implemented against, at least not yet.
My primary goal was to use this dataset. The next goal will be to use the dataset efficiently.
Over the next few days, I plan to convert my tooling to work with datasets instead of raw `.jsonl` files. To do this, I plan to build a dataset as described here:
https://huggingface.co/docs/datasets/about_dataset_load#building-a-dataset
Once this dataset is built, the dataset, whether it be Parquet or Arrow, will be stored in Oracle Cloud.
Efficiently accessing it is what this user story is about. That is, it's not good to download a 120 GB file and then create a bunch of files for the split, versus train, versus all of the shards.
All this stated, I had found a dataset library written in Rust about 6 weeks ago. I can't relocate it at this time, but I'd really like our dataset library written in Rust (because Python's garbage collection on string data is super inefficient).
Again though, let's not cross ideas in different user stories. The purpose of this user story is to determine if Datasets can be streamed from OCI storage easily using `fsspec`.
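To make the question concrete: every `fsspec` backend (including `ocifs` for Oracle Cloud) exposes the same `AbstractFileSystem` interface, so whatever works against one backend should work against OCI. A minimal self-contained sketch, using the built-in `memory` implementation as a stand-in for `oci://` (the bucket path here is made up):

```python
import fsspec

# Any fsspec backend exposes the same AbstractFileSystem interface; the
# built-in "memory" implementation stands in here for ocifs's "oci://".
fs = fsspec.filesystem("memory")

# Write a tiny JSON Lines object into the (fake) bucket.
with fs.open("/bucket/data.jsonl", "w") as f:
    f.write('{"text": "hello"}\n')

# Read it back through the same generic interface.
print(fs.cat("/bucket/data.jsonl").decode())
```

With `ocifs` installed, swapping `fsspec.filesystem("memory")` for the OCI filesystem (and a real bucket path) is the whole change, which is what makes this worth evaluating.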
That is to say, @MostAwesomeDude, take a look at this reference. When I wrote the user story, I found this, but misplaced it among my 300 Chrome windows.
https://huggingface.co/docs/datasets/filesystems#oracle-cloud-object-storage
thanks -steve
And also this. https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=AbstractFileSystem#built-in-implementations.
Can you edit the user story and add these two references? If you would rather I do so, please let me know. Football is a 100 yard field.
Robert mentioned `dask` as an option: https://huggingface.co/docs/datasets/filesystems#dask
@MostAwesomeDude fwiw, we will be adopting Arrow and Feather. Here is the filesystem interface:
https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow
@MostAwesomeDude would you consider integrating this into your work?
```python
from pathlib import Path

from pyarrow import feather
from pyarrow import json


def convert_jsonl_to_feather(jsonl_path: Path, feather_path: Path):
    # Read the newline-delimited JSON into an Arrow table, then write it
    # back out as Feather with zstd compression.
    table = json.read_json(jsonl_path)
    feather.write_feather(table, feather_path, compression='zstd')


convert_jsonl_to_feather('00_val_realnews.jsonl', '00_val_realnews.feather')
```
The purpose of this is to read a `.jsonl` file (or `.parquet`) and jam it into Oracle Cloud as Feather.
This is ultimately what is needed. If you want to merge your PR, and then iterate with a new PR to achieve this objective, WFM!
Yeah, I'll do that tomorrow.
As data scientists or AI practitioners, we need to use and share datasets consistently without a network connection to the HuggingFace Hub, so that we can build new capabilities.
Acceptance Criteria:

- `*.jsonl` supported.
- `wisdomctl` `SendToSource()` and `GetFromSource()`.
- `GetFromSource()` reads a dataset from Oracle Cloud Object Storage.
- `SendToSource()` sends a dataset to Oracle Cloud Storage.

References:

- `fsspec` integration.

Whereas:

- `fsspec`.
- `fsspec` implementation of their Oracle Cloud Storage.

Therefore: The shortest path to an implementation that can be evaluated involves the listed components for their listed responsibilities.

Next:

- `ReplicateHuggingFaceDataset()` API.
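The two operations in the acceptance criteria could be sketched as thin `fsspec` wrappers. The snake_case names, signatures, and URL scheme below are assumptions for illustration, not a final design; with `ocifs` installed the remote URL would be `oci://bucket@namespace/...`, while `memory://` works for a local demo:

```python
import fsspec

# Hypothetical shapes for the two wisdomctl operations named in the
# acceptance criteria; names and signatures are assumptions.

def send_to_source(local_path: str, remote_url: str) -> None:
    """Upload a local dataset file to object storage via fsspec."""
    with open(local_path, "rb") as src, fsspec.open(remote_url, "wb") as dst:
        dst.write(src.read())


def get_from_source(remote_url: str, local_path: str) -> None:
    """Download a dataset file from object storage via fsspec."""
    with fsspec.open(remote_url, "rb") as src, open(local_path, "wb") as dst:
        dst.write(src.read())
```

Because both functions only touch the `fsspec` interface, evaluating them against OCI is a matter of credentials and a URL, which is exactly the question this user story poses.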