datonic / datadex

📦 Serverless and local-first Open Data Platform
http://datadex.datonic.io

Implement glob patterns on IPFS #2

Closed by davidgasquez 1 year ago

davidgasquez commented 2 years ago

For large datasets stored as multiple Parquet/CSV files, it would be much better to support glob patterns than to write multiple UNION ALL statements.
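A rough sketch of the difference with DuckDB (the file paths here are hypothetical):

import duckdb

# Without glob support, every file has to be listed and unioned by hand.
duckdb.sql("""
    SELECT * FROM read_parquet('data/part-0.parquet')
    UNION ALL
    SELECT * FROM read_parquet('data/part-1.parquet')
    UNION ALL
    SELECT * FROM read_parquet('data/part-2.parquet')
""")

# With glob support, a single pattern covers all the files.
duckdb.sql("SELECT * FROM read_parquet('data/*.parquet')")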

davidgasquez commented 2 years ago

Glob patterns could perhaps be supported through an S3-compatible interface to IPFS.
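A minimal sketch of that idea, assuming a hypothetical S3-compatible gateway in front of an IPFS node at ipfs-gateway.example.com (the bucket name is also made up; the S3 settings come from DuckDB's httpfs extension):

import duckdb

con = duckdb.connect()

# DuckDB's S3 support lives in the httpfs extension.
con.sql("INSTALL httpfs; LOAD httpfs;")

# Point DuckDB at a hypothetical S3-compatible gateway backed by IPFS.
con.sql("SET s3_endpoint='ipfs-gateway.example.com';")
con.sql("SET s3_url_style='path';")

# Glob across the files the gateway exposes as a "bucket".
con.sql("SELECT count(*) FROM read_parquet('s3://my-dataset/*.parquet')").show()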

davidgasquez commented 1 year ago

Another alternative is to mount IPFS as a local filesystem directory and use that. Kubo can do it, and a few other projects might help here too.
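A minimal sketch, assuming a running Kubo daemon with "ipfs mount" exposing content under /ipfs (the CID placeholder and *.csv layout are made up):

import duckdb

# Once IPFS is FUSE-mounted (e.g. via Kubo's "ipfs mount"), a CID is just
# a local directory, so DuckDB's normal glob handling applies.
duckdb.sql("SELECT * FROM read_csv_auto('/ipfs/<CID>/*.csv')").show()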

davidgasquez commented 1 year ago

In theory, it should be possible to use the fsspec IPFS implementation (ipfsspec) to initialize a PyArrow dataset. In practice, it fails. :sweat_smile:

import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler
import ipfsspec
import duckdb

# Wrap the fsspec IPFS filesystem so PyArrow can use it.
fs = ipfsspec.IPFSFileSystem()
pa_fs = PyFileSystem(FSSpecHandler(fs))

# Connection to query the Arrow dataset once it loads.
con = duckdb.connect()

# Partition columns encoded in the file names.
sc = pa.schema([("year", pa.int16()), ("month", pa.int16()), ("day", pa.int16())])

# Full schema of the files, including the partition columns.
data_schema = pa.schema(
    [
        ("height", pa.int64()),
        ("miner_id", pa.string()),
        ("sector_id", pa.string()),
        ("state_root", pa.string()),
        ("event", pa.string()),
        ("year", pa.int16()),
        ("month", pa.int16()),
        ("day", pa.int16()),
    ]
)

part = ds.partitioning(schema=sc, flavor="filename")

# This is the call that fails in practice.
dataset = ds.dataset(
    "bafybeib5yuwr3hmbhw73gizhnsl5pje3cvdogwbrtvivyg53odhsabtdwe",
    schema=data_schema,
    filesystem=pa_fs,
    format="csv",
    partitioning=part,
)

davidgasquez commented 1 year ago

It works when using https://github.com/AlgoveraAI/ipfspy!