fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Unable to append to already existing Kerchunk parquet store #487

Closed sreesanjeevkg closed 2 months ago

sreesanjeevkg commented 2 months ago

I attempted to append to an existing Parquet store, but encountered a KeyError with the following message: Cloud_Retrieval_Fraction_16_Liquid/.zarray.

I have included the code below for reference.

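from itertools import islice
from pathlib import Path

import numpy as np
import tqdm
from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.combine import MultiZarrToZarr
# fnmeta is imported elsewhere; its import is not shown in this post

jsons_path = Path("/data/ssanjeev/temp/kerchunk_https_CLDPROP_D3_VIIRS_SNPP_2020-02-01_2020-03-01")
jsons_list = sorted(jsons_path.glob("*.json"))
print(f"found {len(jsons_list)} json files")
# open the existing Parquet reference store
ref = LazyReferenceMapper("/data/ssanjeev/MultiZarrtoZarr/CLDPROP_D3_VIIRS_SNPP_2020-01-01_2020-02-01.parq")
# group the JSON reference files into batches of 10
batches = []
for i in range(0, len(jsons_list), 10):
    batch = list(islice(jsons_list, i, i + 10))
    batches.append([i, batch])
# append each batch to the existing store along the time dimension
for i, batch in tqdm.tqdm(batches):
    print(i)
    print("append")
    MultiZarrToZarr.append(
        original_refs=ref,
        path=batch,
        coo_map={"time": lambda index, fs, var, fn: fnmeta.identify(fn)["begin_time"]},
        coo_dtypes={"time": np.dtype("M8[s]")},
        concat_dims=["time"],
    ).translate()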

Screenshot 2024-07-31 at 2 18 56 PM
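
The batching step in the snippet above can be checked on its own, independent of kerchunk. Below is a minimal sketch, assuming hypothetical file names (`ref_000.json` and so on) and a hypothetical helper `make_batches`; it only shows how the `[start_index, chunk]` pairs are formed, not anything about the append itself.

```python
from itertools import islice


def make_batches(items, size):
    """Split a list into [start_index, chunk] pairs of at most `size` items."""
    return [
        [i, list(islice(items, i, i + size))]
        for i in range(0, len(items), size)
    ]


# Hypothetical stand-ins for the sorted JSON reference files
names = [f"ref_{n:03d}.json" for n in range(25)]
batches = make_batches(names, 10)

for start, chunk in batches:
    # each batch starts at a multiple of 10; the last one may be shorter
    print(start, len(chunk))
```

For a plain list, `items[i:i + size]` gives the same chunks as `islice` here.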

However, loading the same Kerchunk store with open_dataset works as expected.

ds = xr.open_dataset("/data/ssanjeev/MultiZarrtoZarr/CLDPROP_D3_VIIRS_SNPP_2020-01-01_2020-02-01.parq", engine="kerchunk", chunks="auto")

Screenshot 2024-07-31 at 2 19 41 PM

@martindurant

martindurant commented 2 months ago

What exact version of fsspec do you have? I feel like I fixed something similar recently.

sreesanjeevkg commented 2 months ago

I am using the latest version of fsspec; I installed it again from git.

The fix you made earlier was for xr.open_dataset(), which is working fine; screenshot attached below for reference.

Screenshot 2024-07-31 at 3 58 34 PM

However, the append functionality still throws the error:

jsons_path = Path("/data/ssanjeev/temp/kerchunk_https_CLDPROP_D3_VIIRS_SNPP_2020-02-01_2020-03-01")
jsons_list = sorted(jsons_path.glob("*.json"))

# fs = fsspec.filesystem(
#             "reference",
#             fo="/data/ssanjeev/MultiZarrtoZarr/CLDPROP_D3_VIIRS_SNPP_2020-01-01_2020-02-01.parq"
#         )
parq = "/data/ssanjeev/MultiZarrtoZarr/CLDPROP_D3_VIIRS_SNPP_2020-01-01_2020-02-01.parq"
batches = []
for i in range(0, len(jsons_list), 10):
    batch = list(islice(jsons_list, i, i + 10))
    batches.append([i, batch])
for i, batch in tqdm.tqdm(batches):
    print(i)
    print("append")
    MultiZarrToZarr.append(
            original_refs=parq, # fs.get_mapper()
            path=batch,
            coo_map={"time": lambda index, fs, var, fn: fnmeta.identify(fn)["begin_time"]},
            coo_dtypes={"time": np.dtype("M8[s]")},
            concat_dims=["time"],
        ).translate()

Screenshot 2024-07-31 at 4 02 20 PM

Maybe it has something to do with metadata storage? When I first create the store with

out = LazyReferenceMapper.create(root=str(self.parquetPath), fs=self.fs_local)

and then keep appending using the same out, append() works fine. But when I need to append to an already existing Parquet store, I get this error.

I also created the Parquet store directly, without creating JSON first.

self.parquetPath = f"/data/ssanjeev/MultiZarrtoZarr/{prod}_{parqStartDate}_{parqEndDate}.parq"
self.out = LazyReferenceMapper.create(root=str(self.parquetPath), fs=self.fs_local)

MultiZarrToZarr(
            batch,
            coo_map={"time": lambda index, fs, var, fn: fnmeta.identify(fn)["begin_time"]},
            coo_dtypes={"time": np.dtype("M8[s]")},
            concat_dims=["time"],
            out=self.out,
        ).translate()
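
One way to test the metadata-storage hypothesis is to list which arrays in a reference set have chunk keys but no `<array>/.zarray` entry, since that is the key named in the KeyError. The sketch below is an assumption-laden illustration: `arrays_missing_metadata` is a hypothetical helper, and it is shown only on a plain dict of made-up references. Whether a Parquet-backed mapper exposes its keys the same way is not verified here.

```python
def arrays_missing_metadata(refs):
    """Return array paths that have chunk keys but no '<path>/.zarray' key."""
    paths = set()
    for key in refs:
        parent, _, leaf = key.rpartition("/")
        # skip top-level keys and metadata keys such as .zarray / .zattrs
        if parent and not leaf.startswith("."):
            paths.add(parent)
    return sorted(p for p in paths if f"{p}/.zarray" not in refs)


# Made-up reference set: one array is missing its .zarray entry
refs = {
    ".zgroup": "{}",
    "time/.zarray": "{}",
    "time/0": "chunk",
    "Cloud_Retrieval_Fraction_16_Liquid/0.0.0": "chunk",
}

print(arrays_missing_metadata(refs))
```

An empty result on the real store would suggest the metadata is present and the problem lies in how append() looks it up, rather than in what was written.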

sreesanjeevkg commented 2 months ago

@martindurant Any updates?

martindurant commented 2 months ago

I haven't had a chance to look yet. Any chance of a standalone reproducer (no data file dependency)?

sreesanjeevkg commented 2 months ago

Let me check if I can put together a Google Colab notebook.

These are the netCDF4 files: https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/CLDPROP_D3_VIIRS_SNPP--5111/2024-06-21..2024-07-05/DB/World

sreesanjeevkg commented 2 months ago

KerchunkErrorAppend.md

@martindurant I have included a notebook for reproducing the error. I downloaded the first two files from https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/CLDPROP_D3_VIIRS_SNPP--5111/2024-06-21..2024-07-05/DB/World

I created the Parquet store from the first netCDF4 file and accessed it (this worked fine), but I am unable to append the next JSON to it.