Closed sreesanjeevkg closed 2 months ago
What exact version of fsspec do you have? I feel like I fixed something similar recently.
I am using the latest version of fsspec; I reinstalled it from git.
The fix you did earlier was for xr.open_dataset(), which is working fine (screenshot attached below for reference), but the append functionality is throwing the error.
```python
from itertools import islice
from pathlib import Path

import numpy as np
import tqdm
from kerchunk.combine import MultiZarrToZarr

jsons_path = Path("/data/ssanjeev/temp/kerchunk_https_CLDPROP_D3_VIIRS_SNPP_2020-02-01_2020-03-01")
jsons_list = sorted(jsons_path.glob("*.json"))

# fs = fsspec.filesystem(
#     "reference",
#     fo="/data/ssanjeev/MultiZarrtoZarr/CLDPROP_D3_VIIRS_SNPP_2020-01-01_2020-02-01.parq"
# )
parq = "/data/ssanjeev/MultiZarrtoZarr/CLDPROP_D3_VIIRS_SNPP_2020-01-01_2020-02-01.parq"

# Split the JSON list into batches of 10
batches = []
for i in range(0, len(jsons_list), 10):
    batch = list(islice(jsons_list, i, i + 10))
    batches.append([i, batch])

for i, batch in tqdm.tqdm(batches):
    print(i)
    print("append")
    MultiZarrToZarr.append(
        original_refs=parq,  # fs.get_mapper()
        path=batch,
        # fnmeta is my own filename-metadata helper (not shown)
        coo_map={"time": lambda index, fs, var, fn: fnmeta.identify(fn)["begin_time"]},
        coo_dtypes={"time": np.dtype("M8[s]")},
        concat_dims=["time"],
    ).translate()
```
Maybe something to do with metadata storage? Because when I use

```python
out = LazyReferenceMapper.create(root=str(self.parquetPath), fs=self.fs_local)
```

for the first time and then keep appending with the same `out`, append() works fine; but when I need to append to an already-existing parquet, I run into this error.
And I created the Parquet directly, without creating JSONs first.
```python
self.parquetPath = f"/data/ssanjeev/MultiZarrtoZarr/{prod}_{parqStartDate}_{parqEndDate}.parq"
self.out = LazyReferenceMapper.create(root=str(self.parquetPath), fs=self.fs_local)

MultiZarrToZarr(
    batch,
    coo_map={"time": lambda index, fs, var, fn: fnmeta.identify(fn)["begin_time"]},
    coo_dtypes={"time": np.dtype("M8[s]")},
    concat_dims=["time"],
    out=self.out,
).translate()
```
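As an aside, the `coo_map`/`coo_dtypes` pair above maps each input file to its begin_time and coerces that value into a `datetime64[s]` coordinate. A minimal sketch of just that coercion, with placeholder timestamp strings (`fnmeta.identify` is the user's own helper, so the values here are assumptions):

```python
import numpy as np

# Placeholder for the begin_time values fnmeta.identify() would return
begin_times = ["2020-02-01T00:00:00", "2020-02-02T00:00:00"]

# Coerce to the dtype requested via coo_dtypes={"time": np.dtype("M8[s]")}
time_coord = np.array(begin_times, dtype=np.dtype("M8[s]"))

print(time_coord.dtype)                   # datetime64[s]
print(time_coord[1] - time_coord[0])      # 86400 seconds
```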
@martindurant Any updates?
I haven't had a chance to look yet. Any chance of a standalone reproducer (no data file dependency)?
Let me check if I can put together a Google Colab notebook.
These are the NetCDF4 files: https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/CLDPROP_D3_VIIRS_SNPP--5111/2024-06-21..2024-07-05/DB/World
@martindurant I have included a notebook for reproducing the error. I downloaded the first two files from https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/CLDPROP_D3_VIIRS_SNPP--5111/2024-06-21..2024-07-05/DB/World,
created a parquet store from the first NetCDF4 file and accessed it (worked fine), but I am unable to append the next JSON to it.
I attempted to append to an existing Parquet store, but encountered a KeyError with the following message: Cloud_Retrieval_Fraction_16_Liquid/.zarray.
I have included the code below for reference.
However, when I load the kerchunk references using open_dataset, it works fine, as expected.
@martindurant