fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License
304 stars 78 forks

inline_threshold not encoding time value? #468

Open rsignell opened 3 months ago

rsignell commented 3 months ago

In the example below I was expecting that time would get encoded because inline_threshold=400 and time is only 8 bytes long.

Below we see that depth is encoded (it's 332 bytes long), but time is not.

Is this expected behavior?

image

martindurant commented 3 months ago

It seems inline is only called for normal, non-record arrays. Another thing to fix! Obviously, not too much use of netCDF3 has been seen.

rsignell commented 3 months ago

Well, it's not a high priority for me -- if truth be told, I was really just trying to figure out how to inject a known time value into that reference.

martindurant commented 3 months ago

You can replace the value with a binary, if you want. But also, there already is a function that does exactly this process for any reference set, so it just needs to be invoked.

kmsampson commented 3 months ago

> You can replace the value with a binary, if you want. But also, there already is a function that does exactly this process for any reference set, so it just needs to be invoked.

Can you point to the function that can inject a known time value or how to replace the value with a binary?

martindurant commented 3 months ago

This may be fixed in #466, if you would care to try.

@kmsampson, the spec says:

> the str format of a reference value may be:
> - a string starting “base64:”, which will be decoded to binary
> - any other string, interpreted as ascii data

so set the key's value in the JSON accordingly. If the references are still in memory, you can also directly assign the binary you want the key to have. If you have already made a filesystem, you could also use the filesystem interface: fs.pipe("time/0", b"\x00\x00..."). This modification can be written out again with fs.save_json, or with a .flush on the parquet/lazy storage, if you are using that. Too many options?
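For anyone following along, here is a minimal sketch of the "base64:" option from the spec quote, using only the standard library. The reference dict, the file name `example.nc`, the offset, and the time value are all made up for illustration, and the big-endian float64 layout is an assumption about how the 8-byte time value is stored; check the dtype in your own `.zarray` before copying this.

```python
import base64
import json
import struct

# Assumption: time is a single big-endian float64 (8 bytes).
time_value = 12345.0  # hypothetical value to inject
data_bytes = struct.pack(">d", time_value)

# Hypothetical minimal reference set; a real one would come from kerchunk.
d = {"version": 1, "refs": {"time/0": ["example.nc", 1024, 8]}}

# Replace the [url, offset, length] reference with an inline base64 string.
d["refs"]["time/0"] = "base64:" + base64.b64encode(data_bytes).decode("ascii")

# The result is plain JSON and can be written to disk as usual.
text = json.dumps(d)

# Round trip: a reader strips the prefix and decodes back to the raw bytes.
stored = json.loads(text)["refs"]["time/0"]
decoded = base64.b64decode(stored[len("base64:"):])
```

The prefix matters: per the spec text above, a string without "base64:" is interpreted as ascii data, so arbitrary binary has to go through the base64 form.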

rsignell commented 3 months ago

Yes, this is fixed in #466: image

I still can't figure out how to assign a specific value though: image (perhaps I should ask this in discussions?)

martindurant commented 3 months ago

I would do

d["refs"]["time/0"] = data_bytes

and kerchunk.utils._encode_for_JSON or consolidate can do the encoding for you.

You can also make a filesystem and interact with it

fs = fsspec.filesystem("reference", fo=d, ...)
fs.pipe("time/0", data_bytes)
fs.save_json(filename)  # or grab fs.references

rsignell commented 3 months ago

I'm feeling kind of dumb here, but I still don't get it: image

martindurant commented 3 months ago

Oh sorry, the function works on the inner reference dict, d["refs"]
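To make the "inner reference dict" point concrete, here is a rough sketch of the assign-then-encode workflow. `encode_refs_for_json` is a hypothetical, simplified stand-in written for this example, not kerchunk's own `_encode_for_JSON`; the reference dict, file name, and time value are also invented, and the float64 layout is an assumption.

```python
import base64
import json
import struct


def encode_refs_for_json(refs):
    """Hypothetical stand-in: return a copy of refs with bytes made JSON-safe."""
    out = {}
    for key, value in refs.items():
        if isinstance(value, bytes):
            try:
                # Plain ascii bytes can be stored directly as a string.
                value = value.decode("ascii")
            except UnicodeDecodeError:
                # Anything else is stored with the "base64:" prefix.
                value = "base64:" + base64.b64encode(value).decode("ascii")
        out[key] = value
    return out


# Hypothetical reference set with the usual outer/inner layout.
d = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "time/0": ["example.nc", 1024, 8],
    },
}

# Assign the raw bytes directly (assumed big-endian float64).
d["refs"]["time/0"] = struct.pack(">d", 12345.0)

# Encode the INNER dict, d["refs"], not the outer d.
d["refs"] = encode_refs_for_json(d["refs"])
text = json.dumps(d)
```

Passing the outer `d` to the helper would leave the bytes untouched, because the only values it would see are the version number and the nested dict; `json.dumps` would then fail on the raw bytes.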