@tacaswell and I have wanted to add support for AwkwardArrays from the start, but I think we do not have a GH Issue for it yet.
Notes from chat with @jpivarski a couple weeks ago...
Requirements:
We would like to support upload and download of AwkwardArray structures.
In the Python client we would like the option to access the data with or without dask-awkward.
In normal Tiled fashion, we would like to be able to download a specific slice of interest, and we would like the Tiled server to read, serialize, and transmit only that slice.
Proposed Approach:
We considered using Arrow to transport AwkwardArrays between client and server. However, representing Awkward in Arrow blurs out detailed form information. Specifically, it loses the form_key that can be used to address specific buffers.
Instead, we will operate directly on AwkwardArray's own representation, which comprises a JSON-encodable form, an outer length (an integer), and a dict-like container whose keys are referenced in the form and whose values are buffers.
By reusing the typetracer machinery in awkward (which was developed to support dask-awkward) we can project a slice into a form and get a "projected form". The example below illustrates this, and uses only one piece of internal awkward API (_touch_data). This could conceivably be made into a public method.
Multiple buffers may be encoded in a container format like TAR or ZIP (not necessarily compressed, just used as a container). @jakirkham pointed out an advantage of ZIP: web browsers understand it.
Code snippet:
import numpy as np
import awkward as ak
# The array we want to talk about.
array = ak.Array(
[[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 3.3, "y": [1, 2, 3]}]]
)
# On the server, you separately store form, length, and all the named buffers.
form, length, container = ak.to_buffers(array)
# When a client wants a lazy (slice-only) object, send them the form+length
# and keep a type-tracer array in your TiledObject's metadata.
meta_step1 = ak.Array(
form.length_zero_array(highlevel=False).to_typetracer(forget_length=True)
)
# Second type-tracer array will tell us the set of buffers that the result of
# the slice will need, so that the client can make multiple requests.
typetracer, report = ak.typetracer.typetracer_with_report(
form,
forget_length=True,
)
meta_step2 = ak.Array(typetracer)
# You can test the slice on meta_step1 or meta_step2, but meta_step2 will also
# tell you which buffers of the *sliced* array you'll need.
try:
    meta_step2[0, "y", 1:].layout._touch_data(recursive=True)
except Exception:
    print("Nope, you can't do it!")
else:
    print("Yes, you can.")
# This is a list of nodes (prefixes of form_keys) in the *sliced* array.
print(report.data_touched)
form_keys_touched = set(report.data_touched)
# Having decided that a slice is okay, serialize it and send it to the server.
# Maybe send one HTTP request per node/expected form_key, but maybe not.
# On the server, get an array to slice. We only want to read the parts that
# will survive after slicing. Do it by making a meta_step2, slice, and look
# at the report.
# Let's assume at this point that we have a report with the nodes that are touched.
# Project the form onto a smaller form that doesn't have record fields that won't
# survive the slice.
def project_form(form):
    if isinstance(form, ak.forms.RecordForm):
        # Tuple-like records have no field names; track positions with None.
        if form.fields is None:
            original_fields = [None] * len(form.contents)
        else:
            original_fields = form.fields
        fields = []
        contents = []
        for field, content in zip(original_fields, form.contents):
            projected = project_form(content)
            if projected is not None:
                fields.append(field)
                # Keep the projected content so nested records are pruned too.
                contents.append(projected)
        if form.fields is None:
            fields = None
        return form.copy(fields=fields, contents=contents)
    elif isinstance(form, ak.forms.UnionForm):
        raise NotImplementedError
    elif isinstance(form, (ak.forms.NumpyForm, ak.forms.EmptyForm)):
        # Leaf nodes survive only if the slice touched them.
        if form.form_key in form_keys_touched:
            return form.copy()
        else:
            return None
    else:
        # List-like and option-like nodes wrap a single content.
        if form.form_key in form_keys_touched:
            return form.copy(content=project_form(form.content))
        else:
            return None
projected_form = project_form(form)
print(form)
print(projected_form)
projected_container = container
projected_array = ak.from_buffers(projected_form, length, projected_container)
print(repr(projected_array))
print(repr(projected_array[0, "y", 1:]))
# Send that!
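The snippet above hands the full `container` to `ak.from_buffers`. A minimal sketch of the remaining step, assuming buffer keys follow the default `{form_key}-{attribute}` naming used by `ak.to_buffers`: keep only the buffers whose form key was touched, then pack them into an uncompressed ZIP (the container format mentioned in the proposed approach) for transport. The buffer contents, the touched set, and the helper names `select_buffers`, `pack_zip`, and `unpack_zip` are hypothetical stand-ins so that the example runs without awkward installed.

```python
import io
import zipfile


def select_buffers(container, form_keys_touched):
    """Keep only buffers whose form key (the text before the last '-') was touched."""
    return {
        key: buf
        for key, buf in container.items()
        if key.rsplit("-", 1)[0] in form_keys_touched
    }


def pack_zip(buffers):
    """Pack named buffers into an in-memory ZIP archive, stored without compression."""
    stream = io.BytesIO()
    with zipfile.ZipFile(stream, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for key, buf in buffers.items():
            zf.writestr(key, bytes(buf))
    return stream.getvalue()


def unpack_zip(payload):
    """Read every member of a ZIP payload back into a dict of bytes."""
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


# Stand-in for the container returned by ak.to_buffers (placeholder bytes).
container = {
    "node0-offsets": b"\x00" * 32,
    "node2-data": b"\x01" * 24,
    "node3-offsets": b"\x02" * 32,
    "node4-data": b"\x03" * 48,
}
# Hypothetical report: every node except the one holding the "x" field.
form_keys_touched = {"node0", "node3", "node4"}

projected_container = select_buffers(container, form_keys_touched)
payload = pack_zip(projected_container)
received = unpack_zip(payload)
```

Whether the server sends one archive or one HTTP response per buffer is still open; the archive form just makes the "multiple buffers in one container" idea concrete.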
When Awkward arrays are uploaded via HTTP, a good storage format is directory-of-buffers, where the filename is the form key. This enables a future enhancement where buffers can be added (and removed and updated) without copying all the unchanged buffers. More standard formats like Parquet would not enable this.
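A rough sketch of the directory-of-buffers idea, assuming one file per buffer with the buffer key as the filename. The helpers `write_buffers` and `read_buffers` and the placeholder bytes are hypothetical, not Tiled API. It shows that adding a buffer later touches only the new file, and that a reader can open just the files it needs.

```python
import tempfile
from pathlib import Path


def write_buffers(directory, buffers):
    """Write each buffer to its own file, named after its key."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for key, buf in buffers.items():
        (directory / key).write_bytes(bytes(buf))


def read_buffers(directory, keys=None):
    """Read all buffers, or only the requested keys, back into a dict."""
    directory = Path(directory)
    if keys is None:
        paths = directory.iterdir()
    else:
        paths = (directory / key for key in keys)
    return {path.name: path.read_bytes() for path in paths}


root = Path(tempfile.mkdtemp())

# Initial upload.
write_buffers(root, {"node0-offsets": b"\x00" * 32, "node4-data": b"\x03" * 48})

# Later enhancement: add one buffer without rewriting the existing files.
write_buffers(root, {"node5-data": b"\x04" * 8})

stored = read_buffers(root)
subset = read_buffers(root, keys=["node4-data"])
```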
The form itself, and the length, will be in the tiled "structure" in the database.
Structures are often repeated. A run of many ROOT files may have an identical structure. There could be benefit in the future to storing this in a separate table, with a foreign key. It matters more for awkward than for array because the form JSON can be comparatively large.
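One way the separate-table idea could look, sketched with `sqlite3`. The table names, column names, and `register_node` are hypothetical and do not reflect Tiled's actual schema. The sketch assumes the form JSON is shared across entries while the length stays per entry, and it keys each structure by a hash of its canonical JSON so that identical forms are stored once.

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE structures (
        id TEXT PRIMARY KEY,   -- hash of the canonical form JSON
        form TEXT NOT NULL
    );
    CREATE TABLE nodes (
        key TEXT PRIMARY KEY,
        length INTEGER NOT NULL,
        structure_id TEXT NOT NULL REFERENCES structures(id)
    );
    """
)


def register_node(conn, key, form, length):
    """Store a node, reusing an existing structure row when the form matches."""
    form_json = json.dumps(form, sort_keys=True)
    structure_id = hashlib.sha256(form_json.encode()).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO structures (id, form) VALUES (?, ?)",
        (structure_id, form_json),
    )
    conn.execute(
        "INSERT INTO nodes (key, length, structure_id) VALUES (?, ?, ?)",
        (key, length, structure_id),
    )
    return structure_id


# Placeholder form; a real one would come from the uploaded array.
form = {
    "class": "ListOffsetArray",
    "offsets": "i64",
    "form_key": "node0",
    "content": {"class": "NumpyArray", "primitive": "float64", "form_key": "node1"},
}

# Three files with identical structure but different lengths.
ids = [
    register_node(conn, f"run/file{i}", form, length)
    for i, length in enumerate([1000, 1200, 900])
]

n_structures = conn.execute("SELECT COUNT(*) FROM structures").fetchone()[0]
n_nodes = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
```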
Serving a directory of existing ROOT files is a sensible thing to try. But, to start, the whole ROOT file will have to be marshaled from disk. Grabbing selected columns (or form keys...) would require detailed knowledge of ROOT. This is a kerchunk-like optimization. Note that for PB-scale ROOT files the offsets themselves (the table of contents, so to speak) are TB-scale. JSON encoding is not the way.
This is well begun and released in v0.1.0a107, but there are some interesting ideas above I want to address or capture in separate GH issues before closing this.