bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License

Add support for AwkwardArray structures #450

Open danielballan opened 1 year ago

danielballan commented 1 year ago

@tacaswell and I have wanted to add support for AwkwardArrays from the start, but I think we do not have a GH Issue for it yet.

Notes from chat with @jpivarski a couple weeks ago...

Requirements:

Proposed Approach:

Code snippet:

import numpy as np
import awkward as ak

# The array we want to talk about.
array = ak.Array(
    [[{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}], [], [{"x": 3.3, "y": [1, 2, 3]}]]
)

# On the server, you separately store form, length, and all the named buffers.
form, length, container = ak.to_buffers(array)

# When a client wants a lazy (slice-only) object, send them the form+length
# and keep a type-tracer array in your TiledObject's metadata.
meta_step1 = ak.Array(
    form.length_zero_array(highlevel=False).to_typetracer(forget_length=True)
)

# Second type-tracer array will tell us the set of buffers that the result of
# the slice will need, so that the client can make multiple requests.
typetracer, report = ak.typetracer.typetracer_with_report(
    form,
    forget_length=True,
)
meta_step2 = ak.Array(typetracer)

# You can test the slice on meta_step1 or meta_step2, but meta_step2 will also
# tell you which buffers of the *sliced* array you'll need.
try:
    meta_step2[0, "y", 1:].layout._touch_data(recursive=True)
except Exception:
    print("Nope, you can't do it!")
else:
    print("Yes, you can.")

# This is a list of nodes (prefixes of form_keys) in the *sliced* array.
print(report.data_touched)
form_keys_touched = set(report.data_touched)

# Having decided that a slice is okay, serialize it and send it to the server.
# Maybe send one HTTP request per node/expected form_key, but maybe not.

# On the server, get an array to slice. We only want to read the parts that
# will survive after slicing. Do it by making a meta_step2, slice, and look
# at the report.

# Let's assume at this point that we have a report with the nodes that are touched.

# Project the form onto a smaller form that doesn't have record fields that won't
# survive the slice.

def project_form(form):
    if isinstance(form, ak.forms.RecordForm):
        if form.fields is None:
            original_fields = [None] * len(form.contents)
        else:
            original_fields = form.fields

        fields = []
        contents = []
        for field, content in zip(original_fields, form.contents):
            projected = project_form(content)
            if projected is not None:
                fields.append(field)
                # Keep the projected child so nested records are pruned too.
                contents.append(projected)

        if form.fields is None:
            fields = None

        return form.copy(fields=fields, contents=contents)

    elif isinstance(form, ak.forms.UnionForm):
        raise NotImplementedError

    elif isinstance(form, (ak.forms.NumpyForm, ak.forms.EmptyForm)):
        if form.form_key in form_keys_touched:
            return form.copy()
        else:
            return None

    else:
        if form.form_key in form_keys_touched:
            return form.copy(content=project_form(form.content))
        else:
            return None

projected_form = project_form(form)

print(form)
print(projected_form)

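# The full container is passed through unchanged here; ak.from_buffers only
# looks up the buffers named by the projected form.
projected_container = container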

projected_array = ak.from_buffers(projected_form, length, projected_container)

print(repr(projected_array))

print(repr(projected_array[0, "y", 1:]))

# Send that!
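
The snippet above hands the full `container` to `ak.from_buffers`, so it does not show which buffers would actually need to be read or sent. Here is a minimal sketch of that selection step, assuming the default `ak.to_buffers` key pattern `"{form_key}-{attribute}"` (for example `node0-offsets`). The helper `select_buffers` and the toy keys and byte strings are hypothetical, not part of awkward or tiled.

```python
def select_buffers(container, form_keys_touched):
    """Keep only the buffers whose form_key was touched by the slice.

    Assumes keys look like "<form_key>-<attribute>", the default pattern
    used by ak.to_buffers. Hypothetical helper, for illustration only.
    """
    touched = set(form_keys_touched)
    selected = {}
    for buffer_key, buffer in container.items():
        # Compare the whole form_key rather than using startswith(), so
        # that "node1" does not accidentally match "node10-data".
        form_key, _, _attribute = buffer_key.rpartition("-")
        if form_key in touched:
            selected[buffer_key] = buffer
    return selected


# Toy stand-in for the container returned by ak.to_buffers; the keys and
# contents are made up for illustration.
toy_container = {
    "node0-offsets": b"\x00\x01",
    "node2-data": b"\x02\x03",
    "node3-offsets": b"\x04\x05",
    "node4-data": b"\x06\x07",
    "node10-data": b"\x08\x09",
}
toy_touched = {"node0", "node1", "node3", "node4"}

projected_buffers = select_buffers(toy_container, toy_touched)
print(sorted(projected_buffers))
```

With the real objects this would be called as `select_buffers(container, form_keys_touched)`, and the result should be usable in place of the full container when calling `ak.from_buffers` with the projected form, provided the projection and the report agree on which nodes survive.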
danielballan commented 1 year ago

Notes from discussion today:

danielballan commented 1 year ago

Initial support for this has been released in v0.1.0a107, but there are some interesting ideas above that I want to address or capture in separate GH issues before closing this.