dask-contrib / dask-awkward

Native Dask collection for awkward arrays, and the library to use it.
https://dask-awkward.readthedocs.io

dask-awkward array loaded by `dak.from_parquet` and field-sliced is not populated (has PlaceholderArrays) #501

Open jpivarski opened 7 months ago

jpivarski commented 7 months ago

This ZIP contains a ROOT file and a Parquet file.

dak-issue-501.zip

If we open it with `uproot.dask`, extract one field, and compute it, we get what we expect:

>>> import uproot
>>> result = uproot.dask("dak-issue-501.root")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>
        [     0      9     16 ... 189827 189839 189853]
    </Index></offsets>
    <content><NumpyArray dtype='float32' len='189853'>
        [131514.36  126449.445 112335.195 ...  12131.118  13738.865  10924.1  ]
    </NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
array([     0,      9,     16, ..., 189827, 189839, 189853])
>>> result.layout.content.data
array([131514.36 , 126449.445, 112335.195, ...,  12131.118,  13738.865,
        10924.1  ], dtype=float32)

But if we open it with `dak.from_parquet`, extract one field, and compute it, the result is not populated; its offsets and content buffers are `PlaceholderArray`s:

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>[## ... ##]</Index></offsets>
    <content><NumpyArray dtype='float32' len='##'>[## ... ##]</NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x7a45532925c0>
>>> result.layout.content.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x7a4553292260>

And of course, that causes trouble downstream.

What happened here? This is almost the simplest case of column optimization that one could have.
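
One way to narrow this down is to ask the optimizer which columns it thinks it needs. A sketch, assuming the installed dask-awkward exposes the `report_necessary_columns` inspection helper (recent releases do):

>>> import dask_awkward as dak
>>> lazy = dak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"]
>>> columns = dak.report_necessary_columns(lazy)  # mapping of input layer -> columns selected by the optimizer

If "AnalysisJetsAuxDyn.pt" is missing from the reported set, the column projection dropped it before any bytes were read, which would leave only `PlaceholderArray` buffers behind.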

martindurant commented 7 months ago

Since #491 is in progress, can we test with that code, rather than trying to fix code that's about to disappear?

jpivarski commented 7 months ago

Good idea. I'll check it on that branch.

jpivarski commented 7 months ago

On that branch, the `uproot.dask` case is now broken (I assume something needs to be updated in Uproot),

>>> import uproot
>>> result = uproot.dask("dak-issue-501.root")["AnalysisJetsAuxDyn.pt"].compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 277, in dask
    return _get_dak_array(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1560, in _get_dak_array
    return dask_awkward.from_map(
  File "/tmp/dask-awkward/src/dask_awkward/lib/io/io.py", line 630, in from_map
    form_with_unique_keys(io_func.form, "@"),
AttributeError: '_UprootRead' object has no attribute 'form'

and the `dak.from_parquet` case is unchanged,

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>[## ... ##]</Index></offsets>
    <content><NumpyArray dtype='float32' len='##'>[## ... ##]</NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8eaa0>
>>> result.layout.content.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8ead0>

martindurant commented 7 months ago

I think this must be because the name of that one field contains a "." character, which is also used to indicate nesting.
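
A minimal illustration of the collision (plain awkward, no I/O): a field whose name literally contains a dot and a genuinely nested field produce the same dot-delimited path.

>>> import awkward as ak
>>> flat = ak.Array([{"a.b": 1.0}])         # one field literally named "a.b"
>>> nested = ak.Array([{"a": {"b": 1.0}}])  # field "b" nested inside record "a"
>>> flat.fields
['a.b']
>>> nested.fields
['a']

A path string "a.b" cannot distinguish the two, so any dot-delimited column specifier is ambiguous for the first layout.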

jpivarski commented 7 months ago

That's right, it is:

dak-issue-501-nodot.zip

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501-nodot.parquet")["AnalysisJetsAuxDyn_pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>
        [     0      9     16 ... 189827 189839 189853]
    </Index></offsets>
    <content><NumpyArray dtype='float32' len='189853'>
        [131514.36  126449.445 112335.195 ...  12131.118  13738.865  10924.1  ]
    </NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
array([     0,      9,     16, ..., 189827, 189839, 189853])
>>> result.layout.content.data
array([131514.36 , 126449.445, 112335.195, ...,  12131.118,  13738.865,
        10924.1  ], dtype=float32)

But we should expect field names to be able to contain any characters, right? When I search for "parquet column names dot", I see many examples of people doing this in Spark, which uses backticks to avoid interpreting the dot as nesting.

Handling column names with dots in them (by requiring such columns to have backticks) might need to be implemented in `ak.forms.Form.select_columns`. Would anything be needed in dask-awkward?
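
A quick way to test that hypothesis directly against `select_columns` (a sketch; assuming it accepts a list of dot-delimited specifiers, which is how dask-awkward's column optimization uses it):

>>> import awkward as ak
>>> form = ak.Array([{"AnalysisJetsAuxDyn.pt": [1.0, 2.0]}]).layout.form
>>> form.select_columns(["AnalysisJetsAuxDyn.pt"])  # does the dotted name survive projection?

If `select_columns` splits the specifier on ".", it looks for a field "pt" nested inside a record "AnalysisJetsAuxDyn", finds nothing, and projects the field away, which would be consistent with the `PlaceholderArray` buffers above.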

jpivarski commented 7 months ago

Actually, how does the dot cause column optimization to fail? What assumption is being broken?

If scikit-hep/awkward#3088 is fixed, what else would be needed?

martindurant commented 7 months ago

I haven't spotted a place where we assume dots are special, but I suspect that Parquet might have this behavior built in (or not).
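
One way to check (a sketch, with a hypothetical file path): write a flat column whose name contains a dot with pyarrow and read it back by that literal name.

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> table = pa.table({"AnalysisJetsAuxDyn.pt": [[1.0, 2.0], [3.0]]})
>>> pq.write_table(table, "dotted.parquet")  # hypothetical path
>>> pq.read_table("dotted.parquet", columns=["AnalysisJetsAuxDyn.pt"]).column_names

If the column round-trips, pyarrow itself treats the dot as part of the name, and the splitting would have to be happening on our side.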