Open jpivarski opened 7 months ago
Since #491 is in progress, can we test with that code, rather than trying to fix code that's about to disappear?
Good idea. I'll check it on that git-branch.
On that git-branch, the uproot.dask case is now broken (I assume something needs to be updated in Uproot):
>>> import uproot
>>> result = uproot.dask("dak-issue-501.root")["AnalysisJetsAuxDyn.pt"].compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 277, in dask
    return _get_dak_array(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1560, in _get_dak_array
    return dask_awkward.from_map(
  File "/tmp/dask-awkward/src/dask_awkward/lib/io/io.py", line 630, in from_map
    form_with_unique_keys(io_func.form, "@"),
AttributeError: '_UprootRead' object has no attribute 'form'
and the dak.from_parquet case is unchanged:
>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
<offsets><Index dtype='int64' len='20001'>[## ... ##]</Index></offsets>
<content><NumpyArray dtype='float32' len='##'>[## ... ##]</NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8eaa0>
>>> result.layout.content.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8ead0>
I think this must be because of the name of the one field containing a "." character, which is also used to indicate nesting.
That's right, it is:
>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501-nodot.parquet")["AnalysisJetsAuxDyn_pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
<offsets><Index dtype='int64' len='20001'>
[ 0 9 16 ... 189827 189839 189853]
</Index></offsets>
<content><NumpyArray dtype='float32' len='189853'>
[131514.36 126449.445 112335.195 ... 12131.118 13738.865 10924.1 ]
</NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
array([ 0, 9, 16, ..., 189827, 189839, 189853])
>>> result.layout.content.data
array([131514.36 , 126449.445, 112335.195, ..., 12131.118, 13738.865,
10924.1 ], dtype=float32)
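The suspected ambiguity can be sketched without any Parquet file. The helper below is hypothetical (dotted_columns is not part of Awkward Array or dask-awkward); it only illustrates how a flat field whose name contains a dot and a genuinely nested field collapse to the same string once nesting is spelled with dots. It is not a claim about where the column optimization actually goes wrong.

```python
def dotted_columns(schema, prefix=""):
    """Flatten a nested dict schema into dotted column names.

    Hypothetical helper for illustration only; `schema` maps field names
    to either a type string (leaf) or another dict (nested record).
    """
    names = []
    for field, sub in schema.items():
        full = f"{prefix}.{field}" if prefix else field
        if isinstance(sub, dict):
            # Nested record: recurse, joining levels with "."
            names.extend(dotted_columns(sub, full))
        else:
            names.append(full)
    return names


# One flat field whose *name* contains a dot (as in dak-issue-501.parquet)...
flat = {"AnalysisJetsAuxDyn.pt": "var * float32"}
# ...versus a record "AnalysisJetsAuxDyn" with a nested field "pt".
nested = {"AnalysisJetsAuxDyn": {"pt": "var * float32"}}

flat_names = dotted_columns(flat)
nested_names = dotted_columns(nested)
print(flat_names, nested_names)
```

Both schemas yield the single column name AnalysisJetsAuxDyn.pt, so any code that receives only the dotted string cannot tell which layout was meant.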
But we should expect that the field names can contain any characters, right? When I look up "parquet column names dot", I see a lot of instances of people doing this on Spark, which uses backticks to avoid interpreting dot as nesting.
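For reference, the backtick convention mentioned above could look roughly like this. It is a minimal sketch of a Spark-style rule (dots inside backticks are literal, dots outside separate nesting levels); split_column_path is a hypothetical name, not an existing function in Awkward Array, dask-awkward, or Spark, and it does not handle escaped backticks inside a quoted name.

```python
def split_column_path(path):
    """Split a column path on dots, except inside backtick-quoted parts.

    Hypothetical helper sketching a Spark-like quoting rule.
    """
    segments = []
    current = []
    quoted = False
    for ch in path:
        if ch == "`":
            # Toggle literal mode; the backtick itself is not part of the name.
            quoted = not quoted
        elif ch == "." and not quoted:
            # Unquoted dot: end of one nesting level.
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if quoted:
        raise ValueError(f"unterminated backtick in {path!r}")
    segments.append("".join(current))
    return segments


print(split_column_path("AnalysisJetsAuxDyn.pt"))    # two nesting levels
print(split_column_path("`AnalysisJetsAuxDyn.pt`"))  # one field with a dot in its name
```

Under such a rule, existing unquoted column strings would keep their current meaning, and only names that really contain a dot would need quoting.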
Handling column names with dots in them (by requiring such columns to have backticks) might need to be implemented in ak.forms.Form.select_columns. Would anything be needed in dask-awkward?
Actually, how does the dot cause column optimization to fail? What assumption is being broken?
If scikit-hep/awkward#3088 is fixed, what else would be needed?
I haven't spotted a place where we assume dots to be special, but I suspect that parquet might have this built in (or not).
This ZIP contains a ROOT file and a Parquet file.
dak-issue-501.zip
If we open it with uproot.dask, extract one field, and compute it, we get what we expect.
But if we open it with dak.from_parquet, extract one field, and compute it, the field is populated with a PlaceholderArray.
And of course, that would cause trouble downstream.
What happened here? This is almost the simplest case of column optimization that one could have.
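Until the cause is understood, a guard along these lines could catch the symptom before it reaches downstream code. This is a duck-typed sketch under stated assumptions: find_placeholders is a hypothetical helper, it only follows the offsets, content, and data attributes seen on the layout in this issue (other layout node types are not covered), and it recognizes placeholders by class name rather than by importing Awkward internals. The demo runs on a stand-in object, not on a real Awkward layout.

```python
from types import SimpleNamespace


def find_placeholders(node, path="layout"):
    """Return attribute paths whose buffer looks like a PlaceholderArray.

    Hypothetical helper: walks only `.offsets`, `.content`, and `.data`,
    the attributes inspected in this issue.
    """
    found = []
    data = getattr(node, "data", None)
    if data is not None and type(data).__name__ == "PlaceholderArray":
        found.append(f"{path}.data")
    for name in ("offsets", "content"):
        child = getattr(node, name, None)
        if child is not None:
            found.extend(find_placeholders(child, f"{path}.{name}"))
    return found


class PlaceholderArray:
    """Stand-in with the same class name, used only for this demo."""


# Fake layout shaped like the ListOffsetArray above: real offsets,
# but a content buffer that was never materialized.
fake_layout = SimpleNamespace(
    offsets=SimpleNamespace(data=[0, 9, 16]),
    content=SimpleNamespace(data=PlaceholderArray()),
)
missing = find_placeholders(fake_layout)
print(missing)
```

On a computed array, the analogous call would presumably be find_placeholders(result.layout), failing loudly if the returned list is non-empty.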