jpivarski opened this issue 4 months ago
Another indicator: passing `optimize_graph=False` to `compute` makes it work. It's the column optimization.
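For reference, this is the kind of check being described (a minimal sketch; `collection` here is a made-up stand-in for the failing collection, which in the real case comes from parquet):

```python
import awkward as ak
import dask_awkward as dak

# Hypothetical stand-in for the failing collection (the real one is read from parquet).
collection = dak.from_awkward(ak.Array([{"x": 1.1}, {"x": 2.2}]), npartitions=1)

# Bypasses all graph optimizations, including the column projection; if this
# succeeds where plain .compute() fails, the optimization is implicated.
result = collection.compute(optimize_graph=False)
```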
Is it expected to be possible to do `concatenate` on typetracers? It should be needed, since we need to know the columns to select from both input layers independently; we have no mechanism to carry the columns found to be needed for one layer across to the other.
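A standalone check of that question at the awkward level (small arrays made up for illustration; `to_typetracer` is the layout method that drops the data buffers):

```python
import awkward as ak

# Two concrete arrays, then their typetracer equivalents (no data buffers).
a = ak.Array([{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}])
b = ak.Array([{"x": 3.3, "y": [1, 2, 3]}])

a_tt = ak.Array(a.layout.to_typetracer(forget_length=True))
b_tt = ak.Array(b.layout.to_typetracer(forget_length=True))

# If concatenate on typetracers is supported, this produces a typetracer result
# whose type combines the inputs, without touching any real data.
c_tt = ak.concatenate([a_tt, b_tt])
print(c_tt.type)
```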
So far I have found these edges:
So indeed, the second partition is receiving `columns=[]` (nothing to load), and `unproject_layout` is turning all the missing columns into typetracers.
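One way to see what the projection decided per input layer is dask-awkward's `report_necessary_columns` (available in recent versions; the paths and field name below are hypothetical):

```python
import dask_awkward as dak

a = dak.from_parquet("files/a.parquet")   # hypothetical inputs
b = dak.from_parquet("files/b.parquet")
c = dak.concatenate([a, b])

# Maps each input layer to the columns the optimization thinks it needs; an empty
# set for the second layer would match the columns=[] observation above.
print(dak.report_necessary_columns(c.x))
```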
I can confirm that the one-pass branch successfully computes the second failing case, but _only the first time_. Subsequent computes fail. The failure mode is having no required columns passed to parquet at all. Calling `dak.core.dak_cache.clear()` causes it to pass again, so we have a good hint of where the problem is.
Given that #526 has a variant of this same problem, is it time to dust off the one-pass PR?
(I should say that a trivial, but not great, workaround for the issue here is to touch all inputs to a `concatenate`, which, somehow, is what the other linked issue ended up doing (because of `axis=1`, presumably); see the sketch below.)
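Roughly, that workaround amounts to something like this during the typetracer pass (a sketch only, assuming `ak.typetracer.touch_data`; wiring it into dask-awkward's `concatenate` is the part being hand-waved):

```python
import awkward as ak

def touch_all_inputs(arrays):
    """Mark every buffer of every typetracer input as needed, so the column
    optimization keeps all columns for each input layer of the concatenate."""
    for arr in arrays:
        if ak.backend(arr) == "typetracer":
            ak.typetracer.touch_data(arr)
    return arrays
```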
Here's a reproducer:
files.tar.gz
succeeds but
fails with
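(The real script and inputs are in the attachment above; the following is only a guess at the shape of that setup, with hypothetical paths and field name.)

```python
import dask_awkward as dak

# Hypothetical stand-ins for the attached files and the field being selected.
a = dak.from_parquet("files/a.parquet")
b = dak.from_parquet("files/b.parquet")

c = dak.concatenate([a, b])

c.x.compute(optimize_graph=False)  # works: no column optimization
c.x.compute()                      # the failing path: the second input's layer
                                   # ends up with columns=[] after projection
```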
Going into more detail, the troublemaker is `self._mask.data`, which is a `PlaceholderArray`. The rehydration must be saying that this buffer is not needed, but it is needed: the concatenation needs to know which array elements are missing.
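To make the "it is needed" point concrete, a tiny illustration (made-up values) of why the mask buffer has to be real when concatenating an option-type array:

```python
import awkward as ak
import numpy as np

# A ByteMaskedArray: the mask buffer (self._mask.data) records which elements are
# valid. Concatenation has to read it to know which elements are missing, so a
# PlaceholderArray there cannot work.
mask = ak.index.Index8(np.array([1, 0, 1], dtype=np.int8))    # 0 means missing
content = ak.contents.NumpyArray(np.array([1.1, 2.2, 3.3]))
left = ak.Array(ak.contents.ByteMaskedArray(mask, content, valid_when=True))

right = ak.Array([4.4, 5.5])
print(ak.concatenate([left, right]).to_list())  # [1.1, None, 3.3, 4.4, 5.5]
```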