mrocklin opened this issue 1 year ago
Interesting - Should definitely not trigger computation. I'll look into it.
Quick note that the problem is unrelated to ReadParquet. We don't support grouping on Series yet, so df.groupby(df.hvfhs_license_num) is technically not allowed. Trying to figure out why we end up hanging instead of raising an error here.
If you watch the dashboard there are definitely read_parquet tasks running
If you watch the dashboard there are definitely read_parquet tasks running
Sure, but I think that's more about Groupby doing something unexpected. None of that happens if you just do:
df["tipped"] = df.tips != 0
df.groupby('hvfhs_license_num').tipped.mean()
Ah, I see. Well then, there's this:
%%prun
df["tipped"] = df.tips != 0
df.groupby("hvfhs_license_num").tipped.mean()
This takes two seconds:
212142 function calls (209178 primitive calls) in 4.493 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
220 2.150 0.010 2.150 0.010 {method 'acquire' of '_thread.lock' objects}
8 1.076 0.135 4.266 0.533 arrow.py:958(_collect_dataset_info)
8 1.004 0.126 1.008 0.126 {method 'finish' of 'pyarrow._dataset.DatasetFactory' objects}
53237/52782 0.010 0.000 0.017 0.000 {built-in method builtins.isinstance}
8 0.008 0.001 0.011 0.001 {method 'empty_table' of 'pyarrow.lib.Schema' objects}
5760 0.006 0.000 0.007 0.000 {method 'normalize_path' of 'pyarrow._fs.FileSystem' objects}
306 0.005 0.000 0.005 0.000 {method 'reduce' of 'numpy.ufunc' objects}
616 0.004 0.000 0.005 0.000 base.py:229(construct_from_string)
Right, I see. I suppose it's probably time to start caching the dataset_info result outside the ReadParquet object itself. We currently need to extract basic metadata information every time a new ReadParquet expression is created.
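For illustration, a minimal sketch of what such a cache could look like: a module-level dict keyed on the tokenized read_parquet arguments, so that every ReadParquet variation over the same dataset reuses a single metadata-collection result. The names here (_DATASET_INFO_CACHE, cached_dataset_info, collect) are hypothetical and not the actual dask-expr API:

from dask.base import tokenize

# Hypothetical module-level cache; every expression variation with the
# same path/options shares one metadata-collection result.
_DATASET_INFO_CACHE = {}

def cached_dataset_info(collect, path, **kwargs):
    # `collect` stands in for the expensive routine (the
    # _collect_dataset_info call dominating the profile above).
    key = tokenize(path, kwargs)
    if key not in _DATASET_INFO_CACHE:
        _DATASET_INFO_CACHE[key] = collect(path, **kwargs)
    return _DATASET_INFO_CACHE[key]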
I just did a quick experiment to confirm that this change takes that cell down to ~20ms. However, there are other groupby-related issues that produce bad results (although the .agg equivalent seems to be fine...).
Note that https://github.com/mrocklin/dask-expr/pull/132 provides one possible solution to the caching problem. Will need to address the groupby issue separately.
I think the ReadParquet pain reported here is now resolved. However, I suppose we may want to consider allowing more than one object to be cached in the case that the objects originate from ReadParquet expressions reading from different paths. For example, #132 will be less effective if we are reading two different parquet datasets and joining them.
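One way to allow that, sketched here only under the assumption that a single cached slot is the limitation, would be a small LRU-style cache so that metadata for each dataset involved in a join stays cached at once. The class name and maxsize are illustrative:

from collections import OrderedDict
from dask.base import tokenize

class DatasetInfoCache:
    # Tiny LRU cache so metadata for several distinct parquet datasets
    # (e.g. two paths being joined) can stay cached at the same time.
    def __init__(self, maxsize=8):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, collect, path, **kwargs):
        key = tokenize(path, kwargs)
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        value = collect(path, **kwargs)
        self._data[key] = value
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
        return value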
I'm running this:
import dask.dataframe as dd
import dask_expr as dx
df = dx.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)
df = df[df.base_passenger_fare > 0]
result = df.base_passenger_fare.mean()
result.compute()
And it's stuck in a loop calling gather-pq-parts over and over again. I'm not sure why it would need to call this function at all.
Oh no! I’m OOO, but I’ll try to look into this soon.
Turning off predicate pushdown solves the issue:
diff --git a/dask_expr/io/parquet.py b/dask_expr/io/parquet.py
index e9c4370..7b1e5bf 100644
--- a/dask_expr/io/parquet.py
+++ b/dask_expr/io/parquet.py
@@ -115,6 +115,7 @@ class ReadParquet(PartitionsFiltered, BlockwiseIO):
         if isinstance(parent, Filter) and isinstance(
             parent.predicate, (LE, GE, LT, GT, EQ, NE, And, Or)
         ):
+            return
             # Predicate pushdown
             filters = _DNF.extract_pq_filters(self, parent.predicate)
             if filters:
And it's stuck in a loop calling gather-pq-parts over and over again. I'm not sure why it would need to call this function at all.
I think the column projection and predicate pushdown are just producing several different variations of the ReadParquet expression. For every new variation (different columns and/or filters), we are calling npartitions somewhere, which is triggering a fresh metadata-collection call (since npartitions is calculated from _divisions).
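A toy illustration of that chain, using illustrative class and function names rather than the actual dask-expr code: each optimizer pass (projection, predicate pushdown) builds a new expression object, and if divisions are derived from fresh metadata every time, the expensive collection runs once per variant.

METADATA_CALLS = 0

def _collect_metadata(path):
    global METADATA_CALLS
    METADATA_CALLS += 1  # stands in for the pyarrow dataset scan
    return {"divisions": (None,) * 9}

class ReadParquetVariant:
    def __init__(self, path, columns=None, filters=None):
        self.path, self.columns, self.filters = path, columns, filters

    def _divisions(self):
        return _collect_metadata(self.path)["divisions"]

    @property
    def npartitions(self):
        return len(self._divisions()) - 1

variants = [
    ReadParquetVariant("s3://bucket/data/"),
    ReadParquetVariant("s3://bucket/data/", columns=["base_passenger_fare"]),
    ReadParquetVariant("s3://bucket/data/", filters=[("base_passenger_fare", ">", 0)]),
]
for v in variants:
    v.npartitions
print(METADATA_CALLS)  # 3 -- one metadata scan per expression variant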
About to look into this further...
This is one of my standard examples, and it's quite sad today.
It looks like this second cell triggers computation when it should be instantaneous. I suspect the parquet lengths stuff. cc @rjzamora
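For context, a rough reconstruction of the kind of two-cell example being described, pieced together from snippets quoted earlier in this thread (the dataset path and column names come from those snippets, not from the original notebook):

import dask_expr as dx

# First cell: build the lazy collection
df = dx.read_parquet("s3://coiled-datasets/uber-lyft-tlc/")

# Second cell: this should only build an expression (near-instant),
# but it appears to trigger computation instead
df["tipped"] = df.tips != 0
df.groupby(df.hvfhs_license_num).tipped.mean()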