Closed: benrutter closed this issue 2 months ago
I can't reproduce what you're running into, unfortunately. I'm not sure how we can get there: it would require an empty directory, but that is caught beforehand. Filters can't be specified; we're protected against that case. I think we need a bit more information to understand how we end up there.
@phofl - that's super interesting. I'm not that surprised it's hard to reproduce, as it's been a hard-to-track-down bug for me, and so far I've only been able to recreate it very specifically in an Azure Function, which is a really unusual environment for dask anyway.
It seems like a classic type mismatch error where something odd is probably happening at a different point, but it just surfaces there. I'll try to dig around and figure out how it's getting thrown - the code that threw the error trace I included was on dask 2024.3, if that makes a difference.
I think it might be caused by asking for the length of an empty dataframe returned by read_parquet; I'll do what I can to put together a minimal reproducible example.
Can you share the read_parquet invocation and what kwargs you are setting? That would be a good start.
Yup, the read_parquet invocation looks like this, with no special kwargs other than storage_options. I've taken out the real paths, plus some extra code that interprets the list, which in the case throwing the error winds up being a list of just one path (although, weirdly again, I'm only seeing the error in an Azure Function environment and not elsewhere).
df = dd.read_parquet(["abfs://somecontainer/folder/*.parquet"], storage_options=storage_options)
this_bit_will_crash = len(df)
(worth noting that in the crashing instance I found, the folder was in fact empty which may or may not be relevant)
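As a user-side stopgap for the empty-folder case, one option is to check whether the folder actually contains parquet files before calling read_parquet at all. Below is a minimal local sketch using pathlib; the helper name is made up, and for remote abfs:// paths the same idea would use an fsspec filesystem's glob instead:

```python
from pathlib import Path

def folder_has_parquet(folder: str) -> bool:
    # Hypothetical guard: True only if the folder contains at least one
    # .parquet file, so the empty-folder code path is never reached.
    return any(Path(folder).glob("*.parquet"))
```

With a guard like this, the pipeline step that "checks whether some data exists" can skip the read entirely rather than calling len() on whatever read_parquet returns for an empty folder.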
Can you share how big the parquet file is, or do you not know exactly how big it is?
I don't think there's any parquet file at all, just an empty folder in the case throwing the error. It's a step of a pipeline that checks whether some data exists and then runs some tests on it if it does. In this case, it's just an empty dataframe, I think, which might be what's causing the problem (although that might be a red herring).
(I say empty dataframe, because that's the usual return of dd.read_parquet("some/not/matching/globstring/*.parquet") rather than it being a parquet with 0 data or something like that)
I also don't know if the empty data itself is causing the issue (I've found the bug in resources running at my work, so I'm slightly restricted on the tests I can run on it - sorry, I know that's a bit annoying!)
Yeah, found a reproducer: your folder is empty. Normally this would raise earlier (that's why I couldn't reproduce it another way), but globs in lists don't do that (for some reason).
Ah that's amazing - nice one!!
I'm asking this just for my own learning, but how come _collect_pq_statistics ran when it normally wouldn't? Is it because trying to read an empty folder with something like dd.read_parquet("empty/folder") normally throws an error (except when passed a list), so _collect_pq_statistics isn't expecting that as a possibility?
Yeah, normally this raises way before you even get close to _collect_pq_statistics.
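For anyone following along, the asymmetry phofl describes can be sketched like this. This is only an illustration of the control flow, not the actual dask-expr source; resolve_paths and expand are hypothetical names:

```python
def resolve_paths(path_or_list, expand):
    # Illustrative only: a single glob string is expanded and validated
    # eagerly, raising if nothing matches...
    if isinstance(path_or_list, str):
        files = expand(path_or_list)
        if not files:
            raise ValueError("No files satisfy the provided path")
        return files
    # ...but a list of paths skips the emptiness check, so an empty
    # result slips through to later stages such as statistics collection.
    return [f for p in path_or_list for f in expand(p)]
```

With an expand function that matches nothing, the string form raises immediately while the list form silently returns an empty list, which matches the behaviour in the reproducer.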
Describe the issue:
I'm having a little trouble actually recreating this error, and initially thought it was related to azure functions: https://github.com/dask/dask/issues/11037.
I understand pretty well now the type of situations it'll happen in, so I'll just explain the cause of the bug.
Essentially, when the self._plan variable of ReadParquetFSSpec is empty, it sets its internal _io_func property to the identity function rather than a ParquetFunctionWrapper. This causes issues later when attempting to collect parquet statistics:
The referenced line will throw an error if _io_func is actually the identity function because, unlike the ParquetFunctionWrapper class, it doesn't have an "fs" attribute.
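To make the failure mode concrete, here is a minimal stand-alone sketch. The class and functions below are simplified stand-ins for illustration, not the real dask-expr objects:

```python
class ParquetFunctionWrapper:
    # Stand-in for the real wrapper: a callable that also carries a
    # filesystem object on .fs, which statistics collection relies on.
    def __init__(self, fs):
        self.fs = fs

    def __call__(self, part):
        return part

def identity(x):
    # What _io_func falls back to when the plan is empty.
    return x

def collect_statistics(io_func):
    # Simplified version of the failing access: assumes io_func
    # always exposes an .fs attribute.
    return io_func.fs

collect_statistics(ParquetFunctionWrapper(fs="<filesystem>"))  # fine
try:
    collect_statistics(identity)  # empty-plan path
except AttributeError as err:
    print(type(err).__name__, err)
```

A plain function has no fs attribute, so the statistics path raises AttributeError as soon as it touches io_func.fs, producing a stack trace like the one in the issue.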
That leads to an error with a stack trace like this:
Environment: