Open asfimport opened 4 years ago
David McGuire: Repro test case attached, since I don't understand the formatting here.
David McGuire: @jorisvandenbossche FYI.
Joris Van den Bossche / @jorisvandenbossche: Thanks for the reproducer!
Just to be sure, can you also try with `pq.ParquetDataset(....., use_legacy_dataset=False)`? (You might need to remove the `metadata_nthreads` argument, as configuring the multithreading is not yet supported there.)
David McGuire: If it's not multi-threaded, then there won't be a deadlock:

```python
# This completes
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False)
print(len(dataset.pieces))

# This also completes
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False)
print(len(dataset.pieces))
```

Running:

```shell
$ python repro.py
6
6
```
David McGuire: @jorisvandenbossche best I can tell, this impacts all Hive-style partitioned datasets, across S3 and local storage, going back as many versions as I was able to test (until 0.10.0, with the dependency on Cython). This makes me think that supporting this use case was never very important, and might not be in the future. Can you help inform my understanding of the relative priority of fixing this defect?

Also, if the lack of traction is owing to it using the legacy implementation, should I start exploring the new implementation available starting with 1.0.0, and reporting defects against it? It seems to trip up on something pretty simple, to start (a Spark `_SUCCESS` file):

```
OSError: Could not open parquet input source 'our-bucket/some/partitioned/dataset/_SUCCESS': Invalid: Parquet file size is 0 bytes
```
Wes McKinney / @wesm: I agree it would be good for this bug to be fixed, but as this is a project formed of volunteers it's hard to say when it will be a priority for someone. For example, my organization does prioritize fixing bugs that affect our corporate sponsors, but for everything else we do our best to fix bugs when we can relative to our other priorities.
David McGuire: I'm sympathetic to the economic reality of the FOSS funding model, which is why I'm trying to read between the lines on what I can expect to work well and get high-priority support. Just like any other project, I'm not asking you to change the priority, just clue me into it. :)
Would it be better to pursue debugging the new implementation, or continue with the old, given the priorities stemming from your corporate sponsorships?
David McGuire: @wesm put more succinctly, where can I find and report defects in a place that might be more in line with the team's existing priorities?
Joris Van den Bossche / @jorisvandenbossche: [~dmcguire] yes, this is indeed less of a priority for us because it happens in legacy, soon-to-be-deprecated code (and not only from a corporate-sponsorship standpoint, but also from the general project standpoint). Which doesn't mean it wouldn't be nice to fix, to be clear, since `ParquetDataset` is still used a lot.

> should I start exploring the new implementation available starting with 1.0.0, and reporting defects against it?

Yes, I think that is best. Certainly if you are putting considerable effort into experimenting with and using pyarrow for parquet datasets, it is best to target the new implementation.

> Seems like it trips up on something pretty simple, to start (a Spark _SUCCESS file):

It should be able to handle that (at least I thought we had tests for it), so if it is not working, you are very welcome to open a JIRA for it.
David McGuire: @jorisvandenbossche great information, thank you! I'll follow up on the other apparent defect, and hopefully, I'm just doing something wrong. If so, that might be a path forward to pursue with the Petastorm maintainer.
Joris Van den Bossche / @jorisvandenbossche: (I was actually planning to open an issue on the Petastorm issue tracker to bring up that their usage of `ParquetDataset` is going to be deprecated in the future, after seeing the usage in the issues you opened, but I haven't gotten to it yet.)
David McGuire: @martindurant good news (for you): I have a repro test case that is 100% pyarrow, so it looks like `s3fs` is not involved. @jorisvandenbossche how should I follow up on this, based on `pyarrow.filesystem.LocalFileSystem`?

Viewing the file system directories as a tree, one thread is required for every non-leaf node, in order to avoid deadlock:

```
1) dataset
2) dataset/foo=1
3) dataset/foo=1/bar=2
4) dataset/foo=1/bar=2/baz=0
5) dataset/foo=1/bar=2/baz=1
6) dataset/foo=1/bar=2/baz=2
*) dataset/foo=1/bar=2/baz=0/qux=false
*) dataset/foo=1/bar=2/baz=1/qux=false
*) dataset/foo=1/bar=2/baz=1/qux=true
*) dataset/foo=1/bar=2/baz=0/qux=true
*) dataset/foo=1/bar=2/baz=2/qux=false
*) dataset/foo=1/bar=2/baz=2/qux=true
```

The call with 6 threads completes; the call with 5 threads hangs indefinitely.
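(The thread-per-non-leaf-node behavior can be sketched with a plain `ThreadPoolExecutor` — a hypothetical model of the legacy discovery pattern, not the actual pyarrow code: each non-leaf directory task submits one task per subdirectory and then blocks waiting on those futures, pinning one pool thread for the lifetime of its subtree. Once every worker is blocked, the queued leaf tasks can never run.)

```python
import concurrent.futures as cf

# Hypothetical stand-in for the partition layout above. Dicts are
# directories; an empty dict is a leaf directory holding only files.
TREE = {"foo=1": {"bar=2": {
    "baz=0": {"qux=false": {}, "qux=true": {}},
    "baz=1": {"qux=false": {}, "qux=true": {}},
    "baz=2": {"qux=false": {}, "qux=true": {}},
}}}

def visit(node, pool, visited):
    """Model of a recursive listing: submit one task per subdirectory,
    then block until all of them finish."""
    visited.append(1)  # one directory visited
    futures = [pool.submit(visit, child, pool, visited)
               for child in node.values()]
    if futures:
        # The worker running this task sleeps here holding its thread,
        # so every non-leaf directory pins one pool thread.
        cf.wait(futures)
        for f in futures:
            f.result()  # propagate any errors

def scan(n_threads):
    visited = []
    with cf.ThreadPoolExecutor(n_threads) as pool:
        visit(TREE, pool, visited)  # the root runs on the caller's thread
    return len(visited)

print(scan(6))  # completes: 12 directories visited
# scan(5) would hang: five workers blocked in cf.wait() while the six
# queued leaf (qux=*) tasks never get a thread to run on.
```

In this model the blocked chain is `foo=1`, `bar=2`, and the three `baz=*` tasks (five threads), and at least one more thread must be free to drain the leaf tasks — which matches the observed boundary between 5 and 6 threads.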
**NOTE:** this also happens with the undecorated `LocalFileSystem`, and when omitting the `filesystem` argument.

Reporter: David McGuire
Externally tracked issue: https://github.com/dask/s3fs/issues/365
Note: This issue was originally created as ARROW-10029. Please see the migration documentation for further details.