apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Deadlock in the interaction of pyarrow FileSystem and ParquetDataset #26051

Open asfimport opened 4 years ago

asfimport commented 4 years ago

@martindurant good news (for you): I have a repro test case that is 100% pyarrow, so it looks like s3fs is not involved.

@jorisvandenbossche how should I follow up with this, based on pyarrow.filesystem.LocalFileSystem?

Viewing the file system directories as a tree, one thread is required for every non-leaf node in order to avoid deadlock (see the sketch after the tree below):

1) dataset
2) dataset/foo=1
3) dataset/foo=1/bar=2
4) dataset/foo=1/bar=2/baz=0
5) dataset/foo=1/bar=2/baz=1
6) dataset/foo=1/bar=2/baz=2
*) dataset/foo=1/bar=2/baz=0/qux=false
*) dataset/foo=1/bar=2/baz=1/qux=false
*) dataset/foo=1/bar=2/baz=1/qux=true
*) dataset/foo=1/bar=2/baz=0/qux=true
*) dataset/foo=1/bar=2/baz=2/qux=false
*) dataset/foo=1/bar=2/baz=2/qux=true
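
To make the threading claim concrete, here is a minimal, self-contained sketch (not pyarrow's actual code) of the suspected pattern: each non-leaf task submits its child tasks to the same pool and then blocks on their results, so a pool with too few workers can have every worker stuck waiting while the queued children never run.


from concurrent.futures import ThreadPoolExecutor

# Toy directory tree: "root" is the only non-leaf node.
TREE = {
    "root": ["root/a", "root/b"],
    "root/a": [],
    "root/b": [],
}

def visit(pool, path):
    # Submit one task per child directory, then block on all of them.
    # While blocked, this worker cannot pick up the queued child tasks.
    futures = [pool.submit(visit, pool, child) for child in TREE[path]]
    return [f.result() for f in futures]

# Completes with 2 workers; with max_workers=1 the lone worker blocks
# in f.result() while root/a and root/b sit in the queue -> deadlock.
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(visit, pool, "root").result()
print("done")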


import pyarrow.parquet as pq
import pyarrow.filesystem as fs

# Subclass that logs every directory visited, to show where traversal stalls.
class LoggingLocalFileSystem(fs.LocalFileSystem):
    def walk(self, path):
        print(path)
        return super().walk(path)

fs = LoggingLocalFileSystem()
dataset_url = "dataset"

# With one thread per non-leaf directory (6), discovery completes.
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))

# With one thread fewer (5), discovery hangs indefinitely.
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, validate_schema=False, metadata_nthreads=threads)
print(len(dataset.pieces))
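
For completeness, a hypothetical snippet (not part of the original report) that writes a matching Hive-partitioned "dataset" directory; the column names and values are inferred from the tree above, and any writer producing the same layout would do:


import pyarrow as pa
import pyarrow.parquet as pq

# One row per leaf directory; string "false"/"true" values keep the
# qux= directory names identical to the tree listed earlier.
table = pa.table({
    "foo": [1] * 6,
    "bar": [2] * 6,
    "baz": [0, 0, 1, 1, 2, 2],
    "qux": ["false", "true"] * 3,
    "value": list(range(6)),
})
pq.write_to_dataset(table, "dataset", partition_cols=["foo", "bar", "baz", "qux"])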

Call with 6 threads completes.

Call with 5 threads hangs indefinitely.


$ python repro.py 
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
dataset/foo=1/bar=2/baz=0/qux=false
dataset/foo=1/bar=2/baz=0/qux=true
dataset/foo=1/bar=2/baz=1/qux=false
dataset/foo=1/bar=2/baz=1/qux=true
dataset/foo=1/bar=2/baz=2/qux=false
dataset/foo=1/bar=2/baz=2/qux=true
6
dataset
dataset/foo=1
dataset/foo=1/bar=2
dataset/foo=1/bar=2/baz=0
dataset/foo=1/bar=2/baz=1
dataset/foo=1/bar=2/baz=2
^C
...
KeyboardInterrupt
^C
...
KeyboardInterrupt

*NOTE:* This also happens with the undecorated LocalFileSystem, and when omitting the filesystem argument entirely.

Reporter: David McGuire


Note: This issue was originally created as ARROW-10029. Please see the migration documentation for further details.

asfimport commented 4 years ago

David McGuire: Repro test case attached, since I don't understand the formatting here.

asfimport commented 4 years ago

David McGuire: @jorisvandenbossche FYI.

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: Thanks for the reproducer!

Just to be sure, can you try with pq.ParquetDataset(....., use_legacy_dataset=False) as well? (You might need to remove the metadata_nthreads argument, as configuring the multithreading there is not yet supported.)

asfimport commented 4 years ago

David McGuire: If it's not multi-threaded, then there won't be a deadlock:


# This completes (metadata_nthreads dropped; the threads value is now unused)
threads = 6
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False)
print(len(dataset.pieces))

# This also completes
threads = 5
dataset = pq.ParquetDataset(dataset_url, filesystem=fs, use_legacy_dataset=False)
print(len(dataset.pieces))

Running:


$ python repro.py 
6
6

asfimport commented 4 years ago

David McGuire: @jorisvandenbossche as best I can tell, this impacts all Hive-style partitioned datasets, on both S3 and local storage, in every version I was able to test (back to 0.10.0, where the Cython dependency stopped me). This makes me think that supporting this use case was never very important, and might not be in the future. Can you help inform my understanding of the relative priority of fixing this defect?

Also, if the lack of traction is owing to this being the legacy implementation, should I start exploring the new implementation available starting with 1.0.0, and reporting defects against it? It seems to trip up on something pretty simple to start with (a Spark _SUCCESS file):


OSError: Could not open parquet input source 'our-bucket/some/partitioned/dataset/_SUCCESS': Invalid: Parquet file size is 0 bytes
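
For what it's worth, a possible workaround sketch with the new API: recent pyarrow releases expose an ignore_prefixes option on pyarrow.dataset.dataset(), which defaults to ['.', '_'] and should skip Spark marker files like _SUCCESS during discovery. The path below is the hypothetical one from the error message; on S3 you would also pass a filesystem.


import pyarrow.dataset as ds

# Discovery skips any file whose basename starts with "." or "_",
# so _SUCCESS should be ignored rather than parsed as Parquet.
dataset = ds.dataset(
    "our-bucket/some/partitioned/dataset",  # hypothetical path from the error
    format="parquet",
    partitioning="hive",
    ignore_prefixes=[".", "_"],
)
print(dataset.files)
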
asfimport commented 4 years ago

Wes McKinney / @wesm: I agree it would be good for this bug to be fixed, but as this is a project made up of volunteers, it's hard to say when it will be a priority for someone. For example, my organization does prioritize fixing bugs that affect our corporate sponsors, but for everything else we do our best to fix bugs when we can relative to our other priorities.

asfimport commented 4 years ago

David McGuire: I'm sympathetic to the economic reality of the FOSS funding model, which is why I'm trying to read between the lines on what I can expect to work well and get high-priority support. Just like with any other project, I'm not asking you to change the priority, just to clue me into it. :)

Would it be better to pursue debugging the new implementation, or continue with the old, given the priorities stemming from your corporate sponsorships?

asfimport commented 4 years ago

David McGuire: @wesm put more succinctly: where can I find and report defects in a place that might be more in line with the team's existing priorities?

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: [~dmcguire] yes, this is indeed less of a priority for us because it happens in legacy, soon-to-be-deprecated code (and not only from a corporate sponsorship standpoint, but also from the general project standpoint). Which doesn't mean it wouldn't be nice to fix it, to be clear, since ParquetDataset is still used a lot.

should I start exploring the new implementation available starting with 1.0.0, and reporting defects against it?

Yes, I think that is best. Certainly, if you're putting considerable effort into experimenting with and using pyarrow for Parquet datasets, that effort is best targeted at the new implementation.

Seems like it trips up on something pretty simple, to start (a Spark _SUCCESS file):

It should be able to handle that (at least I thought we had tests for it), so if it is not working, you're very welcome to open a JIRA for it.

asfimport commented 4 years ago

David McGuire: @jorisvandenbossche great information, thank you! I'll follow up on the other apparent defect, and hopefully, I'm just doing something wrong. If so, that might be a path forward to pursue with the Petastorm maintainer.

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: (I was actually planning to open an issue on the Petastorm issue tracker to bring up that their usage of ParquetDataset is going to be deprecated in the future (after seeing the usage in the issues you opened), but I haven't gotten to it yet.)