commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License

Close streams when they are from files #2

Closed · lauraedelson closed this 6 years ago

lauraedelson commented 6 years ago

My circumstances dictated that I had to download CC files from AWS and then parse them on my local cluster. When I did this, I noticed that file handles weren't being closed, which quickly eats through available memory. Simple fix, though!
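(For context, the fix as described closes the stream after processing whenever the input came from a local file. A minimal sketch only; the function name and the per-record body are illustrative, not the project's exact code:)

    from warcio.archiveiterator import ArchiveIterator
    from warcio.exceptions import ArchiveLoadFailed

    def iterate_records(uri, stream):
        # Sketch only: close local file handles once the WARC is processed,
        # so a long-running job does not accumulate open handles.
        try:
            for record in ArchiveIterator(stream):
                pass  # per-record processing goes here
        except ArchiveLoadFailed as exception:
            print('Invalid WARC: ' + str(exception))
        if uri.startswith('file:'):
            stream.close()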

sebastian-nagel commented 6 years ago

Good catch! Thanks, @lauraedelson! However, why not call close() in a finally block?

            except ArchiveLoadFailed as exception:
                ...
            finally:
                stream.close()

Also note that the 'file:' prefix is stripped from uri, so the if-condition will never become true.
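(For context, the suggested pattern in full. This is a self-contained sketch; the function name and per-record body are illustrative, not cc-pyspark's exact code:)

    from warcio.archiveiterator import ArchiveIterator
    from warcio.exceptions import ArchiveLoadFailed

    def iterate_records(stream):
        # Sketch of the suggested variant: finally runs whether or not the
        # loop raises, and for every input scheme, so the handle is always
        # released, even on a load failure.
        try:
            for record in ArchiveIterator(stream):
                pass  # per-record processing goes here
        except ArchiveLoadFailed as exception:
            print('Invalid WARC: ' + str(exception))
        finally:
            stream.close()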

lauraedelson commented 6 years ago

Hi Sebastian, I moved the close() call to a finally block, as requested! Actually, the 'file:' prefix isn't stripped: I'm running this code on my branch now, and the prefix is there and the check works correctly. I did it that way because the code already recognizes file streams by the prefix, as on line 185.

sebastian-nagel commented 6 years ago

Actually, the 'file:' prefix isn't stripped.

Ok, I see: base_dir is added as a prefix to uri. But then whether the condition becomes true depends on how the program is called. I tried calling $SPARK_HOME/bin/spark-submit ./word_count.py ... and the input uri was not prefixed with file:. This may also depend on the Spark version used.

Anyway, it shouldn't do any harm to also close the temporary files for s3:// input. I'll test ed9835e next week to verify that there are no problems. Thanks, @lauraedelson!
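(For context: with s3:// input the WARC is first fetched into a local temporary file, and that handle benefits from the same finally. A hedged sketch; the boto3 usage and names below are illustrative, not necessarily cc-pyspark's exact code:)

    import boto3
    from tempfile import TemporaryFile
    from warcio.archiveiterator import ArchiveIterator
    from warcio.exceptions import ArchiveLoadFailed

    def iterate_s3_records(bucket, key):
        # Download the remote WARC into a local temporary file, then make
        # sure the temp file is closed (and thus deleted) in all cases.
        s3 = boto3.client('s3')
        stream = TemporaryFile(mode='w+b')
        try:
            s3.download_fileobj(bucket, key, stream)
            stream.seek(0)
            for record in ArchiveIterator(stream):
                pass  # per-record processing goes here
        except ArchiveLoadFailed as exception:
            print('Invalid WARC: ' + str(exception))
        finally:
            stream.close()  # TemporaryFile is removed on close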

sebastian-nagel commented 6 years ago

I've tested the general finally: stream.close() in local mode and on a Hadoop cluster. No issues. Thanks, @lauraedelson!