commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License

HDFS Patch #5

Closed cronoik closed 3 years ago

cronoik commented 6 years ago

Hi,

the following patch allows you to read files from HDFS.

sebastian-nagel commented 3 years ago

Thanks, @cronoik! It took a long time, but I was finally able to test your PR. To resolve conflicts, I've created a new PR (#26). One additional improvement I made while testing: I keep the hdfs:// URI as is, so that all variants of specifying an HDFS file location are supported:
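
As a rough illustration of what keeping the URI unchanged can look like (this is not the code from either PR), the sketch below only inspects the scheme and hands the original string on untouched. The function name `classify_input_uri` and the set of recognized schemes are assumptions made for this example; they are not part of cc-pyspark.

```python
from urllib.parse import urlparse


def classify_input_uri(uri):
    """Hypothetical helper: decide how an input path would be read.

    Returns a (kind, uri) tuple. The URI is returned exactly as given,
    so forms with or without a host/port in the authority part are
    left for the underlying HDFS client to interpret.
    """
    scheme = urlparse(uri).scheme
    if scheme == 'hdfs':
        return ('hdfs', uri)
    if scheme in ('s3', 's3a', 's3n'):
        return ('s3', uri)
    if scheme in ('http', 'https'):
        return ('http', uri)
    # No scheme, or file:, is treated as a local path (an assumption).
    return ('local', uri)
```

Because the string is never rewritten, a caller can pass it straight to whatever reader handles the `hdfs` kind, for example `kind, uri = classify_input_uri(line.strip())` for each line of an input listing.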