commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License
406 stars, 86 forks

Update documentation to emphasize that querying the columnar index requires S3 access #44

Closed: sebastian-nagel closed this issue 7 months ago

sebastian-nagel commented 7 months ago

Review the sections related to data access schemes and the columnar index. Emphasize that querying the columnar index requires S3 access and is not possible using HTTP/HTTPS access.
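To illustrate the restriction, here is a minimal sketch of a guard that rejects HTTP/HTTPS locations before a job tries to query the columnar index. The function name `check_index_table_path` is hypothetical and not part of cc-pyspark, and the example paths are assumptions about where the index table lives, so treat this as an illustration rather than the project's actual behavior.

```python
from urllib.parse import urlparse

# URL schemes that cannot be used to query the columnar index.
HTTP_SCHEMES = {"http", "https"}


def check_index_table_path(path: str) -> str:
    """Return `path` unchanged unless it uses HTTP/HTTPS.

    Hypothetical helper, not part of cc-pyspark: it only inspects the
    URL scheme and does not verify that the location exists or that
    S3 credentials are configured.
    """
    scheme = urlparse(path).scheme.lower()
    if scheme in HTTP_SCHEMES:
        raise ValueError(
            f"Querying the columnar index is not possible over {scheme.upper()}; "
            "use an S3 path instead: " + path
        )
    return path


if __name__ == "__main__":
    # Assumed example locations, shown only for illustration.
    print(check_index_table_path("s3a://commoncrawl/cc-index/table/cc-main/warc/"))
    try:
        check_index_table_path("https://data.commoncrawl.org/cc-index/table/cc-main/warc/")
    except ValueError as err:
        print("rejected:", err)
```

A check like this could run while parsing job arguments, so that a misconfigured HTTP/HTTPS path fails early with a clear message instead of later inside Spark.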

See also the problem description on the Common Crawl group.