commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License
406 stars 86 forks

Allow to access WARC record filename and offset #6

sebastian-nagel closed 5 years ago

sebastian-nagel commented 6 years ago

See this discussion: https://groups.google.com/d/topic/common-crawl/7MuqVmvajoA/discussion

Offset and length are not part of the ArcWarcRecord but are known only to the ArchiveIterator. Ideally, it should be possible to access WARC filename, record offset and length in the process_record method.

sebastian-nagel commented 5 years ago

Actually, accessing the record offset or length causes the entire record to be consumed, so it must be done after the record has been processed.

sebastian-nagel commented 5 years ago

Implemented with 7e2f67a: by overriding the method iterate_records, the WARC record offset and length can be accessed. See #9 for an example of how this can be utilized.
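
To illustrate the ordering constraint discussed above, here is a minimal, self-contained sketch. It does not use warcio or cc-pyspark: `ToyRecord` and `ToyArchiveIterator` are hypothetical stand-ins that read records framed as a 4-byte length plus payload, and the standalone `iterate_records` function only mirrors the idea of the overridable method, not its actual signature in the project. The getter names `get_record_offset` and `get_record_length` are chosen to resemble the iterator methods referred to in this thread; treat that as an assumption.

```python
import io


class ToyRecord:
    """Hypothetical stand-in for a WARC record with a lazily read payload."""

    def __init__(self, stream, length):
        self._stream = stream
        self._remaining = length

    def read(self):
        # Reading drains the payload; a second call returns b"".
        data = self._stream.read(self._remaining)
        self._remaining = 0
        return data


class ToyArchiveIterator:
    """Hypothetical iterator over records framed as 4-byte length + payload."""

    def __init__(self, stream):
        self._stream = stream
        self._record = None
        self._offset = 0
        self._member_info = None  # (offset, length), set once the record is fully read

    def __iter__(self):
        while True:
            if self._record is not None:
                self._read_to_end()  # skip whatever the caller left unread
            self._offset = self._stream.tell()
            self._member_info = None
            header = self._stream.read(4)
            if len(header) < 4:
                return
            self._record = ToyRecord(self._stream, int.from_bytes(header, "big"))
            yield self._record

    def _read_to_end(self):
        if self._member_info is None:
            self._record.read()  # discards any unread payload
            self._member_info = (self._offset, self._stream.tell() - self._offset)

    def get_record_offset(self):
        self._read_to_end()
        return self._member_info[0]

    def get_record_length(self):
        self._read_to_end()
        return self._member_info[1]


def iterate_records(warc_uri, archive_iterator, process_record):
    """Process each record first, then ask the iterator for offset and length."""
    for record in archive_iterator:
        results = list(process_record(record))
        offset = archive_iterator.get_record_offset()
        length = archive_iterator.get_record_length()
        for res in results:
            yield warc_uri, offset, length, res
```

If the getters were called before `process_record`, the payload would already be drained and `record.read()` would return an empty byte string, which is the pitfall described in the second comment.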