sebastian-nagel closed this issue 5 years ago
Actually, accessing the record offset or length causes the entire record to be consumed, so it must be done after the record has been processed.
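To illustrate why the length only becomes available once the record has been read to its end: a `.warc.gz` file is conventionally written with each record as its own gzip member, and the compressed size of a member is not known until it has been decompressed completely. The following is a minimal stdlib-only sketch of that idea, not warcio's actual implementation; `iter_members` is a hypothetical helper and the two byte strings stand in for WARC records.

```python
import gzip
import zlib


def iter_members(data):
    """Yield (offset, length, payload) for each gzip member in `data`.

    Hypothetical helper for illustration; not part of warcio or cc-pyspark.
    """
    offset = 0
    while offset < len(data):
        # 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decomp.decompress(data[offset:])
        # Bytes following the end of this member end up in unused_data,
        # so the compressed length is known only after the member has
        # been decompressed completely.
        length = len(data) - offset - len(decomp.unused_data)
        yield offset, length, payload
        offset += length


# Two stand-in "records", each compressed as a separate gzip member.
first = gzip.compress(b"record one")
second = gzip.compress(b"record two, a bit longer")

members = list(iter_members(first + second))
for offset, length, payload in members:
    print(offset, length, payload)
```

The offset is known before reading starts, but the length falls out only at the end of the member, which is why asking for it early forces the rest of the record to be read.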
Implemented with 7e2f67a: by overriding the method iterate_records, the WARC record and offset can be accessed. See #9 for an example of how this can be utilized.
See this discussion: https://groups.google.com/d/topic/common-crawl/7MuqVmvajoA/discussion
Offset and length are not part of the ArcWarcRecord; they are known only to the ArchiveIterator. Ideally, it should be possible to access the WARC filename, record offset and record length in the process_record method.