commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

CCIndexWarcExport - Equivalent in Pyspark #5

Closed: lukaskawerau closed this issue 5 years ago

lukaskawerau commented 5 years ago

I'm currently trying to build a PySpark-based version of something like the CCIndexWarcExport utility and am struggling to get it to work properly.
My main problem is properly reading and processing the bytes that S3 returns when I request a range of a particular WARC file. As an example:

import boto3

client = boto3.client('s3',
    aws_access_key_id="key",
    aws_secret_access_key="secret")

# byte range of a single WARC record, taken from the index
offset = 928
length = 650
segment_range = "bytes=%d-%d" % (offset, offset + length - 1)
response = client.get_object(Bucket='commoncrawl',
    Key='crawl-data/CC-MAIN-2018-47/segments/1542039744381.73/crawldiagnostics/CC-MAIN-20181118135147-20181118160608-00061.warc.gz',
    Range=segment_range)

This request to S3 works: I get a proper botocore.response.StreamingBody object back in the Body of the response. But where do I go from here to read the contents of this response? If I understand the code in CCIndexWarcExport correctly (and I probably don't, because I don't know Java at all), the byte response is merged into the existing DataFrame as-is, correct?
This DataFrame is then saved as new WARC files, to be read again(?).
However, if I do not want to save the data as new WARC files but want to continue processing it in memory, how do I handle the bytes that I extracted?
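My best guess so far (untested) is to parse the fetched range with the warcio library, roughly like this:

from io import BytesIO
from warcio.archiveiterator import ArchiveIterator

# response is the get_object() result from above;
# the requested range covers exactly one gzipped WARC record
raw = response['Body'].read()
for record in ArchiveIterator(BytesIO(raw)):
    if record.rec_type == 'response':
        url = record.rec_headers.get_header('WARC-Target-URI')
        html = record.content_stream().read()  # payload (HTML) bytes

But I have no idea whether this is the intended approach.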
Is there documentation anywhere that I should look at? I'm a bit stumped as to how to proceed and would appreciate any help!

sebastian-nagel commented 5 years ago

Hi @lukaskawerau, the code already exists in the cc-pyspark project: see sparkcc.py, class CCIndexWarcSparkJob. CCIndexWordCount gives an example of how to process the HTML of the selected WARC records. Sorry, I should link the two projects' READMEs so that people searching for Python examples can find them.
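From memory (a sketch, untested; the class name is made up), a custom job following that pattern looks roughly like this:

from sparkcc import CCIndexWarcSparkJob

class PageSizeJob(CCIndexWarcSparkJob):
    """Toy example: yield URL and payload size of each selected page."""
    name = "PageSizeJob"

    def process_record(self, record):
        # record is a warcio WARC record, selected via the columnar index
        if record.rec_type != 'response':
            return
        page = record.content_stream().read()  # raw HTML bytes
        # ... your own processing goes here ...
        yield record.rec_headers.get_header('WARC-Target-URI'), len(page)

if __name__ == '__main__':
    job = PageSizeJob()
    job.run()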

> I do not want to save the data as new WARC files

I've also tried to write WARC files from PySpark but gave up: the WarcFileOutputFormat would need to be available on the Java side, and it seems impossible to write binary files from PySpark directly.
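If you really need WARC output from Python, the only workaround I can think of (again a sketch, untested) is to bypass Spark's output formats and write the files yourself, e.g. one per partition, with warcio's WARCWriter:

from io import BytesIO
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

def write_warc(path, pages):
    # pages: iterable of (url, payload_bytes) pairs
    with open(path, 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        for url, payload in pages:
            http_headers = StatusAndHeaders('200 OK',
                [('Content-Type', 'text/html')],
                protocol='HTTP/1.1')
            record = writer.create_warc_record(url, 'response',
                payload=BytesIO(payload),
                http_headers=http_headers)
            writer.write_record(record)

You'd then have to move the resulting files to their final location yourself.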

lukaskawerau commented 5 years ago

Ah, perfect, thank you!

sebastian-nagel commented 5 years ago

Ok, let me know if you need more help. We could even meet locally; it looks like we live in the same town.