Closed lukaskawerau closed 5 years ago
Hi @lukaskawerau, the code is already there in the project cc-pyspark, see sparkcc.py, class CCIndexWarcSparkJob. The CCIndexWordCount job gives an example of how to process the HTML from the selected WARC records. Sorry, I should link the READMEs of both projects so that people searching for Python examples can find them.
I do not want to save the data as new WARC files
I've also tried to write WARC files from PySpark but gave up: the WarcFileOutputFormat would need to be available, and it seems it isn't possible to write binary files from PySpark directly.
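One possible workaround, sketched here under stated assumptions: because a `.warc.gz` file is (per the WARC spec) just a concatenation of independently gzipped records, Python workers can serialize records to bytes themselves (e.g. inside `mapPartitions`) and write them with ordinary file APIs, bypassing WarcFileOutputFormat. The minimal header set below is my own assumption; a real job should rather use warcio's `WARCWriter`.

```python
import gzip
from datetime import datetime, timezone
from uuid import uuid4

def warc_response_record(uri, http_payload):
    """Serialize one WARC 'response' record as an independent gzip member.

    A .warc.gz file is a concatenation of such members, so each worker can
    emit bytes that are simply concatenated into an output file.
    http_payload is the raw HTTP response (status line, headers, body).
    """
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        "WARC-Target-URI: {}\r\n"
        "WARC-Date: {}\r\n"
        "WARC-Record-ID: <urn:uuid:{}>\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        "Content-Length: {}\r\n"
        "\r\n"
    ).format(
        uri,
        datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        uuid4(),
        len(http_payload),
    )
    # Record block, followed by the two CRLFs that terminate a WARC record.
    return gzip.compress(headers.encode("utf-8") + http_payload + b"\r\n\r\n")
```

In a PySpark job this could look like `rdd.mapPartitions(...)` producing these byte strings, with each partition's bytes concatenated and written out from Python.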
Ah, perfect, thank you!
Ok, let me know if you need more help. We could even meet locally - it looks like we live in the same town.
I'm currently trying to build a pyspark-based version of something like the CCIndexWarcExport utility and am struggling to get it to work properly.
My main problem is properly reading/processing the bytes that I get back from S3 when I request part of a particular WARC file. As an example:
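(The original snippet did not survive; presumably it was a byte-range GET against the public `commoncrawl` bucket, roughly along these lines. The key and the offsets below are placeholders, not values from the post; in practice they come from the columnar index columns `warc_filename`, `warc_record_offset` and `warc_record_length`.)

```python
def byte_range(offset, length):
    """HTTP Range header value for one WARC record (inclusive end offset)."""
    return "bytes={}-{}".format(offset, offset + length - 1)

if __name__ == "__main__":
    # Network call, so kept out of module import.
    import boto3

    s3 = boto3.client("s3")
    resp = s3.get_object(
        Bucket="commoncrawl",
        Key="crawl-data/CC-MAIN-.../segments/.../warc/....warc.gz",  # placeholder
        Range=byte_range(12345, 6789),  # placeholder offsets
    )
    data = resp["Body"].read()  # resp["Body"] is a botocore StreamingBody
```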
This request to S3 works: I get a proper `botocore.response.StreamingBody` object back in the `Body` of the response.
of the response. But where do I go from here to read the contents of this response? If I understand the code in CCIndexWarcExport correctly (and I probably don't because I don't know Java at all) what's happening is that the byte response is merged to the existing dataframe as is, correct?This dataframe is then saved as new WARC files to be read again (?).
However, if I do not want to save the data as new WARC files but instead keep processing it in memory, how do I handle the bytes that I extracted?
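For reference, the bytes returned by such a ranged request are a single gzip member containing one WARC record. warcio (which cc-pyspark uses) parses this robustly via `ArchiveIterator`; a stdlib-only sketch of the unpacking, to show what the bytes contain:

```python
import gzip

def parse_warc_record(raw):
    """Split one gzipped WARC 'response' record into its three parts:
    WARC headers, HTTP headers, and the payload (e.g. the HTML bytes).

    Simplified sketch: assumes an uncompressed, non-chunked HTTP body;
    use warcio's ArchiveIterator for real data.
    """
    record = gzip.decompress(raw)
    warc_headers, _, http_block = record.partition(b"\r\n\r\n")
    http_headers, _, payload = http_block.partition(b"\r\n\r\n")
    # Strip the CRLFs that terminate the WARC record.
    return warc_headers, http_headers, payload.rstrip(b"\r\n")
```

With warcio the equivalent is `ArchiveIterator(io.BytesIO(data))`, which also handles transfer encodings and malformed records.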
Is there documentation anywhere that I should look at? I'm a bit stumped as to how to proceed and would appreciate any help!