commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

News WARC files processing issue. #11

Closed sandeepsingh closed 7 years ago

sandeepsingh commented 8 years ago

Basically, I am trying to iterate over the records of a news WARC file to get the HTML content and process it. I am using the Python warc package; this is the snippet that reads the WARC file:

import warc

f = warc.open("CC-NEWS-20161001224340-00008.warc")
for record in f:
    if record['Content-Type'] == 'application/http; msgtype=response':
        payload = record.payload.read()
        headers, body = payload.split('\r\n\r\n', 1)
        if 'Content-Type: text/html' in headers:
            # do my processing with html content (body)
            pass

But when I run this, I get this error:

Traceback (most recent call last):
  File "warc_process.py", line 69, in <module>
    read_entire_warc("CC-NEWS-20160926211809-00000.warc")
  File "warc_process.py", line 54, in read_entire_warc
    for record in f:
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 360, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'WARC/1.0\r\n'

Sample WARC files I am facing issues with:

CC-NEWS-20160926211809-00000.warc
CC-NEWS-20161001122244-00007.warc.gz
CC-NEWS-20161001224340-00008.warc.gz
CC-NEWS-20161002224346-00009.warc.gz
CC-NEWS-20161003130443-00010.warc.gz
CC-NEWS-20161004130444-00011.warc.gz
CC-NEWS-20161005130450-00012.warc.gz
CC-NEWS-20161005152607-00013.warc.gz
CC-NEWS-20161006152607-00014.warc.gz
CC-NEWS-20161006191324-00015.warc.gz
CC-NEWS-20161007191326-00016.warc.gz
CC-NEWS-20161008015559-00017.warc.gz
CC-NEWS-20161009015614-00018.warc.gz
CC-NEWS-20161010001731-00019.warc.gz

sebastian-nagel commented 8 years ago

Hi @sandeepsingh,

Confirmed. Even the second record cannot be read:

% pip show warc | grep '^Version:'
Version: 0.2.1
% python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import warc
>>> f = warc.open("CC-NEWS-20161001224340-00008.warc.gz")
>>> record = f.read_record()
>>> print(record.type)
warcinfo
>>> record = f.read_record()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 276, in read_record
    return self.reader.read_record()
  File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 360, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found ''
>>> 

WARC files from the main crawl are read successfully.
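The traceback is consistent with the reader not finding the record separator where the Content-Length header says it should be. In the WARC 1.0 layout, a record is a header block, a blank line, a content block of exactly Content-Length bytes, and then two CRLF pairs. The following is only a minimal sketch of that framing check on the first record of an already decompressed byte string; `first_record_is_well_framed` is a hypothetical helper written for illustration, not part of the warc module.

```python
def first_record_is_well_framed(data: bytes) -> bool:
    """Return True if the first WARC record in `data` ends where its Content-Length says."""
    # The WARC header block ends at the first blank line (CRLF CRLF).
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep or not head.startswith(b"WARC/"):
        raise ValueError("data does not start with a WARC record")

    # Look up the Content-Length field of the WARC header (case-insensitive).
    length = None
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("WARC header has no Content-Length field")

    # After exactly `length` bytes of content, two CRLF pairs must follow.
    return rest[length:length + 4] == b"\r\n\r\n"
```

If the Content-Length is 2 too large, the check fails because the position where the separator is expected already runs into the start of the next record.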

sebastian-nagel commented 8 years ago

Other tools succeed in reading this WARC file, e.g., ia-web-commons, cf. #2.

sebastian-nagel commented 8 years ago

Tracked in DigitalPebble/sc-warc#3.

@sandeepsingh, as a temporary workaround, in case the WARC files are copied and unpacked anyway: if the value of the first Content-Length field is decremented by 2, the warc module should be able to read the file. This can be done by:

zcat CC-NEWS-20161001224340-00008.warc.gz  \
  | perl -lpe 's/^(Content-Length:\s+)(\d+)/$1.($2-2)/e..EOF' \
  >CC-NEWS-20161001224340-00008.warc

Thanks for reporting the issue. I hope to deploy a fix within the next few days and will then try to "repair" all existing news crawl archives.
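For those who prefer to stay in Python, the Perl one-liner in this comment could be approximated as follows. This is a rough sketch under the same assumption as the workaround (only the first Content-Length field in the file needs to be reduced by 2); the function names `decrement_first_content_length` and `repair_file` are made up for this example.

```python
import gzip
import re

# Same pattern as in the Perl one-liner, applied to one binary line at a time.
CONTENT_LENGTH = re.compile(rb"^(Content-Length:\s+)(\d+)")


def decrement_first_content_length(src, dst, delta=2):
    """Copy binary stream `src` to `dst`, reducing only the first Content-Length by `delta`.

    Returns True if a Content-Length line was found and rewritten.
    """
    fixed = False
    for line in src:
        if not fixed:
            match = CONTENT_LENGTH.match(line)
            if match:
                new_value = str(int(match.group(2)) - delta).encode("ascii")
                # Keep the field name and the rest of the line (e.g. CRLF) unchanged.
                line = match.group(1) + new_value + line[match.end():]
                fixed = True
        dst.write(line)
    return fixed


def repair_file(gz_path, out_path):
    """Decompress `gz_path` and write the adjusted, uncompressed WARC to `out_path`."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        return decrement_first_content_length(src, dst)
```

A call such as `repair_file("CC-NEWS-20161001224340-00008.warc.gz", "CC-NEWS-20161001224340-00008.warc")` would mirror the shell pipeline. The data is streamed line by line, so the file does not have to fit into memory.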

jnioche commented 8 years ago

Thanks @sandeepsingh for reporting the problem and @sebastian-nagel for fixing it.

sebastian-nagel commented 8 years ago

The fix for DigitalPebble/sc-warc#3 is deployed. TODO: reformat the old WARC files.

@sandeepsingh: I've tested the first "good" WARC file (s3://commoncrawl/crawl-data/CC-NEWS/2016/10/CC-NEWS-20161017145313-00000.warc.gz) using the warc module. To read it completely you need to work around internetarchive/warc#21.

jnioche commented 7 years ago

@sandeepsingh @sebastian-nagel close?

sebastian-nagel commented 7 years ago

Yes, for SC the issue is solved. Keep it open until the old WARCs (60 files from August until mid-October) are fixed, so that nobody reports the same problem again.

RahulGuptaIIITA commented 7 years ago

@sandeepsingh I'm trying to parse a warc.gz file using the warc library. I'm getting a similar error: IOError: Expected '\r\n', found 'WARC/1.0\r\n'. Could you please tell me how to fix it?

sebastian-nagel commented 7 years ago

Hi @RahulGuptaIIITA, could you share more details? Which library and version, which programming language, and which WARC files cause the error? Without being able to reproduce the problem I cannot even try to fix it. Thanks!

RahulGuptaIIITA commented 7 years ago

@sebastian-nagel Actually, the WARC file I'm trying to process is a different one. But I'm facing the same issue that @sandeepsingh mentioned.

I'm using the Python library warc to parse it. Apologies for providing so little information. Still, if you can help me with this general issue, it would be a big help. Thanks!

sebastian-nagel commented 7 years ago

I'm aware that about 60 WARC files of the news crawler are still invalid and fail to parse using the warc module, see the comment above. I hope to get them fixed; if not, I'll delete them. But the point is: your problem can be caused by an invalid WARC file or by a bug in the warc module. If you cannot share the WARC file, could you try another module: warcio. In my experience it's more reliable. The warc module can only be used with an external decompressor because of the open bug internetarchive/warc#21. If warcio also fails, the problem is likely caused by a broken WARC file.

sandeepsingh commented 7 years ago

Ignore the first two months of CC-NEWS data; start from January 2017.

WARCIO also does not work.

What I did: caught the errors raised for bad records as exceptions and skipped those records during parsing.
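A library-agnostic sketch of that skip-on-error loop might look like the following. It assumes a reader callable that returns the next record, returns `None` at the end of the file, raises `IOError` for a record it cannot parse, and can continue reading afterwards; whether a given WARC library really recovers after such an error is not guaranteed. The name `iter_records_skipping_errors` and the `max_errors` limit are hypothetical.

```python
def iter_records_skipping_errors(read_record, max_errors=100):
    """Yield records from `read_record()`, skipping those that raise IOError.

    `read_record` is assumed to return None at end of input. After more than
    `max_errors` failures the last error is re-raised to avoid looping forever
    on a reader that cannot recover.
    """
    errors = 0
    while True:
        try:
            record = read_record()
        except IOError:  # alias of OSError on Python 3
            errors += 1
            if errors > max_errors:
                raise
            continue
        if record is None:
            return
        yield record
```

With the warc module this could be called as `iter_records_skipping_errors(f.read_record)`, again assuming the reader is still usable after a failed record.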

Reason for the error: I think there was some issue with the initial setup of the CC StormCrawler when it was first deployed for CC-NEWS crawling, replacing legacy Nutch.

RahulGuptaIIITA commented 7 years ago

@sebastian-nagel Thanks for the information. @sandeepsingh I tried using WARCIO. It worked fine on the 'arc' format but fails on 'warc'.

The error I'm getting is shown below:

raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.recordloader.ArchiveLoadFailed: Unknown archive format, first line: ['WARC-Type:', 'response']

I tried catching the exception, but no luck.

RahulGuptaIIITA commented 7 years ago

@sandeepsingh I found the issue and fixed it. The file was getting corrupted while I was unzipping it. I read the records directly from the zipped file instead and it worked fine. Thanks!
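A quick way to spot this kind of damage is to look at the first line of the file, which for a WARC file should be a version line such as `WARC/1.0`. The sketch below uses only the standard library and works on both compressed and uncompressed files; `first_line` and `looks_like_warc` are hypothetical helper names, and the check is only a rough sanity test, not a validation of the whole file.

```python
import gzip


def first_line(path):
    """Return the first line of a WARC file, decompressing it if it is gzipped."""
    with open(path, "rb") as f:
        magic = f.read(2)
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as f:
        return f.readline()


def looks_like_warc(path):
    """Return True if the file starts with a WARC version line."""
    return first_line(path).startswith(b"WARC/")
```

A file whose first line is `WARC-Type: response`, as in the warcio error above, would fail this check.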

sebastian-nagel commented 7 years ago

@sandeepsingh: all invalid CC-NEWS WARC files from Sept/Oct 2016 are now fixed. Closing this issue. Thanks!

@RahulGuptaIIITA: OK, understood. Decompressing WARC files using gzip or gunzip should be safe. However, recompressing may cause harm: some tools fail if WARC files are not compressed per record.
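To illustrate what "compressed per record" means: each record is written as its own gzip member and the members are concatenated, so a plain gunzip still yields the full file while a reader can also decompress one record at a time. The sketch below shows the idea with the standard library only; the helper names `compress_per_record` and `iter_gzip_members` are assumptions made for this example, and real WARC writers may differ in detail.

```python
import gzip
import zlib


def compress_per_record(records):
    """Compress each record as a separate gzip member and concatenate the members."""
    return b"".join(gzip.compress(record) for record in records)


def iter_gzip_members(data):
    """Yield the decompressed content of each gzip member found in `data`."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=31)
        content = decompressor.decompress(data) + decompressor.flush()
        yield content
        # Bytes after the end of this member belong to the next one.
        data = decompressor.unused_data
```

Recompressing a whole decompressed file with a single `gzip` call produces one member for the entire file, which is the case some tools cannot handle.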