Closed sandeepsingh closed 7 years ago
Hi @sandeepsingh,
confirmed. Already the second record isn't successfully read:
% pip show warc | grep '^Version:'
Version: 0.2.1
% python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import warc
>>> f = warc.open("CC-NEWS-20161001224340-00008.warc.gz")
>>> record = f.read_record()
>>> print(record.type)
warcinfo
>>> record = f.read_record()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 276, in read_record
return self.reader.read_record()
File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 364, in read_record
self.finish_reading_current_record()
File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 360, in finish_reading_current_record
self.expect(self.current_payload.fileobj, "\r\n")
File "/usr/local/lib/python2.7/dist-packages/warc/warc.py", line 352, in expect
raise IOError(message)
IOError: Expected '\r\n', found ''
>>>
WARC files from the main crawl are read successfully.
Other tools succeed reading this WARC file, e.g., ia-web-commons, cf. #2.
Tracked in DigitalPebble/sc-warc#3.
@sandeepsingh, as a temporary work-around and in case the WARC files are copied and unpacked anyway: if the value of the first Content-Length field is decremented by 2, the warc module should be able to read the file. This can be done by:
zcat CC-NEWS-20161001224340-00008.warc.gz \
| perl -lpe 's/^(Content-Length:\s+)(\d+)/$1.($2-2)/e..EOF' \
>CC-NEWS-20161001224340-00008.warc
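For those who prefer to stay in Python, the same workaround might look roughly like the sketch below. It is not the fix that was deployed; it only mirrors the Perl one-liner above by lowering the first Content-Length value by 2 while unpacking. The function names and the `delta` parameter are made up for illustration, and the whole archive is read into memory.

```python
import gzip
import re

# A Content-Length header line at the start of a line, similar to the Perl pattern.
CONTENT_LENGTH = re.compile(rb"^(Content-Length:[ \t]+)(\d+)", re.MULTILINE)


def decrement_first_content_length(data, delta=2):
    """Return data with only the first Content-Length value reduced by delta."""
    def repl(match):
        new_value = int(match.group(2)) - delta
        return match.group(1) + str(new_value).encode("ascii")
    # count=1 restricts the change to the first header (the warcinfo record).
    return CONTENT_LENGTH.sub(repl, data, count=1)


def unpack_and_patch(src_gz, dst):
    """Decompress src_gz, patch the first Content-Length, write the result to dst."""
    # Hypothetical helper: loads everything into memory, fine for a sketch only.
    with gzip.open(src_gz, "rb") as fin:
        data = fin.read()
    with open(dst, "wb") as fout:
        fout.write(decrement_first_content_length(data))
```

Usage would be something like `unpack_and_patch("CC-NEWS-20161001224340-00008.warc.gz", "CC-NEWS-20161001224340-00008.warc")`.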
Thanks for reporting the issue. I hope to deploy a fix within the next few days and will then try to "repair" all existing news crawl archives.
Thanks @sandeepsingh for reporting the problem and @sebastian-nagel for fixing it.
The fix for DigitalPebble/sc-warc#3 is deployed. TODO: reformat the old WARC files.
@sandeepsingh: I've tested the first "good" WARC file (s3://commoncrawl/crawl-data/CC-NEWS/2016/10/CC-NEWS-20161017145313-00000.warc.gz) using the warc module. To read it completely you need to work around internetarchive/warc#21.
@sandeepsingh @sebastian-nagel close?
Yes, for SC the issue is solved. Keep it open until the old WARCs (60 files from August until mid-October) are fixed, so that nobody reports the same problem again.
@sandeepsingh I'm trying to parse a warc.gz file using the warc library. I'm getting a similar error: IOError: Expected '\r\n', found 'WARC/1.0\r\n'. Could you please tell me how to fix it?
Hi @RahulGuptaIIITA, could you share more details? Which library and version, which programming language, and which WARC files cause the error? Without being able to reproduce the problem I cannot even try to fix it. Thanks!
@sebastian-nagel Actually, the WARC file I'm trying to process is a different one, but I'm facing the same issue that @sandeepsingh mentioned.
I'm using the Python library warc to parse it. Apologies for providing so little information. If you can still help me with this general issue, it would be a big help. Thanks!
I'm aware that about 60 WARC files of the news crawler are still invalid and fail to parse with the warc module; see the comment above. I hope to get them fixed; if not, I'll delete them. But the point is: your problem can be caused by either an invalid WARC file or a bug in the warc module. If you cannot share the WARC file, could you try another module, warcio? In my experience it's more reliable; the warc module can only be used with an external decompressor because of the open bug internetarchive/warc#21. If warcio also fails, the problem is likely caused by a broken WARC file.
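A minimal sketch of what trying warcio could look like, assuming the package is installed (`pip install warcio`) and that the goal is the HTML payload of response records. The helper names `is_html` and `iter_html_responses` are made up for illustration; `ArchiveIterator` reads both compressed and uncompressed WARC files.

```python
def is_html(content_type):
    """True if an HTTP Content-Type header value denotes HTML (case-insensitive)."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";", 1)[0].strip().lower() == "text/html"


def iter_html_responses(path):
    """Yield (url, body_bytes) for every HTML response record in a WARC file."""
    # warcio is a third-party package; it is imported here so that is_html()
    # stays usable even when warcio is not installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # Skip warcinfo/request/metadata records and non-HTTP responses.
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if is_html(record.http_headers.get_header("Content-Type")):
                url = record.rec_headers.get_header("WARC-Target-URI")
                yield url, record.content_stream().read()
```

Something like `for url, body in iter_html_responses("CC-NEWS-20161017145313-00000.warc.gz"): ...` would then loop over the HTML pages. If this also raises an error on a given file, that points towards a broken WARC file rather than a bug in one particular library.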
Ignore the first two months of CC-NEWS data; start from January 2017.
WARCIO also does not work.
What I did: caught the exceptions raised by the faulty records and skipped those records when parsing.
Reason for the error: I think there was an issue in the early operation of CC's storm-crawler, which was deployed for the first time for CC-NEWS crawling, replacing legacy Nutch.
@sebastian-nagel Thanks for the information. @sandeepsingh I tried using WARCIO. It worked fine on the 'arc' format but fails on 'warc'.
The error I'm getting is shown below.
raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.recordloader.ArchiveLoadFailed: Unknown archive format, first line: ['WARC-Type:', 'response']
I tried catching the exception, but no luck.
@sandeepsingh I found the problem and fixed it. The file was getting corrupted while I was unzipping it. I read the records directly from the zipped file instead, and it worked fine. Thanks!
@sandeepsingh: all invalid CC-NEWS WARC files from Sept/Oct 2016 are now fixed. Closing this issue. Thanks!
@RahulGuptaIIITA: OK, understood. Decompressing WARC files using gzip or gunzip should be safe. However, recompressing may cause problems: some tools fail if WARC files are not compressed per record.
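A rough way to check whether a .warc.gz file is still compressed per record is to count its gzip members: a per-record compressed file is a concatenation of many members, while a file recompressed in one go is a single member. The sketch below uses only the standard library; the function names are made up for illustration, and the check is only a heuristic (a valid file holding a single record also has just one member).

```python
import zlib


def count_gzip_members(path, limit=2, chunk_size=64 * 1024):
    """Count complete gzip members in a file, stopping once limit is reached."""
    members = 0
    with open(path, "rb") as f:
        # MAX_WBITS | 16 tells zlib to expect gzip framing.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        buf = f.read(chunk_size)
        while buf:
            decomp.decompress(buf)  # decompressed output is discarded
            if decomp.eof:
                # One member ended; the rest of buf belongs to the next member.
                members += 1
                if members >= limit:
                    break
                buf = decomp.unused_data
                decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
                if not buf:
                    buf = f.read(chunk_size)
            else:
                buf = f.read(chunk_size)
    return members


def looks_per_record_compressed(path):
    """Heuristic: more than one gzip member suggests per-record compression."""
    return count_gzip_members(path) > 1
```

The default `limit=2` keeps the check fast, since reading past the second member is not needed to answer the question.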
Basically, I am trying to iterate over the records of a news WARC file to get the HTML content and process it. I am using this snippet with the Python warc package to read the WARC file:
import warc
f = warc.open("CC-NEWS-20161001224340-00008.warc")
for record in f:
    # only HTTP response records carry the fetched page
    if record['Content-Type'] == 'application/http; msgtype=response':
        payload = record.payload.read()
        # split the HTTP headers from the body at the first blank line
        headers, body = payload.split('\r\n\r\n', 1)
        if 'Content-Type: text/html' in headers:
            pass  # do my processing with html content (body)
But when I run this, I get this error:
Traceback (most recent call last):
File "warc_process.py", line 69, in <module>
read_entire_warc("CC-NEWS-20160926211809-00000.warc")
File "warc_process.py", line 54, in read_entire_warc
for record in f:
File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
record = self.read_record()
File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
self.finish_reading_current_record()
File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 360, in finish_reading_current_record
self.expect(self.current_payload.fileobj, "\r\n")
File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 352, in expect
raise IOError(message)
IOError: Expected '\r\n', found 'WARC/1.0\r\n'
Sample WARC files I'm facing issues with:
CC-NEWS-20160926211809-00000.warc
CC-NEWS-20161001122244-00007.warc.gz
CC-NEWS-20161001224340-00008.warc.gz
CC-NEWS-20161002224346-00009.warc.gz
CC-NEWS-20161003130443-00010.warc.gz
CC-NEWS-20161004130444-00011.warc.gz
CC-NEWS-20161005130450-00012.warc.gz
CC-NEWS-20161005152607-00013.warc.gz
CC-NEWS-20161006152607-00014.warc.gz
CC-NEWS-20161006191324-00015.warc.gz
CC-NEWS-20161007191326-00016.warc.gz
CC-NEWS-20161008015559-00017.warc.gz
CC-NEWS-20161009015614-00018.warc.gz
CC-NEWS-20161010001731-00019.warc.gz