Closed: robertoea closed this issue 8 years ago
I have the same issue, really appreciate your help. (I have been trying to figure it out but no luck so far) Thank you for a great library.
Hello, sorry for not responding quickly. I'm currently in the process of moving to another house and haven't had much time for programming. Personally, I haven't yet stumbled upon WARC files which lack this field. Could you please provide a link to a sample file - or steps to produce such files?
No worries, thank you for replying. Moving house is very stressful! Hope it goes smoothly. I was testing it against a WARC file from the Common Crawl project - http://commoncrawl.org/2016/06/may-2016-crawl-archive-now-available/ - Happy to help in any way I can.
This is the actual file I was testing against (it is 1 GB in size): https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-22/segments/1464049270134.8/warc/CC-MAIN-20160524002110-00000-ip-10-185-217-139.ec2.internal.warc.gz
One option to produce such files is to take any warc.gz file, uncompress it, and then compress it with any generic gzipping tool (as the extra field is warc-specific).
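For anyone who wants to reproduce this, here is a minimal sketch of that recompression step in Python (not the library's own C#). The function name and file paths are placeholders. It relies on Python's `gzip` module, which reads all concatenated members and writes a single plain member without an extra field.

```python
import gzip
import shutil


def recompress_as_plain_gzip(src_path, dst_path):
    """Decompress src_path and recompress it as one generic gzip member.

    The output is written by Python's gzip module, which does not emit
    the gzip "extra" header field, so any WARC-specific extra field
    present in the source is dropped.
    """
    # gzip.open transparently reads every concatenated member of the source.
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Stream in chunks so a 1 GB archive never has to fit in memory.
        shutil.copyfileobj(src, dst)
```

Usage would look like `recompress_as_plain_gzip("sample.warc.gz", "sample-noextra.warc.gz")`, with both file names being hypothetical.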
I rewrote the scanning/indexing of the WARC files and git-pushed it. Files without the extra field can be opened now. However, since I have not (yet) found a way to efficiently navigate around the file, opening it will take some time. Files from the Common Crawl project take about 3-5 minutes to open on my machine.
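To illustrate why opening is slow without the extra field, here is a rough Python sketch of a sequential scan. It is an assumption about the general technique, not the code that was pushed to the library. With no size hint in the header, the only way to find where each gzip member starts is to inflate the whole file and note where each member ends. `scan_member_offsets` is a made-up name.

```python
import zlib


def scan_member_offsets(stream, chunk_size=65536):
    """Return the byte offset of every gzip member in a binary stream.

    Each member is inflated only to discover where it ends; the
    decompressed bytes are discarded.
    """
    offsets = []
    pos = 0          # absolute offset of the first byte in buf
    buf = b""
    decomp = None
    while True:
        if not buf:
            buf = stream.read(chunk_size)
            if not buf:
                break
        if decomp is None:
            # wbits=31 tells zlib to expect a gzip header and trailer.
            decomp = zlib.decompressobj(31)
            offsets.append(pos)
        decomp.decompress(buf)
        if decomp.eof:
            # Bytes after the end of this member belong to the next one.
            leftover = decomp.unused_data
            pos += len(buf) - len(leftover)
            buf = leftover
            decomp = None
        else:
            pos += len(buf)
            buf = b""
    return offsets
```

Because every byte has to pass through the inflater, the cost grows with the size of the archive, which fits the multi-minute open times reported for a 1 GB file.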
I'm considering adding support for the common crawl project's WAT files - since that will probably be faster :)
Thanks, I can confirm that files without the extra field can be opened now. It is normal and expected not to be able to efficiently navigate around the file without the extra field as it is indeed the purpose of the extra field.
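For reference, a small Python sketch of how a reader can check whether a gzip member carries an extra field at all. It follows the header layout in RFC 1952: the FEXTRA bit in the flags byte, then a 2-byte little-endian length, then a list of subfields. The thread does not say which subfield ID or payload layout the WARC writer uses, so the function only returns the raw subfields. The function name is hypothetical.

```python
import struct

FEXTRA = 0x04  # flag bit for the optional extra field (RFC 1952)


def read_extra_subfields(header_bytes):
    """Return {subfield_id: payload} for a gzip member header.

    An empty dict means the member has no extra field, which is the
    case that forces a full sequential scan.
    """
    if len(header_bytes) < 10 or header_bytes[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    flags = header_bytes[3]
    if not flags & FEXTRA:
        return {}
    # XLEN follows the fixed 10-byte header.
    (xlen,) = struct.unpack_from("<H", header_bytes, 10)
    extra = header_bytes[12:12 + xlen]
    subfields = {}
    pos = 0
    while pos + 4 <= len(extra):
        sub_id = extra[pos:pos + 2]                      # SI1, SI2
        (sub_len,) = struct.unpack_from("<H", extra, pos + 2)
        subfields[sub_id] = extra[pos + 4:pos + 4 + sub_len]
        pos += 4 + sub_len
    return subfields
```

A reader that gets an empty result can fall back to scanning instead of trusting a size value that was never written.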
Supporting WAT files is a great idea.
If a gzipped WARC file doesn't contain the extra field, line 84 in `GzipHeader.cs` will fail, as `CompressedSize` will be 0 and `br.BaseStream.Position` can't be negative.