feyris-tan / WarcSharp

A C# Library for working with WARC files.

Cannot read files that don't have WARC_EXTRA_DATA #1

Closed robertoea closed 8 years ago

robertoea commented 8 years ago

If a gzipped WARC file doesn't contain the extra field, line 84 in GzipHeader.cs fails, because CompressedSize is 0 and br.BaseStream.Position can't be negative.
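For illustration, here is a minimal Python sketch (not WarcSharp code) of checking whether a gzip member carries an extra field at all. It assumes only the header layout from RFC 1952, where bit 2 of the FLG byte (FEXTRA) signals an extra field; the function name `read_gzip_extra` is hypothetical.

```python
import struct

FEXTRA = 0x04  # FLG bit 2 in RFC 1952: an extra field follows the fixed header


def read_gzip_extra(stream):
    """Return the raw extra-field bytes of the gzip member at the current
    stream position, or None if the member has no extra field."""
    header = stream.read(10)  # ID1, ID2, CM, FLG, MTIME(4), XFL, OS
    if len(header) < 10 or header[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = header[3]
    if not flags & FEXTRA:
        return None  # generic gzip tools usually land here
    raw_len = stream.read(2)
    if len(raw_len) < 2:
        raise ValueError("truncated gzip header")
    (xlen,) = struct.unpack("<H", raw_len)  # XLEN is little-endian
    return stream.read(xlen)
```

A reader that gets `None` here has no stored size to rely on and would need a fallback instead of seeking by a size of 0.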

codeprod commented 8 years ago

I have the same issue and would really appreciate your help. I have been trying to figure it out, but no luck so far. Thank you for a great library.

feyris-tan commented 8 years ago

Hello, sorry for not responding quickly. I'm currently in the process of moving to another house and haven't had much time for programming. Personally, I haven't yet stumbled upon WARC files that lack this field. Could you please provide a link to a sample file, or steps to produce such files?

codeprod commented 8 years ago

No worries, thank you for replying. Moving house is very stressful! Hope it goes smoothly. I was testing it against a WARC file from the Common Crawl project: http://commoncrawl.org/2016/06/may-2016-crawl-archive-now-available/ - happy to help in any way I can.

codeprod commented 8 years ago

This is the actual file I was testing against (it is 1 GB in size): https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-22/segments/1464049270134.8/warc/CC-MAIN-20160524002110-00000-ip-10-185-217-139.ec2.internal.warc.gz

robertoea commented 8 years ago

One option to produce such files is to take any warc.gz file, uncompress it, and then compress it with any generic gzipping tool (as the extra field is warc-specific).
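A minimal Python sketch of that approach, using only the standard `gzip` and `shutil` modules; the function name and paths are hypothetical. Python's `gzip` writer does not emit an extra field, so the output can serve as a test file for this issue.

```python
import gzip
import shutil


def recompress_without_extra(src_path, dst_path):
    """Decompress src_path (a .warc.gz) and recompress it with a generic
    gzip writer, which does not write any WARC-specific extra field."""
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
```

Note that this writes the whole archive as a single gzip member, so any per-record member boundaries in the input are lost as well.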

feyris-tan commented 8 years ago

I rewrote the scanning/indexing of the WARC files and pushed it. Files without the extra field can be opened now. However, since I have not (yet) found a way to efficiently navigate around the file, opening it will take some time. Files from the Common Crawl project take about 3-5 minutes to open on my machine.

I'm considering adding support for the common crawl project's WAT files - since that will probably be faster :)

robertoea commented 8 years ago

Thanks, I can confirm that files without the extra field can be opened now. It is normal and expected that the file can't be navigated efficiently without the extra field, since enabling that is precisely the field's purpose.

Supporting WAT files is a great idea.
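To illustrate why opening is slow without the extra field, here is a minimal Python sketch (not the WarcSharp implementation) of indexing by sequential scan. It assumes each record is stored as its own gzip member, as is conventional for .warc.gz files, and that the data fits in memory; the function name is hypothetical. The start of the next member is only known once the current one has been fully decompressed.

```python
import zlib


def index_gzip_members(data):
    """Return the byte offset of every gzip member in `data`.

    Without a stored size, each member has to be decompressed to its end
    before the next member's offset is known."""
    view = memoryview(data)
    offsets = []
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(wbits=31)  # wbits=31: expect gzip header and trailer
        d.decompress(view[pos:])          # output is discarded; only the end matters
        if not d.eof:
            raise ValueError("truncated gzip member at offset %d" % pos)
        offsets.append(pos)
        pos = len(data) - len(d.unused_data)  # bytes left over belong to later members
    return offsets
```

For a 1 GB archive a real implementation would read in chunks rather than hold everything in memory, but the cost is the same in kind: the entire file is decompressed once just to build the index.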