commoncrawl / cc-warc-examples

CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

Unexpected end of ZLIB input stream #4

Closed mazzespazze closed 5 years ago

mazzespazze commented 5 years ago

Hi all!

This repo is really fast at reading the archives. I downloaded the first 100 WARC segments and was running your file cc-warc-examples/src/org/commoncrawl/examples/WARCReaderTest.java

Sadly it showed this error:

    at org.archive.util.zip.OpenJDK7InflaterInputStream.fill(OpenJDK7InflaterInputStream.java:244)
    at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:162)
    at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
    at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
    at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
    at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:516)
    at com.nonorank.language_detection.WARCReaderTest.main(WARCReaderTest.java:38)

The "issue" is that I tried it on 3 different machines: two with Ubuntu (14.04 and 18.10) and the third one with Fedora 28. Each of the three present the error in different stages: Ubuntu 14.04 shows it immediately at the first segment after few seconds. Ubuntu 18.10 shows it around half at 54th segment. Fedora 28 shows it randomly.

My question is: how can I fix it? And why is it not deterministic?

PS: I have noticed that sometimes this error is caused by truncated gzip files in the input. That is obviously a user error (my error), but it does happen at other times as well.
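For reference, here is a minimal sketch (not from this repo, just plain java.util.zip) of how such a truncated .warc.gz can be detected before processing: decompress the whole stream and treat an EOFException as a sign of truncation. The file path is a placeholder, and a file cut exactly at a gzip member boundary would still slip through.

```java
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class GzipTruncationCheck {

    /** Returns true if the gzip file decompresses to the end without error. */
    static boolean isComplete(String path) {
        byte[] buf = new byte[8192];
        try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(path))) {
            while (in.read(buf) != -1) {
                // discard the decompressed bytes; we only care about reaching EOF cleanly
            }
            return true;
        } catch (EOFException e) {
            // "Unexpected end of ZLIB input stream" ends up here for files cut mid-member
            return false;
        } catch (IOException e) {
            // not a gzip file, or other I/O problem
            return false;
        }
    }

    public static void main(String[] args) {
        // hypothetical local path to a downloaded segment
        String path = args.length > 0 ? args[0] : "segment.warc.gz";
        System.out.println(path + (isComplete(path) ? " looks complete" : " is truncated or corrupt"));
    }
}
```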

sebastian-nagel commented 5 years ago

Hi @mazzespazze,

Can you share the full stack trace, and also any modifications to the project (I guess there are some), including added or changed dependencies? WARCReaderTest is a class intended for testing, not for production, so no exceptions are caught. Please have a look at the other classes in the package "org.commoncrawl.examples.mapreduce", where exceptions are properly caught and logged.
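To illustrate that advice, here is a rough sketch (assuming the webarchive-commons reader API that WARCReaderTest uses, not the repo's actual MapReduce code) of catching exceptions per record so one broken record does not abort the whole run; the local file path is a placeholder:

```java
import java.io.File;
import java.io.IOException;

import org.apache.commons.io.IOUtils;
import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveRecord;
import org.archive.io.warc.WARCReaderFactory;

public class SafeWarcRead {
    public static void main(String[] args) throws IOException {
        // hypothetical path to a locally downloaded WARC segment
        File warc = new File(args.length > 0 ? args[0] : "segment.warc.gz");

        ArchiveReader reader = WARCReaderFactory.get(warc);
        try {
            for (ArchiveRecord record : reader) {
                try {
                    byte[] content = IOUtils.toByteArray(record);
                    System.out.println(record.getHeader().getUrl() + " : " + content.length + " bytes");
                } catch (IOException e) {
                    // a truncated gzip member surfaces here; log it and continue
                    // (depending on where the stream breaks, iteration may also end here)
                    System.err.println("Skipping broken record: " + e.getMessage());
                }
            }
        } finally {
            reader.close();
        }
    }
}
```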

I can only confirm that I use this project to process tens of thousands of WAT files (WARC files with JSON payload) every month on Hadoop, and I've never seen any such issues.

mazzespazze commented 5 years ago

I have just found the issue. Sometimes the files get truncated by wget when downloading them from Amazon (S3). So it was not Java's fault but mine, for not checking that the download had completed correctly.
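For completeness, one simple check (just a sketch using the standard java.net API; the URL and paths are placeholders, nothing here is from the repo) is to compare the local file size against the Content-Length reported by the server and re-download on a mismatch:

```java
import java.io.File;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class DownloadSizeCheck {

    /** Compares the local file size with the server-reported Content-Length. */
    static boolean sizeMatches(String url, File local) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        try {
            long expected = conn.getContentLengthLong();
            return expected >= 0 && expected == local.length();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // hypothetical URL of the downloaded segment; substitute the real one
        String url = "https://example.org/path/to/segment.warc.gz";
        File local = new File("segment.warc.gz");
        System.out.println(sizeMatches(url, local) ? "size matches" : "size mismatch: re-download");
    }
}
```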

sebastian-nagel commented 5 years ago

Thanks, @mazzespazze, for the clarification!