Closed mazzespazze closed 5 years ago
Hi @mazzespazze,
Can you share the full stack trace, and also any modifications to the project (I guess there are some), including added or changed dependencies? WARCReaderTest is a class intended for testing, not for production, so no exceptions are caught. Please have a look at the other classes in the package "org.commoncrawl.examples.mapreduce", where exceptions are properly caught and logged.
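To illustrate the difference, here is a minimal, self-contained sketch (plain JDK, no Common Crawl or Hadoop dependencies; the class name and payload are made up) of catching and logging a truncated-gzip failure per archive instead of letting it crash the job, which is the pattern the production example classes follow:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TruncatedGzipDemo {
    public static void main(String[] args) throws IOException {
        // Build a valid gzip member in memory, then cut off its 8-byte
        // CRC32 + ISIZE trailer to simulate a truncated download.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("WARC/1.0 record payload".getBytes("UTF-8"));
        }
        byte[] whole = buf.toByteArray();
        byte[] truncated = Arrays.copyOf(whole, whole.length - 8);

        // Read the broken archive: log and skip instead of propagating.
        try (GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(truncated))) {
            byte[] tmp = new byte[4096];
            while (in.read(tmp) != -1) { /* consume */ }
            System.out.println("archive OK");
        } catch (EOFException e) {
            System.out.println("truncated archive skipped: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("corrupt archive skipped: " + e.getMessage());
        }
    }
}
```

In a real MapReduce job the `catch` blocks would increment a counter and log via the job's logger, so one bad segment does not kill the whole run.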
I can only confirm that I use this project to process tens of thousands of WAT files (WARC files with JSON payload) every month on Hadoop, and I've never seen any issues.
I have just found the issue. Sometimes the files get truncated during the wget download from Amazon. So it was not Java's fault but mine, for not checking that the download had completed correctly.
Thanks, @mazzespazze for the clarification!
Hi all!
This repo is really fast at reading the archives. I downloaded the first 100 WARC segments and ran your file cc-warc-examples/src/org/commoncrawl/examples/WARCReaderTest.java
Sadly it showed this error:
The "issue" is that I tried it on 3 different machines: two running Ubuntu (14.04 and 18.10) and a third running Fedora 28. Each of the three hits the error at a different stage: Ubuntu 14.04 shows it immediately, on the first segment, after a few seconds; Ubuntu 18.10 shows it about halfway through, on the 54th segment; Fedora 28 shows it at random points.
My question is: how can I fix it? And why is it not deterministic?
PS: I have noticed that sometimes this error is caused by truncated gzip files in the input. That is obviously a user error (my error), but it does happen at other times as well.