Closed adamyy closed 4 years ago
Merging #496 into main will increase coverage by 0.01%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## main #496 +/- ##
============================================
+ Coverage 88.85% 88.86% +0.01%
Complexity 57 57
============================================
Files 43 43
Lines 1014 1015 +1
Branches 86 86
============================================
+ Hits 901 902 +1
Misses 74 74
Partials 39 39
Mute the java.lang.NegativeArraySizeException issue thrown when the content length of a WARC record exceeds half of Integer.MAX_VALUE.
GitHub issue(s): #317, #494
What does this Pull Request do?
The current AUT ArchiveRecord implementation eagerly consumes the content of the WARC record into a byte array and a String object. The problem is that not all WARC records can fit inside a java.lang.String. TL;DR: attempting new String(byteArray) with a byteArray longer than half of Integer.MAX_VALUE causes a java.lang.NegativeArraySizeException (for OpenJDK 11, UTF-8 charset). The reason is that java.lang.String creates an internal byte array that is double the size of the argument, and in Java the maximum size of an array is Integer.MAX_VALUE, so the doubled length overflows to a negative int.
In RecordLoader.loadArchives, filter out WARCs whose content is longer than MAX_ALLOWABLE_WARC_CONTENT_LENGTH.
Set MAX_ALLOWABLE_WARC_CONTENT_LENGTH to Integer.MAX_VALUE >> 1.
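The arithmetic behind this threshold can be sketched as follows. This is a minimal illustration, not the AUT code: the class name WarcLengthCheck and the helpers doubledLength and fitsInString are hypothetical, and only the value Integer.MAX_VALUE >> 1 comes from the description above.

```java
public class WarcLengthCheck {
    // Value taken from the PR description; where AUT actually declares it is not shown here.
    static final int MAX_ALLOWABLE_WARC_CONTENT_LENGTH = Integer.MAX_VALUE >> 1;

    // Length of a buffer that needs two bytes per input byte,
    // computed with the same wrapping int arithmetic the JVM uses.
    static int doubledLength(int contentLength) {
        return contentLength << 1;
    }

    // Hypothetical predicate: keep a record only if doubling its length cannot overflow.
    static boolean fitsInString(long contentLength) {
        return contentLength <= MAX_ALLOWABLE_WARC_CONTENT_LENGTH;
    }

    public static void main(String[] args) {
        int atLimit = MAX_ALLOWABLE_WARC_CONTENT_LENGTH; // 1073741823
        int overLimit = atLimit + 1;                     // 1073741824

        System.out.println(doubledLength(atLimit));      // 2147483646, still a valid array size
        System.out.println(doubledLength(overLimit));    // -2147483648, a negative array size
        System.out.println(fitsInString(atLimit));       // true
        System.out.println(fitsInString(overLimit));     // false
    }
}
```

A record one byte over the limit already produces a negative doubled length, which is why the cutoff sits at exactly half of Integer.MAX_VALUE.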
How should this be tested?
ARCHIVEIT-10689-TEST-JOB727752-SEED1799564-20190110143759592-00000-h3.warc.gz: before this PR, any action invoked on this file results in a NegativeArraySizeException; with this PR, the large record is skipped.
Additional Notes:
As discussed in https://github.com/archivesunleashed/aut/issues/317#issuecomment-685799660, this PR merely mutes the issue with large WARCs, but it might still be reasonable for the users to access the content of a large WARC, perhaps in the form of InputStreams. This is already noted in #494.
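One way such stream-based access might look is sketched below. It is an illustration under stated assumptions, not an AUT API: the class StreamingContent and the method countBytes are hypothetical, and a ByteArrayInputStream stands in for a record's content stream. The point is only that reading through a fixed-size buffer never requires a single array as large as the record.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamingContent {
    // Reads the stream through a fixed 8 KiB buffer and returns the total number of bytes seen.
    // A long counter is used so the total can exceed Integer.MAX_VALUE.
    static long countBytes(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a record's content; a real record would supply its own stream.
        InputStream content = new ByteArrayInputStream(new byte[20000]);
        System.out.println(countBytes(content));
    }
}
```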
@ruebot @ianmilligan1 @lintool