archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Set the upper limit of WARC content length to half of Integer.MAX_VALUE #496

Closed adamyy closed 4 years ago

adamyy commented 4 years ago

Mute the java.lang.NegativeArraySizeException that is thrown when the content length of a WARC record exceeds half of Integer.MAX_VALUE.

GitHub issue(s): #317, #494

What does this Pull Request do?

The current AUT ArchiveRecord implementation eagerly reads the content of each WARC record into a byte array and a String object. The problem is that not every WARC record fits inside a java.lang.String.

TL;DR: calling new String(byteArray) on a byteArray longer than half of Integer.MAX_VALUE throws java.lang.NegativeArraySizeException (observed on OpenJDK 11 with the UTF-8 charset). The reason is that java.lang.String allocates an internal byte array twice the size of its argument, while a Java array cannot hold more than Integer.MAX_VALUE elements, so the doubled length overflows to a negative int.
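The overflow and the guard described above can be sketched in a few lines of Java. This is only an illustration, not the code in this PR: the class `ContentLengthGuard` and its methods are hypothetical names, the doubling is modeled as `len << 1` based on the description above, and returning an empty string for oversized records is an assumed policy.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of AUT.
final class ContentLengthGuard {
    // Upper limit from the PR title: half of Integer.MAX_VALUE.
    static final long MAX_CONTENT_LENGTH = Integer.MAX_VALUE / 2;

    // Models the doubled internal buffer size as int arithmetic (assumption).
    // For any len above MAX_CONTENT_LENGTH the result wraps to a negative int.
    static int doubledLength(int len) {
        return len << 1;
    }

    // True when a record of this declared length can be decoded safely.
    static boolean fitsInString(long contentLength) {
        return contentLength >= 0 && contentLength <= MAX_CONTENT_LENGTH;
    }

    // Decodes the payload, or returns "" for oversized records
    // (assumed policy; the real PR may handle this differently).
    static String decodeOrEmpty(byte[] content, long declaredLength) {
        if (!fitsInString(declaredLength)) {
            return "";
        }
        return new String(content, StandardCharsets.UTF_8);
    }
}
```

Checking the declared length before decoding keeps the guard cheap: the oversized payload never has to be turned into a String at all.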

How should this be tested?

Additional Notes:

As discussed in https://github.com/archivesunleashed/aut/issues/317#issuecomment-685799660, this PR merely mutes the issue with large WARC records. Users may still reasonably want to access the content of a large record, perhaps as an InputStream. This is already noted in #494.
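To illustrate the InputStream idea, here is a minimal sketch of processing a payload in fixed-size chunks so that it never has to fit in a single array or String. The class `StreamingContent` and the method `countBytes` are hypothetical and do not exist in AUT; counting bytes stands in for whatever per-chunk work a caller would actually do.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of AUT.
final class StreamingContent {
    // Reads the stream through an 8 KiB buffer and returns the total byte count.
    // Memory use stays constant regardless of the payload size.
    static long countBytes(InputStream in) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n; // a real consumer would process buffer[0..n) here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

For example, `StreamingContent.countBytes(new java.io.ByteArrayInputStream(new byte[20000]))` walks the stream in three reads. The total is tracked as a long, so it is not bound by the int array-size limit.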

@ruebot @ianmilligan1 @lintool

codecov[bot] commented 4 years ago

Codecov Report

Merging #496 into main will increase coverage by 0.01%. The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main     #496      +/-   ##
============================================
+ Coverage     88.85%   88.86%   +0.01%     
  Complexity       57       57              
============================================
  Files            43       43              
  Lines          1014     1015       +1     
  Branches         86       86              
============================================
+ Hits            901      902       +1     
  Misses           74       74              
  Partials         39       39