Mute the java.lang.NegativeArraySizeException issue thrown when the content length of a WARC record exceeds half of Integer.MAX_VALUE.

GitHub issue(s): #317, #494

What does this Pull Request do?

The current AUT ArchiveRecord implementation eagerly reads the content of each WARC record into a byte array and a String object. The problem is that not every WARC record fits inside a java.lang.String. In short, calling new String(byteArray) on a byteArray longer than half of Integer.MAX_VALUE throws java.lang.NegativeArraySizeException (observed on OpenJDK 11 with the UTF-8 charset), because java.lang.String allocates an internal byte array that is double the size of the argument. Array lengths in Java are ints, so they cannot exceed Integer.MAX_VALUE, and the doubled length overflows to a negative value.
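The overflow described above can be reproduced without allocating gigabytes of memory. The following is a minimal sketch, not the JDK's actual code: `doubledLength` is a hypothetical helper that only mimics doubling a length in 32-bit int arithmetic.

```java
class DoublingOverflowDemo {
    // Hypothetical helper: doubles a length using int arithmetic,
    // the way an internal buffer twice the input size would be sized.
    static int doubledLength(int length) {
        return length << 1;
    }

    public static void main(String[] args) {
        int half = Integer.MAX_VALUE >> 1; // 1073741823

        // At exactly half of Integer.MAX_VALUE the doubled length still fits.
        System.out.println(doubledLength(half));     // 2147483646

        // One byte more and the doubled length wraps around to a negative int.
        System.out.println(doubledLength(half + 1)); // -2147483648

        try {
            // A negative size is rejected before any memory is allocated.
            byte[] buffer = new byte[doubledLength(half + 1)];
            System.out.println(buffer.length);
        } catch (NegativeArraySizeException e) {
            System.out.println("caught NegativeArraySizeException");
        }
    }
}
```

This is why the limit proposed in this PR sits at Integer.MAX_VALUE >> 1: it is the largest length whose double is still a valid array size.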

In RecordLoader.loadArchives, filter out WARC records whose content is longer than MAX_ALLOWABLE_WARC_CONTENT_LENGTH.
Set MAX_ALLOWABLE_WARC_CONTENT_LENGTH to Integer.MAX_VALUE >> 1.
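For illustration, the filtering idea looks roughly like the sketch below. It is written in plain Java and is not the code in this PR: `RecordInfo`, `isWithinLimit` and `keepLoadable` are hypothetical names standing in for the real record type and for the filter applied in RecordLoader.loadArchives. Only the constant and its value come from the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class WarcLengthFilterSketch {
    // Value from the PR description: half of Integer.MAX_VALUE (1073741823).
    static final long MAX_ALLOWABLE_WARC_CONTENT_LENGTH = Integer.MAX_VALUE >> 1;

    // Hypothetical stand-in for a record header; not AUT's ArchiveRecord.
    static final class RecordInfo {
        final String url;
        final long contentLength;

        RecordInfo(String url, long contentLength) {
            this.url = url;
            this.contentLength = contentLength;
        }
    }

    // A record is kept only if its content is not longer than the limit.
    static boolean isWithinLimit(long contentLength) {
        return contentLength <= MAX_ALLOWABLE_WARC_CONTENT_LENGTH;
    }

    // Drops oversized records before their content is ever materialized.
    static List<RecordInfo> keepLoadable(List<RecordInfo> records) {
        return records.stream()
                .filter(r -> isWithinLimit(r.contentLength))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<RecordInfo> records = Arrays.asList(
                new RecordInfo("http://example.com/small", 2048L),
                new RecordInfo("http://example.com/huge", 1_500_000_000L));

        // Only the small record survives the filter.
        for (RecordInfo r : keepLoadable(records)) {
            System.out.println(r.url);
        }
    }
}
```

The check uses the declared content length, so an oversized record can be skipped without reading its payload.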

How should this be tested?

An example of a WARC record that is too large can be found in ARCHIVEIT-10689-TEST-JOB727752-SEED1799564-20190110143759592-00000-h3.warc.gz. Before this PR, any action invoked on this file results in a NegativeArraySizeException; with this PR, the large record is skipped.
This PR should still pass all the current tests in CI

Additional Notes:

As discussed in https://github.com/archivesunleashed/aut/issues/317#issuecomment-685799660, this PR merely mutes the issue with large WARC records. It might still be reasonable for users to access the content of a large record, perhaps in the form of an InputStream. This is already noted in #494.
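To make the InputStream idea concrete, here is a minimal sketch of consuming content in fixed-size chunks so that no single array has to hold the whole payload. It is not part of this PR and not an AUT API: `countBytes` is a hypothetical helper, and a ByteArrayInputStream stands in for a real record's content stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class StreamingContentSketch {
    // Reads the stream through a small reusable buffer and returns the
    // total number of bytes seen. The total is a long, so it is not
    // bounded by the maximum array size.
    static long countBytes(InputStream in) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for the content stream of a record.
        InputStream content = new ByteArrayInputStream(new byte[20000]);
        System.out.println(countBytes(content)); // 20000
    }
}
```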

@ruebot @ianmilligan1 @lintool

Set the upper limit of WARC content length to half of Integer.MAX_VALUE #496
