commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

Reduce log level of two classes called by the WAT/WET extractor to avoid flooding log files with multiple log messages per WARC record #33

Closed sebastian-nagel closed 8 months ago

sebastian-nagel commented 9 months ago

The logging of the classes ExtractingResourceProducer and GZIPMemberSeries is very verbose and produces multiple log messages per transformed WARC record:

Oct 07, 2023 5:41:49 PM org.archive.format.gzip.GZIPMemberSeries returnBytes
INFO: Returned (3165)bytes
Oct 07, 2023 5:41:49 PM org.archive.format.gzip.GZIPMemberSeries read
INFO: read(8 bytes) bufferSize(3165)
Oct 07, 2023 5:41:49 PM org.archive.format.gzip.GZIPMemberSeries getNextMember
INFO: getNextMember
Oct 07, 2023 5:41:49 PM org.archive.format.gzip.GZIPMemberSeries read
INFO: read(3 bytes) bufferSize(3157)
Oct 07, 2023 5:41:49 PM org.archive.format.gzip.GZIPMemberSeries getNextMember
INFO: AlignedResult:0
Oct 07, 2023 5:41:49 PM org.archive.format.gzip.GZIPMemberSeries read
INFO: read(7 bytes) bufferSize(3154)
Oct 07, 2023 5:41:49 PM org.archive.format.gzip.GZIPMemberSeries getNextMember
INFO: Read next GZip header...
Oct 07, 2023 5:41:49 PM org.archive.extract.ExtractingResourceProducer getNext
INFO: Extracting (class org.archive.resource.warc.WARCResource) with (class org.archive.resource.http.HTTPResponseResourceFactory)

These messages generate 40+ MB of log output per WARC file (about 1 GiB in size). To keep the log files from being flooded, this PR changes the log level of these messages from INFO to FINE. The level of messages that might point to the cause of an error is left as is.
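The effect of moving a message from INFO to FINE can be illustrated with a minimal java.util.logging sketch. It is not the code of this PR: the class LogLevelSketch, the logger name prefix "sketch.GZIPMemberSeries" and the warning text are hypothetical, and only the two FINE message strings are copied from the excerpt above. A logger set to INFO (the usual default) drops the FINE messages, while a logger set to FINE still lets them through for debugging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LogLevelSketch {

    /** Collects published records in memory so the effect of the level is visible. */
    static final class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    /**
     * Emits two per-record messages at FINE and one (hypothetical) WARNING,
     * and returns the messages that pass a logger set to the given level.
     */
    static List<String> emit(String levelName) {
        // Hypothetical logger name, made unique per level for this sketch.
        Logger log = Logger.getLogger("sketch.GZIPMemberSeries." + levelName);
        log.setUseParentHandlers(false); // keep the console quiet
        log.setLevel(Level.parse(levelName));

        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.ALL); // let the logger level decide
        log.addHandler(handler);

        // Per-record chatter: logged at FINE instead of INFO.
        log.fine("getNextMember");
        log.fine("read(8 bytes) bufferSize(3165)");
        // A message that may indicate a problem keeps its higher level.
        log.warning("hypothetical problem while reading a gzip member");

        log.removeHandler(handler);
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println("logger at INFO: " + emit("INFO"));
        System.out.println("logger at FINE: " + emit("FINE"));
    }
}
```

With the logger at INFO only the warning is collected; with the logger at FINE all three messages are. In a real deployment the FINE output could presumably be re-enabled through a java.util.logging configuration file (passed via the java.util.logging.config.file system property) that sets both the logger level and the handler level to FINE. The exact logger names used by these classes are not shown in the excerpt.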