bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Split logging between stderr and stdout #37

Open nvanva opened 1 year ago

nvanva commented 1 year ago

Currently all logs are written to stderr. This makes it difficult to find error messages in those logs. Writing only error messages to stderr and the rest to stdout will make it simpler to investigate if there were any errors, especially when running many warc2text processes in parallel and redirecting stdout and stderr to different files.
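
As an illustration of the split being asked for, here is a minimal sketch using Python's standard `logging` module. It is not warc2text code (warc2text is not written in Python), and `make_split_logger` is a hypothetical helper name; it only shows the idea of routing error-level messages to stderr and everything else to stdout.

```python
import logging
import sys


def make_split_logger(name, out=None, err=None):
    """Return a logger that sends ERROR and above to `err`, the rest to `out`.

    Hypothetical helper for illustration; defaults to sys.stdout / sys.stderr.
    """
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False   # avoid duplicate output via the root logger
    logger.handlers.clear()    # make repeated calls idempotent

    # Handler for everything below ERROR (debug, info, warning).
    low = logging.StreamHandler(out)
    low.addFilter(lambda record: record.levelno < logging.ERROR)

    # Handler for ERROR and CRITICAL only.
    high = logging.StreamHandler(err)
    high.setLevel(logging.ERROR)

    for handler in (low, high):
        handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

With a setup like this, redirecting the two streams to different files (`> run.log 2> run.err`) leaves only the error messages in the second file.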

jelmervdl commented 1 year ago

I've been using stdout for #34 and am in the camp of "stdout is for output, not UI" so that I can pipe things together. Using stdout for messages to the user would make that impossible.

If you want to split verbose logging from error messages, I propose to use a command line option, e.g. --log-file, that writes the verbose messages (a record was filtered due to url filter, that kind of stuff) to a separate file. This would also make it optional so it doesn't need any changes in bitextor. Edit: and if you really want the log messages to go to stdout, you can use --log-file=/dev/stdout or --log-file=-.
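
A rough sketch of how such an option could behave, written in Python for brevity. Note that `--log-file` is only the proposal from this comment, not an existing warc2text flag, and `open_log`, `parse_args` and `log_verbose` are hypothetical names. The sketch assumes verbose messages are dropped when the option is absent and that `-` means stdout.

```python
import argparse
import sys


def open_log(path):
    """Map the proposed --log-file value to a writable stream (or None)."""
    if path is None:
        return None        # option not given: verbose messages are dropped
    if path == "-":
        return sys.stdout  # assumed convention: "-" means stdout
    return open(path, "a", encoding="utf-8")


def parse_args(argv):
    # "warc2text-sketch" is a placeholder program name, not the real CLI.
    parser = argparse.ArgumentParser(prog="warc2text-sketch")
    parser.add_argument("--log-file", default=None,
                        help="write verbose messages to this file ('-' for stdout)")
    return parser.parse_args(argv)


def log_verbose(stream, message):
    """Write a verbose message only if a log stream was configured."""
    if stream is not None:
        print(message, file=stream)
```

Because the option defaults to off, a caller that never passes it sees no change in behaviour, which is the point about not needing changes in bitextor.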

That being said, the only error message that doesn't terminate warc2text is the one printed when a warc archive contains broken gzip records (which could indicate file corruption). Every other error is the last message printed before warc2text dies with a non-zero exit code, which seems pretty reasonable to me.

A different annoyance I've had: if you're running multiple warc2text processes through parallel, warc2text does not prefix the logging messages with the name (and offset maybe?) of the warc that the message is about. Right now you need to collect all messages from a single warc2text process in order, and then go through them from top to bottom to figure out which warc is the source of any given message. Running warc2text with just a single warc archive and letting parallel do the log grouping is also not an option, since then you can't easily combine the output of multiple warcs and you end up with many more files on disk.
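
To make the prefixing idea concrete, here is a small sketch in Python. The `name:offset: message` layout and the `log_for_record` helper are assumptions chosen for illustration; the thread does not settle on a format, and this is not warc2text's current output.

```python
import sys


def log_for_record(warc_name, offset, message, stream=None):
    """Write one log line prefixed with the source warc and record offset.

    The "<warc>:<offset>: <message>" layout is an assumed format.
    """
    stream = sys.stderr if stream is None else stream
    line = f"{warc_name}:{offset}: {message}"
    print(line, file=stream)
    return line
```

With a prefix like this, interleaved messages from several processes could be attributed to an archive by filtering on the start of each line, without reading a whole log in order.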