PDF conversion progress insight and logging

toncid commented 3 months ago

Hello, we are often getting various Grover or protocol timeout errors, which are not very helpful, as it is unclear at which stage the timeout happened.

I haven't found a way to enable logging in order to trace the progress of a PDF conversion. It would be good if Grover could take a logger and invoke it as the to_pdf call progresses:

Grover initialized
HTML URL/source received
Puppeteer started
Chrome v123.456 started
Page is loading
Page loaded (e.g. when DOMContentLoaded, load, networkidle0/2, etc. have fired)
Conversion started
Conversion ended
Capturing PDF

Hope the above paints the picture of what kind of insight is desirable.
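To make the request concrete, here is a minimal sketch of what such stage logging could look like on the Node side. It is not part of Grover or Puppeteer: `createStageLogger`, `stage`, and `lastStage` are hypothetical names, and the sink defaults to `console.error` purely for illustration.

```javascript
// Hypothetical helper; none of these names exist in Grover or Puppeteer.
// `sink` receives each log line, `now` returns the current time in ms.
function createStageLogger(sink = console.error, now = Date.now) {
  const startedAt = now();
  let lastStage = null;

  // Runs one named step and records when it starts, finishes, or fails.
  async function stage(name, fn) {
    lastStage = name;
    const stageStart = now();
    sink(`[+${stageStart - startedAt}ms] ${name}: started`);
    try {
      const result = await fn();
      sink(`[+${now() - startedAt}ms] ${name}: finished in ${now() - stageStart}ms`);
      return result;
    } catch (error) {
      // A timeout surfaces here, so the log names the stage that was running.
      sink(`[+${now() - startedAt}ms] ${name}: failed (${error.message})`);
      throw error;
    }
  }

  return { stage, lastStage: () => lastStage };
}
```

Around a Puppeteer call this might be used as `await log.stage('Page is loading', () => page.goto(url))`, so that a timeout is reported together with the stage it interrupted.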

If it is already possible, please share and we can work on updating the README.

abrom commented 3 months ago

Hi @toncid, have you read the section in the README on debugging?

It's not clear exactly what you mean by "invoking a logger", given that the props passed through to the NodeJS process are serialised. There could be an option for dumping progress out to a log file, but this starts to get messy given you could have multiple processes dumping entries at the same time. You'd need to have some unique log tagging key to go with it, or unique log files per invocation. Either way, going down the debugging route already laid out in the README would seem a better option all round.
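As an illustration of the "unique log tagging key" idea, the sketch below prefixes every entry with a per-invocation tag and shows how interleaved entries from a shared log could be separated again. Both function names and the line format are assumptions made for this example, not anything Grover provides.

```javascript
const { randomUUID } = require('node:crypto');

// Hypothetical helper: returns a formatter that prefixes each entry with a
// timestamp and a per-invocation tag (a random UUID unless one is supplied).
function createTaggedFormatter(tag = randomUUID()) {
  return (message, date = new Date()) =>
    `${date.toISOString()} [${tag}] ${message}\n`;
}

// Splits the text of a shared log back into per-tag message lists.
function groupByTag(logText) {
  const groups = new Map();
  for (const line of logText.split('\n')) {
    const match = line.match(/^\S+ \[([^\]]+)\] (.*)$/);
    if (!match) continue; // skip blank or foreign lines
    const [, tag, message] = match;
    if (!groups.has(tag)) groups.set(tag, []);
    groups.get(tag).push(message);
  }
  return groups;
}
```

Writing one file per invocation avoids the need for `groupByTag` altogether, at the cost of having more files to clean up.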

toncid commented 2 months ago

Hello @abrom, thank you for your response. I was thinking of logging steps from the Grover side, around the actual invocation of Puppeteer, but I assume there isn't much to log.

However, in production systems, there can be multiple workers running Grover and Puppeteer, so it is practically impossible to get live debugging when needed.

Do you know any options to gather console and telemetry output from such setups? I wasn't able to find any way to do it (e.g. setting dumpio doesn't seem to show any output in server logs).

abrom commented 1 month ago

Hmm, good question. Because Grover is already using the stdout/stderr channels for result/error comms, it'd likely need to be passed some other IO (i.e. a file path) that it can be told to log to. Then it'd be pretty trivial to have debug information piped there. If the calling service was controlling the log path as an option, it'd also make it easier to "manage" multiple concurrent invocations by just giving each a different path. Some log management/cleanup would likely be prudent!

BUT, given that this could be an exceptionally destructive action (e.g. someone runs the process as a super user, then a bad actor passes through the "log file" option as some important system file), any log file option would need to be excluded from anything that could be configured via a request. That shouldn't be a big deal, but it's not to be dealt with lightly.
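On top of keeping the option out of request-configurable settings, the calling service could confine log files to one directory. The sketch below is one way to do that, assuming a dedicated log directory; `resolveLogPath` is a hypothetical name, and the check is purely lexical, so it does not protect against symlinks inside that directory.

```javascript
const path = require('node:path');

// Hypothetical guard: resolves `fileName` inside `baseDir` and throws if the
// result would land outside that directory (or on the directory itself).
function resolveLogPath(baseDir, fileName) {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, fileName);
  const relative = path.relative(base, target);

  const escapes =
    relative === '' ||
    relative === '..' ||
    relative.startsWith(`..${path.sep}`) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows

  if (escapes) {
    throw new Error(`Refusing to log outside ${base}: ${fileName}`);
  }
  return target;
}
```

With this in place, a value such as `../../etc/passwd` or an absolute path is rejected before anything is written.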

In the processor JS it'd be something like:

const fs = require('node:fs/promises');
....

// Pull the proposed option out so it is not passed on with the other options
const debugLogFile = options.debugLogFile;
delete options.debugLogFile;
....

// Only write when a log file was supplied
if (debugLogFile) await fs.appendFile(debugLogFile, '... the log message ...');
... etc
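A self-contained version of the idea above might look like the following. It is only a sketch: `debugLogFile` is the option proposed in this thread rather than an existing Grover option, and `takeDebugLogger` is a hypothetical helper name.

```javascript
const fs = require('node:fs/promises');

// Removes the proposed `debugLogFile` option from `options` and returns an
// async logger bound to that file.
function takeDebugLogger(options) {
  const { debugLogFile } = options;
  delete options.debugLogFile; // keep it out of the remaining options

  // No path given: logging becomes a no-op.
  if (!debugLogFile) return async () => {};

  // Each call appends one timestamped line to the file.
  return async (message) => {
    await fs.appendFile(debugLogFile, `${new Date().toISOString()} ${message}\n`);
  };
}
```

The processor could then call `const log = takeDebugLogger(options);` once and `await log('Page is loading');` at each step, without checking for the option every time.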

Then in Ruby land you'd use it as such (or similar):

Grover.new(<content>, debug_log_file: File.join('/tmp', "#{request[:uuid]}.log")).to_pdf

Studiosity / grover

PDF conversion progress insight and logging #235