cylc / cylc-uiserver

A Jupyter Server extension that serves the cylc-ui web application for monitoring and controlling Cylc workflows.
https://cylc.org
GNU General Public License v3.0

cat-log: support large files #421

Open · oliver-sanders opened this issue 1 year ago

oliver-sanders commented 1 year ago

Currently we cap the log at 5000 lines.

This is to protect the browser: the GUI can have multiple views (including logs) open simultaneously. Long log files also cause issues for Cylc Review, which hits the same problem. Past a hard-coded limit, Cylc Review will only offer a "raw" file view, which exists outside of the regular UI and scales better.

Two options for the Cylc log view:

  1. Paginate the log files.
    • We can load the logs in X-line chunks.
    • E.g. to start reading from line 5000 we could do tail -n +5000.
    • This would involve adding an "offset" argument to the log subscription (see the sketch just after this list).
    • Ideally we would be able to query the number of lines in the file so the GUI could display this; however, I'm not sure we can do that without reading the file?
  2. Offer the ability to open long log files in another browser tab, à la Cylc Review.
    • This would mean building a light-weight web app to display the log lines in a simple <code> block.
    • This app would need to perform the same authentication as the regular UI (a couple of lines of code which bung the token into a cookie).
    • The problem would be easier if we added a REST endpoint for accessing the job logs just for this view (no need for a GraphQL client in the new app).
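
For option 1, here's a minimal sketch of what the server side could look like (the function name and signature are illustrative, not an existing UIS API). It streams to the requested window, so memory stays flat however large the file is, and it answers the line-count question the only way available: by reading through the file once, wc -l style, without holding it in memory.

def read_log_lines(path, offset=1, limit=5000):
    """Return up to `limit` lines starting at 1-based line `offset`
    (the moral equivalent of `tail -n +offset | head -n limit`),
    plus the total line count, found by streaming to the end."""
    lines = []
    total = 0
    with open(path, "r", errors="replace") as f:
        for total, line in enumerate(f, 1):
            if offset <= total < offset + limit:
                lines.append(line)
    return lines, total

Returning the total means reading to the end of the file on every call, so for a first cut the count could be made optional or cached.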

Pull requests welcome!

ejhyer commented 1 year ago

I'm glad this is on your radar. I'd be happy with either solution proposed above, though I suppose slightly happier with option 2 (opening the entire log in a separate tab).

oliver-sanders commented 1 year ago

Definitely on our radar, and we would like to get onto this one soon. Option 2 is possible, and simple on the surface; just to explain why we haven't done it yet...

With a REST approach, the server would need to read in the entire log file before it could send it to the client. This means that if the log file is 2GB (sadly not unusual), then the server will require [at least] 2GB of memory to process the request. If there isn't 2GB available, the server will crash.

With a server running stand-alone on someone's machine, this may be ok, but with cylc hub, all of these servers will be running on shared instances where there is resource contention with other servers. So there might not be enough RAM headroom to store the file in memory, even briefly. These servers are also multi-user so we have to handle the possibility of a large number of users requesting files simultaneously.
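
For completeness: a streaming response can bound the per-request memory even over REST, by reading and flushing the file in fixed-size chunks so that a 2GB log never needs 2GB of RAM at once. A rough sketch, assuming a plain Tornado handler (Jupyter Server is Tornado-based); the handler name, route, and query parameter are hypothetical, and a real version would need the same authentication and path validation as everything else:

from tornado import web

class RawLogHandler(web.RequestHandler):
    # Hypothetical endpoint: stream a log file back in 64KB chunks.
    async def get(self):
        path = self.get_argument("path")  # must be validated in real life
        self.set_header("Content-Type", "text/plain; charset=utf-8")
        with open(path, "rb") as f:
            while True:
                chunk = f.read(64 * 1024)
                if not chunk:
                    break
                self.write(chunk)
                await self.flush()  # push to the client, drop the buffer

It doesn't make the concurrency problem go away, though: many users streaming multi-GB files at once still contend for bandwidth and file handles on the shared instance.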

ejhyer commented 1 year ago

I appreciate the explanation; that is a difficult problem, especially thinking ahead to cylc hub. It sounds like the solution will include some component of ‘best practices’, e.g. “If you want your logs to be fully viewable in the Cylc UI, keep them under N lines, and direct large log outputs elsewhere (so that they don’t pass through the cylc UI or cylc hub).” I’m fine with that; making the best use of Cylc generally demands at least some refactoring, in my experience.

oliver-sanders commented 1 year ago

Terminology:

"File Viewer": A text editor which opens files in read only mode.

It may be that the best option here is to use a virtual scrolling approach.

This approach allows us to only load in the bits of the file that the user is actively looking at, and unload bits of the file they are no longer looking at to avoid memory issues in the browser. Because the file is loaded in chunks, the UIS shouldn't run out of memory loading any single chunk.
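
One wrinkle: chunked reads implemented as tail -n +N re-scan the file from the top on every request, which gets slow towards the end of a multi-GB log. A hypothetical way around that is to index the byte offset at which each chunk starts, once, and then seek() straight to any chunk (all names here are illustrative):

CHUNK = 200  # lines per chunk; purely illustrative

def build_chunk_index(path):
    """Record the byte offset at which each CHUNK-line chunk starts."""
    offsets = [0]
    with open(path, "rb") as f:
        for lineno, _ in enumerate(f, 1):
            if lineno % CHUNK == 0:
                offsets.append(f.tell())
    return offsets

def read_chunk(path, offsets, n):
    """Return the lines of chunk `n` without re-reading from line 1."""
    lines = []
    with open(path, "rb") as f:
        f.seek(offsets[n])
        for _ in range(CHUNK):
            line = f.readline()
            if not line:
                break  # short final chunk
            lines.append(line.decode("utf-8", errors="replace"))
    return lines

The index is tiny (one integer per chunk) and could be built lazily, or invalidated from the tail onwards, for logs that are still being written.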

The reason I've not suggested this before is that it's complex and makes a bit of a mess of some of our other plans. There is a collection of functionalities that users have requested, including line numbers, syntax highlighting and word wrapping (as well as a bunch of others that users will likely request).

We were hoping to use an off-the-shelf file viewer to display the log file and solve these issues in one shot.

Unfortunately, off-the-shelf file viewers are not necessarily compatible with virtual scrolling and most lexers (used for syntax highlighting) are not compatible with a file being loaded in chunks. So this approach will put a burden on us to develop our own file viewer and implement/integrate all of these features ourselves. This is something we have been trying to avoid doing.

We should have a poke around to see whether there are any projects we can benefit from. When you see virtual scrolling text files in the real world they appear bespoke, so I doubt there are any off-the-shelf options, but it's worth a look.

Assuming there are no off-the-shelf options, we would need to implement our own file viewer, but hopefully we could use an off-the-shelf virtual scroller, e.g.:

https://vue-virtual-scroller-demo.netlify.app/

ejhyer commented 1 year ago

Obviously you've thought about this a lot, and I don't think my experience is representative enough one way or the other to say what's best. BUT I'll throw in my thoughts anyway: my feeling is that those additional features you cite (line numbers, syntax highlighting, word wrapping) are things I would rather have than long-log support. In other words, if I had to choose between A) the existing log viewer supporting oversize logs, and B) a log viewer that provided word wrapping and highlighting for reasonably-sized logs but just gave a "log too long; please view offline at /crazy/long/path/to/log/files/" message when the log was too long, I would choose option B.

ColemanTom commented 2 weeks ago

As an interim option, would it be possible to at least choose to show the first or last 5000 lines? At the moment it shows the first 5000 lines, but many times you want to view the end rather than the start (and sometimes you want to be able to view both). Or allow people to put in a line range to load for large files?

hjoliver commented 2 weeks ago

Hmm, we could use sed on the back end, instead of cat, to provide a requested range of lines:

$ sed -n '3000,4000p' FILE

Weirdly, this seems a lot faster:

$ tail -n +3000 FILE | head -n 1001
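
Not so weird on reflection: sed keeps scanning to the end of the file after printing the range, whereas head exits after 1001 lines and takes the whole pipeline down with it. Telling sed to quit at the last line of the range, e.g. sed -n '3000,4000p;4000q' FILE (works in GNU sed at least), should close most of the gap.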