Database v3 - Githubissues

NAICNO / Jobanalyzer

Easy to use resource usage report

MIT License


This is the successor to #379. Some things to consider:

[ ] We could build indices for some filtering criteria so that queries can be narrowed more quickly. Obvious candidates are from/to dates, host (node), user, and maybe command name. From/to and host already act as indices because of how the database is structured. We could expose other major indices to the stream selector in the same way, probably by rolling them all into a single structure with a little logic around it.
[ ] The current structure of the database is one file per day per node. With 2000 nodes, this yields 2000 files (inodes) per day. Suppose there is an inode limit of 1e6 per user: we would hit it in 500 days (1e6 / 2000). Could this be a problem?
[ ] In general, "older" data should be archived automatically. Whenever those data are required, which is almost never, the files could be read directly from the archive. The sensible thing would be to create one archive per month for data that are multiple months old. The archive should be in some standard (probably compressed) format that can be opened with standard Go APIs and also with standard command line tools. There would be a single archive for all files under that month: node csv data as well as sysinfo data. Archived folders must be treated as read-only, which slightly complicates the internal structure of the program.
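
To make the inode arithmetic and the monthly-archive idea concrete, here is a minimal sketch in Go. It assumes zip as the archive format, which is only one candidate for "some standard form": it is readable through the standard `archive/zip` package and by `unzip` on the command line. The function names and the in-archive paths such as `05/n001.csv` are hypothetical and do not reflect the actual Jobanalyzer layout.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// daysUntilInodeLimit returns how many days of one-file-per-node-per-day
// data fit under a per-user inode quota.
func daysUntilInodeLimit(nodes, inodeLimit int) int {
	return inodeLimit / nodes
}

// packMonth writes every file of one month into a single zip archive, so
// that the whole month costs one inode. Keys are hypothetical in-archive
// paths; values are the file contents.
func packMonth(files map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, data := range files {
		w, err := zw.Create(name) // Create compresses with Deflate
		if err != nil {
			return nil, err
		}
		if _, err := w.Write(data); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readArchived returns the contents of one member of the archive.
// The archive is only ever read here, never modified.
func readArchived(archive []byte, name string) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return nil, err
	}
	f, err := zr.Open(name)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return io.ReadAll(f)
}

func main() {
	fmt.Println(daysUntilInodeLimit(2000, 1_000_000)) // 500

	archive, err := packMonth(map[string][]byte{
		"05/n001.csv":     []byte("time,user,cpu\n"),
		"05/sysinfo-n001": []byte("cores=64\n"),
	})
	if err != nil {
		panic(err)
	}
	data, err := readArchived(archive, "05/n001.csv")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", data)
}
```

With one archive per month, a month of data for 2000 nodes would occupy a single inode instead of roughly 60000, at the cost of a second, read-only code path for opening files inside archives.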

A completely different take on this is that we should jettison the database component of Jobanalyzer and build a new one around a standard data warehouse engine, TBD. This would give us a lot more resilience and probably (on balance) reduce complexity.

There are issues with this move. Currently the analysis logic is based on stream-of-samples processing. There can be a very large volume of samples in a given time window, and I/O does not disappear as a problem just because we move to a database system; quite the contrary. To make use of a database system we'd probably want to preprocess data as they come in, partitioning them by job and node so that they can be accessed more easily for the tasks we need. At the same time, there may be utility in keeping the original data streams (or something like them), since we don't know all the uses for them yet. So we'd be storing more data, but more of it would be in a directly useful form, and hopefully the net performance gain would be significant. Some experimentation and discussion would be warranted. For example, combining the current database with an RDBMS for aggregated data (a job-centric view keyed by job, user, etc.) might be a sensible solution too.
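
As an illustration of the preprocessing step described above, the following Go sketch folds a stream of samples into one job-centric summary per job, the kind of aggregated row that could be stored in an RDBMS. The `Sample` and `JobSummary` types and all field names are hypothetical simplifications, not the actual Jobanalyzer record format, and the sketch assumes that a job ID alone identifies a job.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a hypothetical, simplified sample record.
type Sample struct {
	Timestamp int64 // seconds since the epoch
	Host      string
	User      string
	JobID     int
	CPUPct    float64
}

// JobSummary is a hypothetical aggregated row for one job.
type JobSummary struct {
	JobID       int
	User        string
	Hosts       []string // sorted, unique
	First, Last int64    // earliest and latest sample timestamps
	PeakCPUPct  float64
	Samples     int
}

// summarizeJobs partitions the samples by job ID and aggregates each
// partition. Samples may arrive in any order.
func summarizeJobs(samples []Sample) map[int]*JobSummary {
	jobs := make(map[int]*JobSummary)
	hostSets := make(map[int]map[string]bool)
	for _, s := range samples {
		j, ok := jobs[s.JobID]
		if !ok {
			j = &JobSummary{JobID: s.JobID, User: s.User, First: s.Timestamp, Last: s.Timestamp}
			jobs[s.JobID] = j
			hostSets[s.JobID] = make(map[string]bool)
		}
		if s.Timestamp < j.First {
			j.First = s.Timestamp
		}
		if s.Timestamp > j.Last {
			j.Last = s.Timestamp
		}
		if s.CPUPct > j.PeakCPUPct {
			j.PeakCPUPct = s.CPUPct
		}
		j.Samples++
		hostSets[s.JobID][s.Host] = true
	}
	// Turn each host set into a sorted slice for stable output.
	for id, set := range hostSets {
		for h := range set {
			jobs[id].Hosts = append(jobs[id].Hosts, h)
		}
		sort.Strings(jobs[id].Hosts)
	}
	return jobs
}

func main() {
	jobs := summarizeJobs([]Sample{
		{Timestamp: 100, Host: "n2", User: "alice", JobID: 7, CPUPct: 50},
		{Timestamp: 160, Host: "n1", User: "alice", JobID: 7, CPUPct: 90},
		{Timestamp: 130, Host: "n1", User: "bob", JobID: 8, CPUPct: 10},
	})
	for _, id := range []int{7, 8} {
		fmt.Printf("%+v\n", *jobs[id])
	}
}
```

A summary like this is small compared with the raw samples, so queries that only need per-job totals could avoid reading the sample streams at all, while the streams themselves stay available for uses not yet anticipated.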


Database v3 #517