Gather stats - Githubissues

We've had the idea to gather solr search stats for a long time. Once in a while resource utilisation stats came up as well.

I just wanted to start a discussion on how such statistics could be gathered and evaluated.

Since the new databrowser-api already implements saving search queries into a MongoDB I would suggest we use a similar approach for all other statistics. Yet there are a couple of questions:

Should the statistics gather tool be a service on its own, that is seperated from the databrowser-api, or shall we just leave it a part of the databrowser-api? The latter would be simpler but less clear. I'd prefer a clear separation.
How should we save statistics? In the current approach the client would have to make a mongoDB connection to store the data. That means that a) any client software needs mongoDB software as a dependency, b) passwords usernames etc needed to be "communicated" to the client. We could bypass b) by making the mongoDB world writable but not readable (I don't know whether this is possible).

My answer to those questions would be setting up a dedicated statistics service with a dedicated simple RESTapi. Clients would only have to make requests to store data (without authentication). This would also allow for something like jsonSchema validation. Similar the statistics could be gathered after admin username and password are provided. This would make sure that only privileged people would have read access to the statistics.

Another question is what data do we want to store. Could I kindly ask to gather things, aside form databrowser query statistics. My suggestion would be:

Plugins: plugin config, avg. CPU usage, max CPU usage, avg. mem usage, max mem usage, avg. network load, max network load, virtual mem and CPU frequency available on the machine, hashed username, runtime, plugin status (success, failed, etc)
Databrowser: search queries.

Anything else that is missing?

The search queries are taken care of by the databrowser-api, hence straight forward since already implemented.

The plugin stats is a little more complicated. I think we would have to implement a daemon that uses psutils to gather those stats and adds them to the statistics service. I think we would need some sort of a thread (ideally async) with a start stop pattern that frequently gathers data. Async because we want the thread not only gather the data but to also add it to the DB, because if jobs get killed we will have at least some data in the DB. Ideally this daemon should run in a subprocess but I think then communication back and forward with the parent process gets tricky, therefore a thread might be the only thing that is left? Unless we make the whole plugin manager async 😝.

@eelucio @ckadow @eplesiat Input on the statistics would be good? Also if there would be need for any GPU stats and if so what type.

@Karinon any thoughts on the design?

General Personally I am still not the biggest fan of yet another database. If we still end up with it I would strongly prefer that it runs in its own container and that all programs that write to it can live without it, therefore a REST-API doesn't sound too bad. I wouldn't think too much about the security implications as I would assume that the machine runs inside a firewall, but a secured read and an open write sound fine to me, as well.

I have two more questions:

How (and where) do you plan to visualize the stats?
What is our retention policy? Personally I don't think it would be smart to store the data forever and I think it would be good if we would create reports, like once half a year and then clear the database.

Optional service Regarding living without the service: I would strongly prefer that we have the possibility to "opt-out" of it during the deployment (like: don't provide a url to the service -> No stats AND No pestering in the logs that it can't write the stats) and that we don't force this service to the Outside-Of-DKRZ-Installations of Freva.

Plugin stats Regarding plugin-stats I have no opinion as I also have no idea how to gather those information. We should be a bit cautious that those stats and the stats-calculating don't get too large and we should measure if our stat-measures slow down the actual plugin-run before we put it into production...

It might also be interesting to ask the Systems department if there is already something available

Alternatives One alternative approach which does not include reinventing the wheel would be to store all the data inside an Elastic-Stack. Don't get me wrong, this approach would have issues on it's own which we need to discuss and I would only prefer it if we have a running administered instance here at DKRZ (which I think we have, but I don't know where, this is something to find out).

I would like to discuss the alternative in a meeting when you are back.

Thanks for the input.

I wouldn't be too worried about resource usage when collecting resource usage. I'll will see how much resources this takes.

About the optionality, ok agreed it should be straight forward to do that.

The systems group uses slurm for statistics which is fine because DKRZ as an institution decided to use slurm. I wouldn't want to rely on slurm as we have made the decision to keep the workload manager integration as general as possible. For example the DWD (I will get in touch with them in March) doesn't use slurm but PBS.

About yet another database type. Once we have a stats instance with a mildly functioning rest API up and running we could think of transitioning the other parts that uses db storage towards that storage interface. I am kind of liking mongoDB or any other no SQL approach because it allows being more flexible with what kind of data we store. And since changes in data structure happen quite often mongo offers the flexibility to cater for those changes without headaches.

I think that the stats you propose are basically all (and more) that I had in mind.

maybe add 3 more:

Plugins: plugin config, avg. CPU usage, max CPU usage, avg. mem usage, max mem usage, avg. network load, max network load, virtual mem and CPU frequency available on the machine, hashed username, runtime, plugin status (success, failed, etc), output size (e.g. in GB), databrowser query (if it was), is_interactive_job.

data queries: looking at the docker solr server logs we realised that there is no way to differenciate between mere data browsing and that done while launching the plugin. Worst case scenario you could always run the numbers to tell them appart if we know the total queries and suppose that 1 plugin launch <= 1 data search. But for what we saw in the logger the way solr stores the queries is everytime a facet is searched or so (specially with the web, apparently is very linked to the clicks) so it might not be as clear.
output size: having accessible either at the share/slurm/ or better in a databse would help better estimating the average data production per analysis also for a faster audit of the project without the need of du -sh the user folders, which is not always possible.
is_interactive_job: I was )looking through the code but I do not have clear anymore if we can know wether the job was run in a login node or via workloader. This might also be helpful. For now, we can do it indirectly looking at the output log file suffix (I am unable to connect to the DB right now to check)

regarding on how to implement that:

probably make it optional would be the best as you both agree because our interests do not necessarily align with other isntances installed elsewhere.
unify all in mongoDB. I have no opinion in that as I have never worked with mongo. As long as it does not take much more space in size than mysql/mariadb, is comparatibely fast and there is the possibility to port it from our old history databases then, okey. -> If I understand correctly we would still ahve 1 solr server with a solr database for our datafiles and 1 mongoDB with history databse + statistics (optional)
how to effectively run the statiscits along with the plugins, I have no solution. One question: the deamon you were talking about @antarcticrainforest , would it be triggered from the plugin_manager at the time the plugin starts? What do you mean with async in this case?

regarding on where to implement it. For the moment we could patch the code writting some of the Plugin related statistics into the logfile that ends up in the path/to/shared/slurm/<plugin> folder. I guess that at the end we will only need to swap the destination of that info afterwards.

Very temporarily I added some junk code to the wrapper of a plugin to gather some data (runtime, storage usage, files produced) printing it to the logfiles e.g. here. I used psutil for the CPU/Memory usage but I am afraid I do not know how to make it work properly there.

I was looking how to do something similar in the freva core code but I don't find where. I have some doubts where, how to put it.

For what I see in the Plugin Manager whe have the call to:

        result: Optional[utils.metadict] = p._run_tool(
            config_dict=complete_conf,
            unique_output=unique_output,
            out_file=out_file,
            rowid=rowid,
        )

which in turn calls _run_tool

   def _run_tool(
        self,
        config_dict: Optional[ConfigDictType] = None,
        unique_output: bool = True,
        out_file: Optional[Path] = None,
        rowid: Optional[int] = None,
    ) -> Optional[Any]:
        config_dict = self._append_unique_id(config_dict, unique_output)
        if out_file is None:
            is_interactive_job = True
        else:
            is_interactive_job = False
            self._plugin_out = out_file
        for key in config.exclude:
            config_dict.pop(key, "")
        with self._set_environment(rowid, is_interactive_job):
            try:
                result = self.run_tool(config_dict=config_dict)
            except NotImplementedError:
                result = deprecated_method("PluginAbstract", "run_tool")(self.runTool)(config_dict=config_dict)  # type: ignore
            return result

but the run_tool method that is run there is empty as far as I understand. And I do not get very well where in _set_environment could we add some info.

FREVA-CLINT / freva

Gather stats #189