ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

Ingesting data into Splunk #706

Closed: miquelduranfrigola closed this issue 9 months ago

miquelduranfrigola commented 1 year ago

Background

We have set up a physical Splunk server with the goal of keeping track of all model precalculations. This is a critical step as we scale up the Ersilia Model Hub. For now, the server is not publicly accessible. Please reach out to @miquelduranfrigola if you want to know more.

Ersilia Model TA App

The Ersilia Model TA App is the main monitoring app for model precalculations in Ersilia. The app ingests data from a remote device. A receiving port (9997) was opened on the all-in-one instance.

Remote device with Splunk Universal Forwarder

Splunk Universal Forwarder is a lightweight version of Splunk used to send data. The following two apps should be placed on every machine running a Splunk Universal Forwarder. All model outputs should go to the same folder structure on each machine to avoid having to modify the inputs app.

An example folder containing this data is available here. Data in this folder should be placed in /var/log/ersilia_data.
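For reference, below is a minimal sketch of what the Universal Forwarder configuration on such a machine could look like. The indexer host, index name, and sourcetype are placeholders, not the values from the actual deployment.

```ini
# outputs.conf (sketch): forward events to the all-in-one instance on port 9997
[tcpout]
defaultGroup = ersilia_indexers

[tcpout:ersilia_indexers]
# Placeholder host; the real all-in-one instance is not publicly accessible
server = splunk.example.org:9997

# inputs.conf (sketch): monitor the shared folder structure
[monitor:///var/log/ersilia_data]
disabled = false
# Assumed index and sourcetype names
index = ersilia
sourcetype = ersilia_run
```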

At the moment, this is installed in a local computer. The goal of this issue is to migrate the ersilia_data folder to the cloud.

Ersilia CLI logger

We have written a RunLogger class that creates an ersilia_runs folder with the data in the format required by Splunk.

In practice, the ersilia_runs folder has exactly the same structure as the ersilia_data folder.
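For illustration only, here is a minimal sketch of what such a run logger could look like. The actual RunLogger in the Ersilia CLI may differ; the folder layout, field names, and model identifier below are assumptions.

```python
# Hypothetical sketch; not the actual RunLogger implementation in the Ersilia CLI.
import json
import os
import time
import uuid


class RunLogger:
    def __init__(self, base_dir="~/ersilia_runs"):
        # Mirror the ersilia_data layout so the Splunk inputs app needs no changes (assumed layout).
        self.base_dir = os.path.expanduser(base_dir)
        os.makedirs(self.base_dir, exist_ok=True)

    def log_run(self, model_id, n_inputs, status="ok"):
        # One JSON document per run, which Splunk can ingest as a single event.
        record = {
            "run_id": str(uuid.uuid4()),
            "model_id": model_id,
            "n_inputs": n_inputs,
            "status": status,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        }
        path = os.path.join(self.base_dir, model_id, record["run_id"] + ".json")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(record, f, indent=2)
        return path


# Example usage (placeholder model identifier):
# RunLogger().log_run("eos0abc", n_inputs=100)
```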

Next steps

My suggestion would be to have an S3 bucket where we store the ersilia_runs folder. At the end of every run (locally, in GitHub Actions, etc.), the data will be uploaded to S3 (if permissions are available). Splunk will then monitor the S3 bucket and, whenever a change is made to it, the new data will be ingested and reflected in the dashboard that we already have.

miquelduranfrigola commented 1 year ago

Update

I have written a placeholder class, S3Logger, that will be able to upload lake calculations to S3.
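As a sketch of the idea (not the actual S3Logger implementation), the upload step could look roughly like this with boto3. The bucket name and key layout are assumptions, and credentials are expected to come from the environment.

```python
# Hypothetical sketch of the S3 upload step; bucket name and key layout are assumptions.
import os

import boto3
from botocore.exceptions import ClientError, NoCredentialsError


class S3Logger:
    def __init__(self, bucket="ersilia-runs-example"):  # placeholder bucket name
        self.bucket = bucket
        self.client = boto3.client("s3")

    def upload_folder(self, folder):
        # Upload every file under the local runs folder, keeping relative paths as S3 keys
        # so the bucket mirrors the folder structure Splunk already monitors locally.
        folder = os.path.expanduser(folder)
        for root, _, files in os.walk(folder):
            for name in files:
                local_path = os.path.join(root, name)
                key = os.path.relpath(local_path, folder)
                try:
                    self.client.upload_file(local_path, self.bucket, key)
                except (ClientError, NoCredentialsError):
                    # If permissions are not available, skip the upload as suggested above.
                    return False
        return True


# Example usage:
# S3Logger().upload_folder("~/ersilia_runs")
```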

Next steps:

honeyankit commented 1 year ago

@miquelduranfrigola Your solution is robust. Since the volume of logs will grow, we also need to consider the points below in the future:

Q. Why are we not writing directly to Splunk? Again, I am thinking from the perspective of latency.

miquelduranfrigola commented 1 year ago

Thanks @honeyankit

These are very important points. The way we have set up Splunk, it ingests data cumulatively, so we can actually overwrite the logs every time we do a calculation and Splunk will just ingest the updated log. This means that, from an S3 perspective, costs will be stable and negligible. I hope this makes sense.

In terms of latency, fortunately it is not a concern. We will use the Splunk server mainly for stats purposes (for example, to know how much we have used each model, or to provide usage statistics to our funders), so latency is really not a constraint. We will produce these reports on a monthly basis, or even every three months.

The reason why we are using S3 as an "intermediate" is that the folks at Splunk already set up the tool for us to read from an "always accessible" folder structure. I don't know how this could be adapted to GitHub Actions. Perhaps more importantly, we often make calculations from outside GitHub Actions, so in those cases S3 becomes a centralized place to deposit the log data.

miquelduranfrigola commented 9 months ago

Update: Harvard T4SG volunteers took over this task and they have delivered a neat solution.

Basically, Ersilia now contains a --track flag that allows us to upload files to an S3 bucket, which is then monitored by Splunk.

I am closing this issue for now.