ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0

Ingesting data into Splunk #706

Closed: miquelduranfrigola closed this issue 9 months ago

miquelduranfrigola commented 1 year ago

Background

We have set up a physical Splunk server with the goal of keeping track of all model precalculations. This is a critical step as we scale up the Ersilia Model Hub. For now, the server is not publicly accessible. Please reach out to @miquelduranfrigola if you want to know more.

Ersilia Model TA App

The Ersilia Model TA App is the main monitoring app for model precalculations in Ersilia. The app ingests data from a remote device. A receiving port (9997) was opened on the all-in-one instance.

Remote device with Splunk Universal Forwarder

Splunk Universal Forwarder is a lightweight version of Splunk used to send data. The following two apps should be placed on every machine running a Splunk Universal Forwarder. All model outputs should go to the same folder structure on each machine to avoid having to modify the inputs app.

An example folder containing this data is available here. Data in this folder should be placed in /var/log/ersilia_data.
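For reference, below is a minimal sketch of what the Universal Forwarder configuration on such a machine could look like. The indexer host, index name, and sourcetype are placeholders, not the values from the actual deployment.

```ini
# outputs.conf (sketch): forward events to the all-in-one instance on port 9997
[tcpout]
defaultGroup = ersilia_indexers

[tcpout:ersilia_indexers]
# Placeholder host; the real all-in-one instance is not publicly accessible
server = splunk.example.org:9997

# inputs.conf (sketch): monitor the shared folder structure
[monitor:///var/log/ersilia_data]
disabled = false
# Assumed index and sourcetype names
index = ersilia
sourcetype = ersilia_run
```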

At the moment, this is installed in a local computer. The goal of this issue is to migrate the ersilia_data folder to the cloud.

Ersilia CLI logger

We have written a RunLogger class that creates an ersilia_runs folder with the data in the format required by Splunk.

In practice, the ersilia_runs folder has exactly the same structure as the ersilia_data folder.
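For illustration only, here is a minimal sketch of what such a run logger could look like. The actual RunLogger in the Ersilia CLI may differ; the folder layout, field names, and model identifier below are assumptions.

```python
# Hypothetical sketch; not the actual RunLogger implementation in the Ersilia CLI.
import json
import os
import time
import uuid


class RunLogger:
    def __init__(self, base_dir="~/ersilia_runs"):
        # Mirror the ersilia_data layout so the Splunk inputs app needs no changes (assumed layout).
        self.base_dir = os.path.expanduser(base_dir)
        os.makedirs(self.base_dir, exist_ok=True)

    def log_run(self, model_id, n_inputs, status="ok"):
        # One JSON document per run, which Splunk can ingest as a single event.
        record = {
            "run_id": str(uuid.uuid4()),
            "model_id": model_id,
            "n_inputs": n_inputs,
            "status": status,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        }
        path = os.path.join(self.base_dir, model_id, record["run_id"] + ".json")
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(record, f, indent=2)
        return path


# Example usage (placeholder model identifier):
# RunLogger().log_run("eos0abc", n_inputs=100)
```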

Next steps

My suggestion would be to have an S3 bucket where we store the ersilia_runs folder. At the end of every run (locally, in GitHub Actions, etc.), the data will be uploaded to S3 (if permissions are available). Splunk will then monitor the S3 bucket and, whenever a change is made to it, the new data will be ingested and reflected in the dashboard that we already have.

miquelduranfrigola commented 1 year ago

Update

I have written a placeholder class, S3Logger, that will be able to upload lake calculations to S3.
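As a sketch of the idea (not the actual S3Logger implementation), the upload step could look roughly like this with boto3. The bucket name and key layout are assumptions, and credentials are expected to come from the environment.

```python
# Hypothetical sketch of the S3 upload step; bucket name and key layout are assumptions.
import os

import boto3
from botocore.exceptions import ClientError, NoCredentialsError


class S3Logger:
    def __init__(self, bucket="ersilia-runs-example"):  # placeholder bucket name
        self.bucket = bucket
        self.client = boto3.client("s3")

    def upload_folder(self, folder):
        # Upload every file under the local runs folder, keeping relative paths as S3 keys
        # so the bucket mirrors the folder structure Splunk already monitors locally.
        folder = os.path.expanduser(folder)
        for root, _, files in os.walk(folder):
            for name in files:
                local_path = os.path.join(root, name)
                key = os.path.relpath(local_path, folder)
                try:
                    self.client.upload_file(local_path, self.bucket, key)
                except (ClientError, NoCredentialsError):
                    # If permissions are not available, skip the upload as suggested above.
                    return False
        return True


# Example usage:
# S3Logger().upload_folder("~/ersilia_runs")
```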

Next steps:

honeyankit commented 1 year ago

@miquelduranfrigola Your solution is robust. Since the volume of logs will grow, we also need to consider the points below in the future:

Q. Why are we not writing directly to Splunk? Again, I am thinking from the perspective of latency.

miquelduranfrigola commented 1 year ago

Thanks @honeyankit

These are very important points. The way we have set up Splunk, it ingests data cumulatively, so we can actually overwrite the logs every time we do a calculation and Splunk will just ingest the updated log. This means that, from an S3 perspective, costs will be stable and negligible. I hope this makes sense.

In terms of latency, fortunately it is not a concern. We will use the Splunk server mainly for stats purposes (for example, to know how much we have used each model, or to provide usage statistics to our funders), so latency is really not a constraint. We will produce these reports on a monthly basis, or even every three months.

The reason why we are using S3 as an "intermediate" is that the folks at Splunk already set up the tool for us to read from an "always accessible" folder structure. I don't know how this could be adapted to GitHub Actions. Perhaps more importantly, we often make calculations from outside GitHub Actions, so in those cases S3 becomes a centralized place to deposit the log data.

miquelduranfrigola commented 9 months ago

Update: Harvard T4SG volunteers took over this task and they have delivered a neat solution.

Basically, Ersilia now contains a --track flag that allows us to upload files to an S3 bucket, which is then monitored by Splunk.

I am closing this issue for now.