gdcc / dataverse-kubernetes

Simple to use Dataverse container images and Kubernetes objects
http://k8s-docs.gdcc.io
Apache License 2.0
Add log persistence for Make Data Count #156

Open poikilotherm opened 4 years ago

poikilotherm commented 4 years ago

Since upstream release 4.18, you can just switch on the logging for Make Data Count. We should persist those log files somehow, so we can decide later how to process them.

Maybe use a sidecar to collect the logs and store them somewhere safe instead of storing them on a volume?

qqmyers commented 4 years ago

What's the reason for persistence (more than a volume)? Once these are processed, the results are back in Dataverse tables. Is the intent to allow reprocessing in the future?

pdurbin commented 4 years ago

This is a guess but perhaps @poikilotherm is thinking about multiple Glassfish instances. There's a note about Make Data Count at http://guides.dataverse.org/en/4.19/installation/advanced.html#multiple-glassfish-servers

poikilotherm commented 4 years ago

@qqmyers and @pdurbin thanks for asking and getting in touch.

My idea behind shipping those logs away from the containers is indeed about scaling, but also about avoiding too much persistence in the Dataverse app. IMHO those log files are similar to access logs, and those shouldn't be part of the application's persistence (which makes things overly complex, with too many volumes to handle) but should go into a log stack ASAP.

IMHO it makes more sense to handle such logs the same way access logs are handled nowadays: ingest them with something like the ELK stack, then query the index later to analyze the data. We might even think of pushing things into a separate Solr core, since Solr is already present in every Dataverse installation.

Feeding the index from log files written to disk or memory is really easy with sidecar containers, using tools like Logstash/Beats or Fluentd.
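
To make the sidecar idea concrete, here is a rough sketch of how such a Pod could look. Everything specific in it is an assumption for illustration: the Pod name, both image references, and the `/mdc-logs` directory are placeholders, not values from this repository or from the Dataverse guides. The shipper's own configuration (Filebeat, Logstash, or Fluentd) is left out on purpose.

```yaml
# Sketch only: names, images, and paths are placeholders, not taken from this repo.
apiVersion: v1
kind: Pod
metadata:
  name: dataverse-mdc-logs-example        # hypothetical name
spec:
  containers:
    - name: dataverse
      image: example/dataverse:placeholder  # placeholder for the Dataverse image
      volumeMounts:
        - name: mdc-logs
          mountPath: /mdc-logs              # assumed directory the Make Data Count logs are written to
    - name: log-shipper
      image: example/log-shipper:placeholder  # placeholder for a Filebeat/Logstash/Fluentd image
      volumeMounts:
        - name: mdc-logs
          mountPath: /mdc-logs
          readOnly: true                    # the sidecar only reads the log files
  volumes:
    - name: mdc-logs
      emptyDir: {}                          # shared scratch space that lives only as long as the Pod
```

The point of the `emptyDir` is that both containers see the same files without a persistent volume claim. The log files stay ephemeral on the Pod, and the durable copy lives in whatever index the shipper writes to.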