Excess detail consuming disk resources

DataONEorg / metrics-service

An efficient database and REST API for delivering aggregated data set metrics to clients.

Apache License 2.0


Excess detail consuming disk resources #97

Open datadavev opened 11 months ago

datadavev commented 11 months ago

Disk usage on logproc-stage-ucsb-1.test.dataone.org is running close to 95%. By default, Elasticsearch puts indices into read-only mode when disk usage reaches 95% (the flood-stage watermark) to avoid the errors and complications of a full disk.

To recover, reduce disk usage and then clear the read-only block with the command:

    curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

The vast majority of disk use is in the apacheperf-1 index, currently at around 720 GB, followed by eventlog-1 at around 144 GB.
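
One way to confirm which indices dominate is to ask the `_cat/indices` API for sizes in a fixed unit and sort the result. The following is a minimal sketch, assuming the node answers on localhost:9200 as in the command above; the helper name `largest_indices` is hypothetical and not part of this repository.

```shell
# Hypothetical helper: read "index size" pairs on stdin and print the
# N largest (default 5), biggest first.
largest_indices() {
  sort -k2,2nr | head -n "${1:-5}"
}

# Usage (assumes Elasticsearch on localhost:9200; bytes=gb asks for sizes in GB):
# curl -s 'http://localhost:9200/_cat/indices?h=index,store.size&bytes=gb' | largest_indices 3
```

Requesting a fixed unit with `bytes=gb` keeps the size column purely numeric, so a plain numeric sort is enough.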

mbjones commented 11 months ago

Thanks for this, @datadavev -- Nick is out, so it's hard to expand the filesystem at the moment, but we can do that later. In the short term, is there anything we can clean up to gain some headroom? I see about 10 GB of log data in /var/log in 3 subdirs:

1999    apache2
4026    elasticsearch
4129    journal

Maybe those can be trimmed some? Other ideas?

datadavev commented 11 months ago

Some space has been freed up, and I've slowed the firehose of events into the apacheperf-1 index by excluding events where the CN is calling itself through the API. That reduces the traffic considerably and gives some time for a more considered solution. The temporary fix was to adjust the Apache config on cn-ucsb-1 like this:

    #Performance logging
    # don't log self
    SetEnvIf Remote_Addr "128\.111\.85\.180" dontlog
    LogFormat "%{%Y-%m-%d}tT%{%T}t.%{msec_frac}t%{%z}t|%m|%>s|%{ms}T|%a|%U|\"%q\"|%{cache-status}e|\"%{User-agent}i\"|%u" performance_log
    CustomLog "/var/log/apache2/cn_perf.log" performance_log env=!dontlog

This is just a temporary fix to slow the deluge of events. Thing is, I'm not sure this information is even needed for the current metrics processing. I'm reviewing the code and Logstash configuration...
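
To gauge how much the `dontlog` exclusion saves, one could count self-originated lines in a performance log written before the change. This is a rough sketch, assuming the pipe-delimited LogFormat shown above, where the fifth field is `%a` (the client address); the helper name `self_traffic_share` is hypothetical.

```shell
# Hypothetical helper: report how many performance-log lines came from a
# given client address. Field 5 is %a (client IP) in the pipe-delimited
# LogFormat; earlier fields (timestamp, method, status, duration) contain
# no pipes, so splitting on "|" is safe up to that field.
self_traffic_share() {
  awk -F'|' -v ip="$1" '
    $5 == ip { self++ }
    END { printf "%d of %d lines from %s\n", self, NR, ip }
  '
}

# Usage (path taken from the CustomLog directive; use a log from before the change):
# self_traffic_share 128.111.85.180 < /var/log/apache2/cn_perf.log
```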