DataONEorg / metrics-service

An efficient database and REST API for delivering aggregated data set metrics to clients.
Apache License 2.0

Ensure Elastic Search indexes stay up to date #81

Open rushirajnenuji opened 3 years ago

rushirajnenuji commented 3 years ago

Ensure Elastic Search indexes eventlog-* and identifiers-* stay up to date.

csjx commented 3 years ago

@rushirajnenuji - To help investigate what look to be low metrics numbers in Elastic Search for the ADC node, I queried the access log table for January and February 2021 with:

SELECT count(docid) FROM access_log al 
      WHERE al.date_logged >= '2021-01-01' 
      AND date_logged < '2021-03-01' 
      AND lower(al.event) = 'read';

Which gives a result of 162,301 raw read events (which includes bots, etc. - no filtering).

Doing the same query in Kibana against the eventlog-* indices, we only get 17,807 hits:

http://localhost:5601/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:'2021-01-01T00:00:00.000Z',mode:absolute,to:'2021-03-01T00:00:00.000Z'))&_a=(columns:!(userAgent,pid,event),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'0576af90-5c30-11e8-acac-67c9290041c8',key:nodeId,negate:!f,params:(query:'urn:node:ARCTIC',type:phrase),type:phrase,value:'urn:node:ARCTIC'),query:(match:(nodeId:(query:'urn:node:ARCTIC',type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'0576af90-5c30-11e8-acac-67c9290041c8',key:event,negate:!f,params:(query:read,type:phrase),type:phrase,value:read),query:(match:(event:(query:read,type:phrase))))),index:'0576af90-5c30-11e8-acac-67c9290041c8',interval:auto,query:(language:lucene,query:''),sort:!('@timestamp',desc))
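
The same filters can also be written as a query body for the Elasticsearch `_count` API rather than a Kibana URL. This is a minimal sketch, assuming the field names `nodeId`, `event`, and `@timestamp` that appear in the URL above; the helper name `build_read_count_query` is hypothetical, and the end of the range is treated as exclusive to match the SQL query.

```python
import json


def build_read_count_query(node_id, start, end):
    """Build a _count query body mirroring the Kibana filters (hypothetical helper)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Same phrase filters as the Kibana discover view
                    {"match_phrase": {"nodeId": node_id}},
                    {"match_phrase": {"event": "read"}},
                    # Half-open time range, like the SQL >= / < pair
                    {"range": {"@timestamp": {"gte": start, "lt": end}}},
                ]
            }
        }
    }


body = build_read_count_query(
    "urn:node:ARCTIC",
    "2021-01-01T00:00:00.000Z",
    "2021-03-01T00:00:00.000Z",
)
print(json.dumps(body, indent=2))
```

The printed body could then be sent as a POST to `eventlog-*/_count` on the cluster, which returns only the number of matching documents.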

Looking at the logsolr index, I'm seeing 161,314:

curl "http://localhost:8983/solr/event_core/select?q=nodeId:urn\%3Anode\%3AARCTIC%20AND%20event:read&start=0&rows=1&wt=json&fq=dateLogged%3A%5B2021-01-01T00%3A00%3A00.000000Z+TO+2021-03-01T00%3A00%3A00.000000Z%5D&sort=dateLogged+ASC" | jq .response.numFound
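
To avoid hand-escaping the Solr URL, the request can be assembled with the standard library. This is a sketch only: the helper name `build_solr_count_url` is hypothetical, and it requests `rows=0` (an assumption, since only `numFound` is needed) while keeping the field names and inclusive date range used above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_count_url(base_url, node_id, start, end):
    """Build a Solr select URL that counts read events for one node (hypothetical helper)."""
    # Escape colons so Solr does not read them as field separators
    escaped_node = node_id.replace(":", r"\:")
    params = {
        "q": f"nodeId:{escaped_node} AND event:read",
        "fq": f"dateLogged:[{start} TO {end}]",
        "rows": 0,  # only response.numFound is needed
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_solr_count_url(
    "http://localhost:8983/solr/event_core/select",
    "urn:node:ARCTIC",
    "2021-01-01T00:00:00.000000Z",
    "2021-03-01T00:00:00.000000Z",
)
print(url)
# Decode the query string again to check what Solr will receive
print(parse_qs(urlsplit(url).query)["q"][0])
```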

So it looks like the CN log aggregator has missed about 1,000 events, but Elastic Search has missed 144,494. I will roll the last harvest date back for the ADC repository to try to pick up those missing 1,000, and will roll back the Filebeat/Logstash date to the beginning of the year to see if we pick up the missing 144K in the index. If so, all is good, and we just had network or scheduled forwarding job issues. If not, we likely have a bug in the metrics service code. I will ask Val to do the same SQL query for ESS-DIVE because they are also seeing low numbers. Thanks!
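
For reference, the gaps follow directly from the three counts reported in this comment. A small sketch of the arithmetic; the function and variable names are illustrative only.

```python
def missing_events(source_count, downstream_count):
    """Number of events present in the source but absent downstream."""
    return source_count - downstream_count


access_log_reads = 162_301   # Metacat access_log (SQL query)
logsolr_reads = 161_314      # CN log aggregator (Solr event_core)
elasticsearch_reads = 17_807 # eventlog-* indices (Kibana)

solr_gap = missing_events(access_log_reads, logsolr_reads)
es_gap = missing_events(access_log_reads, elasticsearch_reads)

print(solr_gap)  # 987
print(es_gap)    # 144494
```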

csjx commented 3 years ago

This relates to #83

csjx commented 3 years ago

Update: Val ran the same query against the Metacat access_log and got:

SELECT count(docid) FROM access_log al
      WHERE al.date_logged >= '2021-01-01'
      AND date_logged < '2021-03-01'
      AND lower(al.event) = 'read';
 count  
--------
 154866
(1 row)

The logsolr core has 154866 documents, so the issue is definitely in the filebeat leg of the pipeline for ESS_DIVE.

Looking at the same query in Elastic Search, there are 0 events. I still need to roll back the filebeat forwarder.

vchendrix commented 3 years ago

@csjx Any update on this?

csjx commented 3 years ago

Hi @vchendrix - Yes, I rolled back the Elastic Search filebeat forwarder files to 2018-01-01 because we also had missing event content from other MNs back to 2018 due to harvesting issues (certs, network, etc.). Looking at Kibana for the ADC from 2021-01-01 to 2021-03-01, the indexed value of raw read events is still 17,807. We have millions of events to process, so I think it will take some time.

For ESS-DIVE, there are 5,194 raw read events in ES for that 3-month time frame, but I expect more to be picked up. While we are importing the events via Filebeat and Logstash, the raw events then get processed into sessions, with bots, double-clicks, etc. filtered out, and that process is what takes time.
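
To illustrate why that step is more than a simple copy, here is a rough sketch of one kind of filtering, collapsing double-clicks. It is not the metrics service's actual code: the 30-second window, the `(session_id, pid, timestamp)` event shape, and the function name are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def drop_double_clicks(events, window=timedelta(seconds=30)):
    """Keep one event per (session, pid) within each run of rapid repeats.

    `events` is an iterable of (session_id, pid, timestamp) tuples sorted by
    timestamp. The 30-second window is an assumed value, not a confirmed setting.
    """
    last_seen = {}
    kept = []
    for session_id, pid, ts in events:
        key = (session_id, pid)
        prev = last_seen.get(key)
        # Keep the event only if this pid was not just read in the same session
        if prev is None or ts - prev > window:
            kept.append((session_id, pid, ts))
        # Always remember the latest click so chained repeats keep collapsing
        last_seen[key] = ts
    return kept


t0 = datetime(2021, 1, 1, 0, 0, 0)
sample = [
    ("s1", "pidA", t0),
    ("s2", "pidA", t0 + timedelta(seconds=5)),
    ("s1", "pidA", t0 + timedelta(seconds=10)),  # repeat within the window
    ("s1", "pidA", t0 + timedelta(seconds=50)),  # new read after the window
]
filtered = drop_double_clicks(sample)
print(len(filtered))  # 3
```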

@rushirajnenuji - Can you give an estimate of the number of events that still need processing? I don't recall the ES search to do that.

rushirajnenuji commented 3 years ago

Hi @csjx, for the range 2021-01-01 to 2021-04-01 I'm seeing 2464 raw events for the ESS_DIVE nodeId in ES that still need processing. After filtering out the d1_admin_subject tags, we are left with 706 events (380 metadata reads, 326 data reads).

For the date range 2021-01-01 to 2021-03-01, those counts are:
raw: 2290 unprocessed events (945 data read, 1345 metadata read events)
after filtering out d1_admin_subject events: 671 unprocessed events (318 data read, 353 metadata read events)

rushirajnenuji commented 2 years ago

The identifiers-* index had about 30K missing identifiers. Given that this index is primarily used to populate and index portal metrics, it is important that this index stays in sync with DataONE CN.

Current status:

Next steps: