DataONEorg / metrics-service

An efficient database and REST API for delivering aggregated data set metrics to clients.
Apache License 2.0

Duplicate events in metrics service result in inconsistent count #86

Open rushirajnenuji opened 2 years ago

rushirajnenuji commented 2 years ago

Following reports about the metrics displayed in the user interface for the ESS DIVE MN and the ADC MN, we investigated the numbers at different stages of the metrics pipeline. Here is the link to the attached reports for 2020 and 2021 (YTD) for ADC and ESS DIVE.

Based on the stats in the above reports, there appear to be far more events in the ES index than reported by the MN (roughly twice, and in some cases three times, as many) - i.e., duplicate entries for the same event in the ES index. Because of a log aggregation bug we had to reprocess certain events, and that looks like the source of these duplicates.

To verify that we have duplicates in our ES index, I wrote a script that takes a random month from the above spreadsheet where we suspected duplicated events (May 2020) for the ESS DIVE node and queries the frequency of the entryId associated with those events. The results can be found here, with counts sorted in descending order. From 2020-05-01 to 2020-06-01 (for ESS DIVE), each event was indexed in our system at least 3 times; ideally the entryId count would be 1 for each event recorded in ES. Similar patterns appear across other member nodes at different times. A sketch of the frequency check is shown below.
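A minimal sketch of that frequency check, assuming the events live in an index named `eventlog-*` with `nodeId`, `entryId`, and `dateLogged` fields and a local ES endpoint (these names are assumptions for illustration, not the exact script that was run):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed ES endpoint

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"nodeId": "urn:node:ESS_DIVE"}},
                {"range": {"dateLogged": {"gte": "2020-05-01", "lt": "2020-06-01"}}},
            ]
        }
    },
    "aggs": {
        # Count how many documents share each entryId; anything above 1 is a duplicate.
        "events_per_entryId": {
            "terms": {"field": "entryId", "size": 10000, "order": {"_count": "desc"}}
        }
    },
}

resp = es.search(index="eventlog-*", body=query)
for bucket in resp["aggregations"]["events_per_entryId"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])  # entryId, number of indexed copies
```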

Based on the gist above, I picked a random entryId (669246) and looked up the corresponding documents in the ES index. It turns out that entryId is only unique within a single MN, not across all DataONE MNs: other MNs had the same entryId generated by their log aggregation service for a different DataONE object. So, to keep duplicates from being reported by the metrics service, we need a combination of nodeId and entryId (see the sketch below).
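A small sketch of that cross-node check, under the same index and field-name assumptions as the earlier sketch: look up a single entryId and aggregate on nodeId. More than one bucket in the response means the entryId is not globally unique.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed ES endpoint

query = {
    "size": 0,
    "query": {"term": {"entryId": 669246}},
    "aggs": {
        # Each bucket is a member node that logged this entryId.
        "nodes_with_this_entryId": {"terms": {"field": "nodeId", "size": 100}}
    },
}

resp = es.search(index="eventlog-*", body=query)
for bucket in resp["aggregations"]["nodes_with_this_entryId"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```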

To fix this, we added a new field to the ES index - eventId (keyword ES datatype, which lets us perform advanced aggregations) - that follows the pattern {nodeId}:{entryId} (e.g. urn:node:ESS_DIVE:669246), and we updated the Logstash config so that new incoming events get this field. To address the already existing events, we are running a script that goes through the entire index and assigns an eventId to each existing event (a sketch of the backfill is below). This is still in progress: at the time of filing this ticket we have processed 102.1M of 187.2M events, and at the current rate the script should need roughly 12 more hours to finish.
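A sketch of the backfill approach, assuming each document already carries `nodeId` and `entryId` and that `eventId` has been added to the mapping as a keyword (an illustration of the idea, not the exact script that is running):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed ES endpoint

backfill = {
    "query": {
        # Only touch documents that do not have an eventId yet.
        "bool": {"must_not": {"exists": {"field": "eventId"}}}
    },
    "script": {
        "lang": "painless",
        # Build eventId as {nodeId}:{entryId}, e.g. urn:node:ESS_DIVE:669246.
        "source": "ctx._source.eventId = ctx._source.nodeId + ':' + ctx._source.entryId",
    },
}

# conflicts='proceed' skips documents updated concurrently (e.g. by Logstash);
# wait_for_completion=False runs the update as a background task on the cluster.
task = es.update_by_query(
    index="eventlog-*", body=backfill, conflicts="proceed", wait_for_completion=False
)
print(task)  # contains the task id, useful for monitoring progress
```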

Once the eventId is assigned, we can use it to keep duplicate events out of the calculations. For this, we'll add a cardinality aggregation to our current ES aggregation query so that only unique events, identified by {nodeId}:{entryId}, are counted (sketched below). This change touches both the queries the metrics service sends to ES and the methods that handle the response returned by ES, and it affects all query types supported by the DataONE metrics service - dataset landing page, portal metrics, repository metrics, etc. This is currently a work in progress - estimate: 1 day for code changes and 1-2 days for testing before we draft a release.
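A simplified sketch of the unique-count change: instead of counting raw documents, add a cardinality aggregation on eventId so duplicates collapse to one. The filter shown here is a placeholder; the real metrics-service queries carry the filters for the landing page, portal, or repository being requested.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed ES endpoint

query = {
    "size": 0,
    "query": {"term": {"nodeId": "urn:node:ESS_DIVE"}},  # placeholder filter
    "aggs": {
        # Approximate count of distinct eventId values; duplicated events
        # with the same {nodeId}:{entryId} are counted once.
        "unique_events": {"cardinality": {"field": "eventId"}}
    },
}

resp = es.search(index="eventlog-*", body=query)
print(resp["aggregations"]["unique_events"]["value"])
```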

mbjones commented 2 years ago

Thanks for the great summary, @rushirajnenuji

The proposed solution seems like a good way to ensure we have a unique primary key for each record, but it also leaves a lot of extraneous duplicate records in ES. In addition to filtering on the unique {nodeId}:{entryId} combos, can you also delete the duplicated entries to create a smaller dataset? Given the number of triplicated records, might this help our speed a lot too?

rushirajnenuji commented 2 years ago

Thank you for reviewing this, @mbjones. Yes, Dave also proposed the exact same thing. It would make sense to delete the duplicate events, and it would help speed up our aggregation queries too. I'll file a ticket for this and prioritize it as the next step of this issue (one possible approach is sketched below). Thank you!
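One possible cleanup approach, as a sketch under the same index and field-name assumptions as the earlier snippets (not the planned implementation): keep the first document seen for each eventId and bulk-delete the rest. At the real index size this would need to be partitioned, e.g. per node and per month, to bound memory.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch(["http://localhost:9200"])  # assumed ES endpoint

seen = set()
delete_actions = []

# Walk every document that has an eventId; the first occurrence of each
# eventId is kept, later occurrences are queued for deletion.
for doc in scan(
    es,
    index="eventlog-*",
    query={"query": {"exists": {"field": "eventId"}}},
    _source=["eventId"],
):
    event_id = doc["_source"]["eventId"]
    if event_id in seen:
        delete_actions.append(
            {"_op_type": "delete", "_index": doc["_index"], "_id": doc["_id"]}
        )
    else:
        seen.add(event_id)

bulk(es, delete_actions)
print("deleted", len(delete_actions), "duplicate documents")
```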