cybergreen-net / pm

Tech project management repo (issue tracker only)

[dw] Time point for a log entry from a given scan #70

Closed rufuspollock closed 7 years ago

rufuspollock commented 7 years ago

Question: do scans for a given risk always complete within a given day? If not, what day timestamp should be assigned to log entries from that scan?

For example, suppose you have a scan that starts on Monday 31st August and then runs into Tuesday 1st September. That data will all be in the same week but in different months. You can't aggregate directly from weeks to months in a natural way (weeks are not a subset of months). The obvious unit of analysis is therefore days, but I am not clear on our logic here.
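To make the week/month mismatch concrete, here is a minimal sketch. It assumes ISO 8601 weeks (Monday start) and picks 2015 purely for illustration, since that is a year in which 31st August falls on a Monday; neither the week convention nor the year is stated in this thread.

```python
from datetime import date


def iso_week(d):
    """Return (ISO year, ISO week number) for a date."""
    iso = d.isocalendar()
    return (iso[0], iso[1])


# Assumed year: 2015, chosen only because 31 August is a Monday in that year.
start = date(2015, 8, 31)  # Monday, scan begins
end = date(2015, 9, 1)     # Tuesday, scan still running

# Both days fall in the same ISO week...
same_week = iso_week(start) == iso_week(end)
# ...but in different calendar months.
same_month = (start.year, start.month) == (end.year, end.month)

print(same_week, same_month)
```

A weekly total for that week cannot be split between August and September after the fact, whereas daily totals roll up cleanly to either weeks or months.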

Should we assign all results from a given scan to the day when the scan began? Do we do that at the moment?

This arose from work on planning the new iteration of the data warehouse in #69.

rufuspollock commented 7 years ago

@chorsley said:

On "do scans for a given risk always complete within a given day": they do currently, but we can't assume that will always be the case

... yes, I think using the first date in each batch of data for assigning to a week would be the way to go. That would account for the case where a long-running scan goes across week boundaries.

rufuspollock commented 7 years ago

@chorsley one question here: what about scans for different risks? Are the scans for different risks synchronized? If not, we may have an issue when someone looks across risks, because the data for one risk's scan falls in one week whilst the data for another risk falls in a different week.

/cc @aaronkaplan

chorsley commented 7 years ago

@rgrp we should assume:

So, I'd suggest:

rufuspollock commented 7 years ago

@chorsley wrote:

The issue we have here is that the scans occur on different days that aren't always the start of the week. For example, last week's scans were on:

openntp: 2016-11-18
openssdp: 2016-11-19
opendns: 2016-11-20
opensnmp: 2016-11-22

This is a particular quirk of the open* data sets we initially get, however.

A second issue: later, we'd expect to receive new data feeds that span multiple days because of the length of the scan process, or because it's continuously collected over the course of a week. We'll need some way to tie this to a particular week so the data is comparable between risk types.

So, going to week-of-year as a primary means of date selection would let us easily compare the four above for a start, without the API user needing to do their own complex date calculations requiring knowledge of the peculiarities of each scan data's schedule.

This would likely need to be combined with the idea of fixing all data in a single data batch to a particular week (e.g. of first record in batch), in the case of the scan process bridging across multiple weeks of the year.
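A rough sketch of the "pin the whole batch to the week of its first record" idea might look like the following. The function name `batch_week`, the use of ISO 8601 week numbering, and the sample batch are all assumptions for illustration, not something settled in this thread.

```python
from datetime import date


def batch_week(record_dates):
    """Return (ISO year, ISO week) of the earliest record in a batch.

    Every record in the batch would be reported under this week,
    even if later records fall into the following week.
    """
    first = min(record_dates)
    iso = first.isocalendar()
    return (iso[0], iso[1])


# Hypothetical batch whose records straddle a Sunday/Monday week boundary.
batch = [date(2016, 11, 21), date(2016, 11, 20), date(2016, 11, 21)]
week = batch_week(batch)

print(week)
```

Note that the result depends on the week convention chosen (ISO weeks start on Monday), so that choice would need to be fixed across all risk types for the weeks to be comparable.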

aaronkaplan commented 7 years ago

It does not matter when the scan starts or ends. Just use the timestamp of the log line, and aggregate by day.
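As a minimal sketch of that approach: each log entry is bucketed by the calendar day of its own timestamp, regardless of which scan run produced it. The entry layout (`risk` and `timestamp` keys), the function name `count_by_day`, and the sample values are hypothetical, and the timestamps are assumed to be naive ISO 8601 strings already in a single agreed timezone.

```python
from collections import Counter
from datetime import datetime


def count_by_day(entries):
    """Count log entries per calendar day, keyed by 'YYYY-MM-DD'."""
    counts = Counter()
    for entry in entries:
        ts = datetime.fromisoformat(entry["timestamp"])
        counts[ts.date().isoformat()] += 1
    return dict(counts)


# Hypothetical entries from one scan that runs past midnight.
entries = [
    {"risk": "openntp", "timestamp": "2016-11-18T23:58:00"},
    {"risk": "openntp", "timestamp": "2016-11-19T00:03:00"},
    {"risk": "openntp", "timestamp": "2016-11-19T00:07:00"},
]
daily = count_by_day(entries)

print(daily)
```

With daily counts as the base unit, weekly or monthly views can be derived later by summing days, without needing to know each scan's schedule.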

rufuspollock commented 7 years ago

FIXED. We have a business decision 😄