[dw] Time point for a log entry from a given scan

rufuspollock commented 7 years ago

Question: do scans for a given risk always complete within a given day? If not, which day's timestamp should be assigned to log entries from that scan?

For example, suppose you have a scan that starts on Monday 31st August and then runs into Tuesday 1st September. That data will all be in the same week but in different months. You can’t aggregate directly from weeks to months in a natural way (weeks are not a subset of months). The obvious unit of analysis is therefore days, but I am not clear on our logic here.

Should we assign all results from a given scan to the day when the scan began? Do we do that at the moment?

This arose from work on planning the new iteration of the data warehouse in #69:

We need to assign counts per day so we can aggregate up to week and month independently (otherwise we get inconsistencies). We cannot work with weeks as they do not aggregate up to months (weeks are not a subset of months).
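To illustrate the point about days being the base unit, here is a minimal sketch (not the data warehouse code) that rolls the same per-day counts up to weeks and to months independently. The dates and counts are made up, and ISO 8601 week numbering is an assumption; the thread does not say which week convention is in use.

```python
from collections import Counter
from datetime import date

# Hypothetical per-day counts; dates and numbers are illustrative only.
daily_counts = {
    date(2015, 8, 31): 10,  # a Monday in August
    date(2015, 9, 1): 20,   # the Tuesday after, already in September
    date(2015, 9, 7): 5,    # Monday of the following week
}


def aggregate(daily, key):
    """Sum per-day counts into buckets chosen by key(day)."""
    totals = Counter()
    for day, count in daily.items():
        totals[key(day)] += count
    return dict(totals)


def iso_week(day):
    """Bucket key: (ISO year, ISO week number). ISO weeks are an assumption."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def month(day):
    """Bucket key: (calendar year, calendar month)."""
    return (day.year, day.month)


by_week = aggregate(daily_counts, iso_week)
by_month = aggregate(daily_counts, month)
```

The first two days land in the same week bucket but in different month buckets, so the monthly totals can only be derived from the daily rows, not from the weekly ones.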

rufuspollock commented 7 years ago

@chorsley said:

On "do scans for a given risk always complete within a given day": they do currently, but we can't assume that will always be the case

... yes, I think using the first date in each batch of data for assigning to a week would be the way to go. That would account for the case where a long-running scan goes across week boundaries.

rufuspollock commented 7 years ago

@chorsley one question here: what about scans for different risks? Are different risk scans synchronized? If not, we may have an issue when someone looks across risks, because data for one risk's scan falls in one week whilst data for another risk falls in a different week.

/cc @aaronkaplan

chorsley commented 7 years ago

@rgrp we should assume:

Different data feeds / scans are not synchronised in any way (and they are not for the most part);
Data collection runs at least once a week;
Data collection may run beyond a single date, or continuously throughout the week.

So, I'd suggest:

We tie a single batch of data to the week of the year, not a single date.
For continuously collected data (don't have this yet, but can expect it), we define a time window that defines the week it's bound to.
We don't integrate data feeds that are collected less often than weekly.

rufuspollock commented 7 years ago

@chorsley wrote:

The issue we have here is that the scans occur on different days that aren't always the start of the week. For example, last week's scans were on:

openntp: 2016-11-18
openssdp: 2016-11-19
opendns: 2016-11-20
opensnmp: 2016-11-22

This is a particular quirk of the open* data sets we initially get, however.

A second issue: later, we'd expect to receive new data feeds that span multiple days because of the length of the scan process, or because it's continuously collected over the course of a week. We'll need some way to tie this to a particular week so the data is comparable between risk types.

So, going to week-of-year as a primary means of date selection would let us easily compare the four above for a start, without the API user needing to do their own complex date calculations requiring knowledge of the peculiarities of each scan data's schedule.

This would likely need to be combined with the idea of fixing all data in a single data batch to a particular week (e.g. of first record in batch), in the case of the scan process bridging across multiple weeks of the year.
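A rough sketch of that batch-pinning idea, for discussion only: every record in a batch is labelled with the week of the batch's first record. The function name `batch_week`, the sample timestamps, the use of ISO 8601 weeks, and treating "first record" as the earliest timestamp are all assumptions, not something the existing pipeline is known to do.

```python
from datetime import datetime


def batch_week(timestamps):
    """Return (ISO year, ISO week) of the earliest timestamp in a batch.

    Assumes "first record in batch" means the earliest timestamp.
    """
    first = min(timestamps)
    iso = first.isocalendar()
    return (iso[0], iso[1])


# Hypothetical batch that starts on a Sunday evening and finishes on Monday,
# so it crosses an ISO week boundary.
batch = [
    datetime(2016, 11, 20, 23, 50),
    datetime(2016, 11, 21, 0, 10),
    datetime(2016, 11, 21, 1, 30),
]

week = batch_week(batch)

# Every record in the batch carries the same week label.
labelled = [(ts, week) for ts in batch]
```

Under this scheme the two Monday records are reported in the previous week, which keeps the batch together at the cost of shifting some records away from the week their own timestamps fall in.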

aaronkaplan commented 7 years ago

It does not matter when the scan starts or ends. Just use the timestamp of the log line and aggregate by day.
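A minimal sketch of what aggregating by log-line timestamp could look like. The log line layout (an ISO 8601 timestamp plus a risk name), the sample values, and the function name `count_per_day` are assumptions for illustration, not the actual feed format.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical log lines: (timestamp string, risk name).
log_lines = [
    ("2016-11-18T23:59:10", "openntp"),
    ("2016-11-19T00:00:05", "openntp"),   # same scan, next calendar day
    ("2016-11-19T08:15:00", "openssdp"),
]


def count_per_day(lines):
    """Count log lines per (day, risk), using each line's own timestamp."""
    counts = Counter()
    for stamp, risk in lines:
        day = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").date()
        counts[(day, risk)] += 1
    return dict(counts)


daily = count_per_day(log_lines)
```

A scan that runs past midnight simply contributes to two daily buckets, so no per-scan start or end date needs to be tracked.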

rufuspollock commented 7 years ago

FIXED. We have a business decision 😄

cybergreen-net / pm

[dw] Time point for a log entry from a given scan #70