Open zelima opened 7 years ago
@aaronkaplan can you tell us what you think here? Relates back to #42
Across different feeds? There is only one file per week per feed. We're basing everything on a week, so if you are doing a month it needs to be the average of the weeks that fall in that month. I'm very much in favour of moving to ISO8601 weeks to make all of this clearer.
@kxyne what do you mean by "average of weeks that fall in that month"? Could you give the result output for this example scan results:
# data loaded into table
ts,ip,risk_id,asn,cc
2016-11-28T00:00:00+00:00,111.11.111.11,2,33333,GB
2016-11-28T00:00:00+00:00,555.55.555.55,2,33333,GB
2016-12-05T00:00:00+00:00,111.11.111.11,2,33333,GB
2016-12-12T00:00:00+00:00,111.11.111.11,2,33333,GB
2016-12-19T00:00:00+00:00,111.11.111.11,2,33333,GB
Results I think should be:
date,risk,country,asn,count
2016-11-28T00:00:00+00:00,2,GB,33333,2
2016-12-05T00:00:00+00:00,2,GB,33333,1
2016-12-12T00:00:00+00:00,2,GB,33333,1
2016-12-19T00:00:00+00:00,2,GB,33333,1
date,risk,country,asn,count
2016-11-01T00:00:00+00:00,2,GB,33333,2
2016-12-01T00:00:00+00:00,2,GB,33333,1 #count is still 1 cause of same IP within same month
date,risk,country,asn,count
2016-01-01T00:00:00+00:00,2,GB,33333,2
In this scenario we can not just sum up monthly counts to get result for quarter or year
@kxyne Also reason we are switching aggregation on day level is that there may be case when scan won't end within same day - it may start in one month and end in next, but within same week.
Eg: scan started on 2016-12-31
an ended on 2017-01-01
- same week, but different month, quarter and year
Somehow I didn't get the update on this one. For sake of argument I'd probably assign weeks to the month that their Wednesday falls in, the ISO standard has no guide for this. We've been looking at how we process in the ETL when it comes to weeks (things like "which file do I pick up?" and using the ISO week that days fall in makes it all nicely aligned across the feeds, otherwise it'll sawtooth between threats based on the day the scan happens on.
@kxyne sorry for late respond... Ok, think I've got you re week/month splits. What about deduplicating? What if the same IP within the same risk comes up in 3 different weeks - should they be counted as 3 or 1, when let's say we want to see annual results?
Well we deduplicate by the week only, and we average the weeks across a month/year, not cumulative counts.
Either that or we re-do everything and pre-aggregate on the back end before deduplication but that could be a later enhancement.
The way I see it we're interested in trends, not counts, but maybe we should throw this to the stats group?
@rufuspollock this make me think to switch back to week as base granularity... What would you say?
@aaronkaplan @kxyne this is something to discuss on next tech team call.
@rufuspollock @aaronkaplan @kxyne we should include this into the list of items to be discussed on next team call. Moving to current milestone
Moving to backlog as this needs a team discussion.
Currently IPs are deduplicated on week level, meaning - if there is same IP within same week and risk - it is ignored (not counted).
When this is done on day level, it will result deduplicated IPs within same day. Meaning - simple Rollup by date just won't work, cause sum of daily counts, for any time granularity, won't be accurate (they'll include tons of duplicated IPs)
This may lead to create different fact tables for different time granularities.