cybergreen-net / pm

Tech project management repo (issue tracker only)

Deduplicating IPs within different time granularities #75

Open zelima opened 7 years ago

zelima commented 7 years ago

Currently IPs are deduplicated at the week level, meaning that if the same IP shows up within the same week for the same risk, it is ignored (not counted again).

When this is done at the day level, IPs will only be deduplicated within the same day. This means a simple rollup by date just won't work, because summing the daily counts, for any coarser time granularity, won't be accurate (the sums will include tons of duplicated IPs).

This may lead us to create different fact tables for different time granularities.
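
A minimal sketch of the over-counting problem, assuming the dedup key is (ip, risk) and using made-up events where the same IP is seen on two days of the same ISO week (all names here are illustrative, not the actual ETL code):

from datetime import date

# Hypothetical raw events: the same IP/risk seen on two days of the same ISO week
events = [
    (date(2016, 11, 28), "111.11.111.11", 2),  # Monday of 2016-W48
    (date(2016, 11, 30), "111.11.111.11", 2),  # Wednesday of the same week
]

def distinct_per_period(events, key):
    # One "fact table" per granularity: distinct (ip, risk) pairs per period
    seen = {}
    for d, ip, risk in events:
        seen.setdefault(key(d), set()).add((ip, risk))
    return {period: len(pairs) for period, pairs in seen.items()}

daily = distinct_per_period(events, lambda d: d)                     # dedup per day
weekly = distinct_per_period(events, lambda d: d.isocalendar()[:2])  # dedup per ISO week

print(sum(daily.values()))   # 2 -- rolling up day-level counts double-counts the IP
print(sum(weekly.values()))  # 1 -- week-level dedup counts it once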

rufuspollock commented 7 years ago

@aaronkaplan can you tell us what you think here? Relates back to #42

kxyne commented 7 years ago

Across different feeds? There is only one file per week per feed. We're basing everything on a week, so if you are doing a month, it needs to be the average of the weeks that fall in that month. I'm very much in favour of moving to ISO8601 weeks to make all of this clearer.

zelima commented 7 years ago

@kxyne what do you mean by "average of weeks that fall in that month"? Could you give the output you would expect for these example scan results:

# data loaded into table
ts,ip,risk_id,asn,cc
2016-11-28T00:00:00+00:00,111.11.111.11,2,33333,GB 
2016-11-28T00:00:00+00:00,555.55.555.55,2,33333,GB 
2016-12-05T00:00:00+00:00,111.11.111.11,2,33333,GB  
2016-12-12T00:00:00+00:00,111.11.111.11,2,33333,GB
2016-12-19T00:00:00+00:00,111.11.111.11,2,33333,GB 

Results I think should be:

In this scenario we cannot just sum up the monthly counts to get a result for the quarter or the year.
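
To make that concrete with the rows above (a rough check, assuming the intended dedup key is (ip, risk_id) and that each period counts distinct pairs):

from datetime import date

rows = [
    (date(2016, 11, 28), "111.11.111.11", 2),  # 2016-W48
    (date(2016, 11, 28), "555.55.555.55", 2),  # 2016-W48
    (date(2016, 12, 5),  "111.11.111.11", 2),  # 2016-W49
    (date(2016, 12, 12), "111.11.111.11", 2),  # 2016-W50
    (date(2016, 12, 19), "111.11.111.11", 2),  # 2016-W51
]

monthly = {}    # (year, month)   -> set of (ip, risk)
quarterly = {}  # (year, quarter) -> set of (ip, risk)
for d, ip, risk in rows:
    monthly.setdefault((d.year, d.month), set()).add((ip, risk))
    quarterly.setdefault((d.year, (d.month - 1) // 3 + 1), set()).add((ip, risk))

print({k: len(v) for k, v in monthly.items()})
# {(2016, 11): 2, (2016, 12): 1} -- the monthly counts sum to 3
print({k: len(v) for k, v in quarterly.items()})
# {(2016, 4): 2} -- but Q4 only has 2 distinct IPs, so summing months over-counts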

zelima commented 7 years ago

@kxyne Also, the reason we are switching aggregation to the day level is that there may be cases where a scan won't end within the same day - it may start in one month and end in the next, but within the same week. E.g. a scan that started on 2016-12-31 and ended on 2017-01-01 - same week, but a different month, quarter and year.
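
For reference, both of those dates fall in the same ISO week (2016-W52), which is easy to check with the Python standard library:

from datetime import date

print(date(2016, 12, 31).isocalendar()[:2])  # (2016, 52) -- Saturday of 2016-W52
print(date(2017, 1, 1).isocalendar()[:2])    # (2016, 52) -- Sunday of the same ISO week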

kxyne commented 7 years ago

Somehow I didn't get the update on this one. For the sake of argument I'd probably assign weeks to the month that their Wednesday falls in; the ISO standard gives no guidance on this. We've been looking at how we handle weeks in the ETL (things like "which file do I pick up?"), and using the ISO week that days fall in makes it all nicely aligned across the feeds; otherwise it'll sawtooth between threats based on the day the scan happens on.
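
A sketch of that Wednesday rule (assuming Python 3.8+ for date.fromisocalendar; the function name here is only illustrative):

from datetime import date

def month_for_iso_week(iso_year, iso_week):
    # Assign the ISO week to the month its Wednesday (ISO weekday 3) falls in
    wednesday = date.fromisocalendar(iso_year, iso_week, 3)
    return (wednesday.year, wednesday.month)

print(month_for_iso_week(2016, 52))  # (2016, 12) -- 2016-W52 spans the year boundary,
                                     # but its Wednesday (2016-12-28) is in December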

zelima commented 7 years ago

@kxyne sorry for the late response... OK, I think I've got you re the week/month splits. What about deduplication? If the same IP with the same risk comes up in 3 different weeks, should it be counted as 3 or as 1 when, say, we want to see annual results?

kxyne commented 7 years ago

Well, we deduplicate by the week only, and we average the weeks across a month/year rather than summing cumulative counts.
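
A rough sketch of that, using the weekly deduplicated counts for December 2016 from the example data earlier in the thread (2016-W49 through 2016-W51, one distinct IP each; only the weeks present in that example are included):

# Weekly deduplicated counts for December 2016, from the example data above
weekly_counts = {(2016, 49): 1, (2016, 50): 1, (2016, 51): 1}

# The month-level figure is the average of the weekly counts, not their sum
december = sum(weekly_counts.values()) / len(weekly_counts)
print(december)  # 1.0 -- a plain sum would give 3 and triple-count the same IP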

Either that, or we re-do everything and pre-aggregate on the back end before deduplication, but that could be a later enhancement.

kxyne commented 7 years ago

The way I see it we're interested in trends, not counts, but maybe we should throw this to the stats group?

zelima commented 7 years ago

@rufuspollock this makes me think we should switch back to week as the base granularity... What would you say?

rufuspollock commented 7 years ago

@aaronkaplan @kxyne this is something to discuss on next tech team call.

zelima commented 7 years ago

@rufuspollock @aaronkaplan @kxyne we should include this in the list of items to be discussed on the next team call. Moving to the current milestone.

rufuspollock commented 7 years ago

Moving to backlog as this needs a team discussion.