Aircloak / aircloak

This repository contains the Aircloak Air frontend as well as the code for our Cloak query and anonymization platform
2 stars 0 forks source link

anonymization / Enforced fixed alignment for datetime #681

Closed yoid2000 closed 8 years ago

yoid2000 commented 8 years ago

This issue expands on #676.

For datetime types, alignment takes place as described in #676 (so-called "125" alignment), but on a per-time component basis (second, hour, day, month, and year).

Specifically, given the following datetime syntaxes: YYYY-MM-DD YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss.nnnnnn hh:mm:ss hh:mm:ss.nnnnnn

Each of the separate fields (i.e. YYYY, hh, etc.) individually follow the "125" alignment.

The lower boundary is inclusive (>=), and the upper boundary exclusive (<).

The following examples of valid and invalid queries hopefully makes the above clear:

Years: Valid: col1 >= '2010-01-01' AND col1 < '2011-01-01' col1 >= '2010-01-01 00:00:00' AND col1 < '2011-01-01 00:00:00' col1 >= '2010-01-01' AND col1 < '2015-01-01' Invalid because year isn't "125" aligned: col1 >= '2010-01-01' AND col1 < '2013-01-01'

Months: Valid: col1 >= '2010-01-01' AND col1 < '2010-02-01' col1 >= '2010-12-01' AND col1 < '2011-01-01' col1 >= '2010-01-01 00:00:00' AND col1 < '2010-02-01 00:00:00' col1 >= '2010-02-01' AND col1 < '2010-04-01' col1 >= '2010-05-01' AND col1 < '2010-10-01' col1 >= '2010-12-01' AND col1 < '2011-02-01' col1 >= '2010-10-01' AND col1 < '2011-03-01' Invalid because not "125" aligned: col1 >= '2010-02-01' AND col1 < '2010-05-01' col1 >= '2010-10-01' AND col1 < '2011-02-01' col1 >= '2010-06-01' AND col1 < '2011-06-01' (note that this last one is exactly one year, but is invalid because it is 12 months in length, which is not a valid '125' boundary)

Days: Valid: col1 >= '2010-01-04' AND col1 < '2010-01-05' col1 >= '2010-01-31' AND col1 < '2010-02-01' col1 >= '2010-01-04 00:00:00' AND col1 < '2010-01-05 00:00:00' col1 >= '2010-01-10' AND col1 < '2010-01-12' col1 >= '2010-01-10' AND col1 < '2010-01-20' col1 >= '2010-01-30' AND col1 < '2010-02-09' col1 >= '2010-04-30' AND col1 < '2010-04-10' (note that these last two are both 10-day ranges, but one ends on 09 and the other ends on 10 because January has 31 days, and April has 30) Invalid: col1 >= '2010-01-10' AND col1 < '2010-01-19' col1 >= '2010-01-30' AND col1 < '2010-02-05'

Hours: Valid: col1 >= '2010-01-04 17:00:00' AND col1 < '2010-01-04 18:00:00' col1 >= '2010-01-04 23:00:00' AND col1 < '2010-01-05 00:00:00' col1 >= '2010-01-04 00:00:00' AND col1 < '2010-01-04 05:00:00' col1 >= '2010-01-04 20:00:00' AND col1 < '2010-01-05 06:00:00' Invalid: col1 >= '2010-01-04 20:00:00' AND col1 < '2010-01-05 20:00:00' (note that this one is in fact exactly 24 hours, or one day. But we disallow it because 24 isn't a '125' alignment)

Minutes: Valid: col1 >= '2010-01-04 17:58:00' AND col1 < '2010-01-04 17:59:00' col1 >= '2010-01-04 17:59:00' AND col1 < '2010-01-04 18:00:00' col1 >= '2010-01-04 17:00:00' AND col1 < '2010-01-04 17:50:00' col1 >= '2010-01-04 17:30:00' AND col1 < '2010-01-04 18:20:00' Invalid: col1 >= '2010-01-04 17:30:00' AND col1 < '2010-01-04 18:30:00' (note that this one is in fact exactly one hour, but fails because 60 minutes isn't a '125' alignment)

Seconds: Valid col1 >= '2010-01-04 17:58:12' AND col1 < '2010-01-04 17:58:13' col1 >= '2010-01-04 17:59:59' AND col1 < '2010-01-04 18:00:00' col1 >= '2010-01-04 17:58:00' AND col1 < '2010-01-04 17:58:50' col1 >= '2010-01-04 17:58:30' AND col1 < '2010-01-04 17:59:20' Invalid: col1 >= '2010-01-04 17:58:30' AND col1 < '2010-01-04 17:59:30' (note above invalid even though it is exactly one minute)

Micro- or Nano-seconds: col1 >= '2010-01-04 17:58:12.200000' AND col1 < '2010-01-04 17:58:12.400000' col1 >= '2010-01-04 17:58:12.999999' AND col1 < '2010-01-04 17:58:13.000000'

sebastian commented 8 years ago

Let's do fixed alignment of numbers first, then this. We can use the same strategy as in the numbers case here. Either first enforce the analyst to input the values in a certain way, or alternatively jump the gun directly, and do it behind the scenes.

sebastian commented 8 years ago

You write the following is invalid because it spans a year boundary: col1 >= '2010-06-01' AND col1 < '2011-06-01'. But you also write that each of the components individually follow the 125 alignment? Shouldn't a single year gap then be ok, irrespective of the other components?

Same for months. First you list this as being valid (which crosses year boundary): col1 >= '2010-12-01' AND col1 < '2011-01-01'. Then you list this as invalid (because it crosses year boundary): col1 >= '2010-12-01' AND col1 < '2011-02-01'

Same for days too.

yoid2000 commented 8 years ago

You write the following is invalid because it spans a year boundary: col1 >= '2010-06-01' AND col1 < '2011-06-01'. But you also write that each of the components individually follow the 125 alignment? Shouldn't a single year gap then be ok, irrespective of the other components?

My intent was that any component could individually follow 125 alignment within the bounds of the next higher level component. This is indeed a bit constraining, so we could loosen the constraint if we wish.

Are you suggesting that each component can follow 125 regardless of constraints imposed by the next higher component, or are you just asking for clarification? If the former, then we need to think through what defines the '0' value for each component.

yoid2000 commented 8 years ago

First you list this as being valid (which crosses year boundary): col1 >= '2010-12-01' AND col1 < '2011-01-01'. Then you list this as invalid (because it crosses year boundary): col1 >= '2010-12-01' AND col1 < '2011-02-01'

col1 >= '2010-12-01' AND col1 < '2011-01-01' is the same as col1 >= '2010-12-01' AND col1 <= '2010-12-31', so doesn't cross year boundary.

(The numeric fixed alignment is always >= and <, so I was sticking to this convention.)

sebastian commented 8 years ago

Are you suggesting that each component can follow 125 regardless of constraints imposed by the next higher component, or are you just asking for clarification? If the former, then we need to think through what defines the '0' value for each component.

Clarification (understanding your intent) more than anything.

col1 >= '2010-12-01' AND col1 < '2011-01-01' is the same as col1 >= '2010-12-01' AND col1 <= '2010-12-31', so doesn't cross year boundary.

Well, even so you list it as invalid in your list above? Which confuses me

yoid2000 commented 8 years ago

I list col1 >= '2010-12-01' AND col1 < '2011-01-01' as valid...

Maybe we should talk by phone...

sebastian commented 8 years ago

Invalid because crosses year boundary: col1 >= '2010-12-01' AND col1 < '2011-02-01'

This is copied and pasted from your original issue above.

We could talk on the phone, but we should really make sure it is all well documented here too, so it can be implemented.

yoid2000 commented 8 years ago

col1 >= '2010-12-01' AND col1 < '2011-02-01' is invalid because it goes from the beginning of december to the end of january.

sebastian commented 8 years ago

Which is an interval of two months, meaning 125 aligned. I think we should probably 125 align the interval, rather than the date construct.

i.e. the interval has to span 125 years, 125 months, 125 days?

yoid2000 commented 8 years ago

Which is an interval of two months, meaning 125 aligned.

My intent was to disallow any range that spans the natural boundary of the next higher level component. So far as I know, the issue is correctly documented for that.

I think we should probably 125 align the interval, rather than the date construct.

I would consider a full 125 alignment in addition to the natural date boundaries, but we must allow natural date boundaries. My initial implementation of fixed alignment dates did not have those boundaries, and it was completely unuseable. The whole point of "125" (with numbers, not dates) boundaries is useability. It is natural for an analyst to say 100 <= X < 200. It is not natural to say 1990-01-01 <= X < (whatever 1990-01-01 plus 100 months is)

sebastian commented 8 years ago

Aligning purely within a date-parameter is also everything but easy to use. Neither months (12), nor days (28, 30, 31) align nicely with 125, meaning you are going to have problems making any sensible queries that are about the upper end of a component range.

yoid2000 commented 8 years ago

You are right...I was being dense. If an analyst wants (roughly) 100 months, they would naturally use years not months (1990-01-01 <=X < 2000-01-01).

Ignoring next-higher-level component boundaries, as you suggest, is indeed strictly simpler than enforcing them.

I'll fix the issue description to capture this...

yoid2000 commented 8 years ago

Ok, I updated the issue description, but there is still what might be an unexpected behavior. Namely, the following two queries are (currently) listed as invalid:

col1 >= '2010-01-04 17:00:00' AND col1 < '2010-01-05 17:00:00' col1 >= '2010-01-04 17:58:30' AND col1 < '2010-01-04 17:59:30'

The first one is exactly 24 hours (1 day), and the second on is exactly 60 seconds (1 minute).

The first one is labeled as invalid, however, because we expect "day" units to be from midnight to midnight (00 hour to 00 hour), but this one is from 17 hour to 17 hour). Since this doesn't fall on midnight boundaries, we treat it as hours and not days. But 24 (hours) is not a valid '125' alignment, so it is invalid.

Similarly with the second one, we expect minutes to fall on 00 second boundaries, but these fall on 30 second boundaries. As a result, we treat is as seconds, and 60 is not '125' aligned.

What do you guys think? I think it probably makes more sense to allow the above.

sebastian commented 8 years ago

Puke, for my upcoming suggestion, but maybe:

?

sebastian commented 8 years ago

Or maybe not exactly 1-3-6-12, but some other scheme than 125?

yoid2000 commented 8 years ago

I agree that '125' doesn't help much for datetimes. I wanted to use it to keep things simple for the analyst (i.e. only has one style of alignment to deal with), but maybe we are shooting ourselves in the foot here.

I think that a natural alignment for minutes and seconds would be 1-5-10-15-30

For months 1-3-6-12 works well.

Days are a mess...

yoid2000 commented 8 years ago

Sorry, 1-5-10-15-30 is good for minutes and seconds, not hours and minutes. Previous comment edited as such.

yoid2000 commented 8 years ago

For hours, 1-2-4-8-12 ?

And for minutes and seconds, we should have 2 as well, so 1-2-5-10-15-30...

Let me come up with some examples to see how these feel...

yoid2000 commented 8 years ago

Ok, now I have a better sense of where my intuition of not crossing natural time boundaries comes from.

Speaking of minutes for the time being, it seems to me perfectly natural to snap minutes to 1,2,5,10,15, and 30 minute boundaries. so one could have 2:15 - 2:30, or 4:50 - 5:00, but not 4:15 - 4:25 or 5:55 - 6:10. If we do this way, then we never cross the hour boundary, but that seems fine.

Likewise for hours, we could snap to 1, 2, 4, 8, and 12. So we could have 12:00 - 16:00, or 12:00 - 24:00 (or maybe more like '2016-03-15 12:00:00' - '2016-30-16 00:00:00'), but not '2016-03-15 22:00:00' - '2016-30-16 02:00:00'. Again, there doesn't seem much need to cross day boundaries.

What do you think?

sebastian commented 8 years ago

Again, there doesn't seem much need to cross day boundaries.

I am not sure I agree with this statement. There might be any number of reasons to want to look for entities happening around a minute/day/hour boundary? Or more specifically, I see no reason why the time interval 23:20 - 23:40 is any more interesting than say 23:50 - 00:10 (which snaps to half the interval like in 125)?

I assume/suggest we would do automatic range adjustments and snapping just like we are already doing it for numerical ranges today. The ranges that are being snapped to should therefore seem logical to the analyst, but I don't think it is necessary that they could easily do it themselves? Therefore allowing snapping to half-ranges like with 125, and allowing these to cross boundaries seems OK (and in fact necessary) to me.

yoid2000 commented 8 years ago

I am not sure I agree with this statement. There might be any number of reasons to want to look for entities happening around a minute/day/hour boundary? Or more specifically, I see no reason why the time interval 23:20 - 23:40 is any more interesting than say 23:50 - 00:10 (which snaps to half the interval like in 125)?

If we are snapping minutes to 1,2,5,10,15, and 30, then in any event 23:20 - 23:40 is not allowed.

I don't offhand see any loss of granularity at the hour boundary here, but maybe I'm missing something? Can you give an example?

PF

sebastian commented 8 years ago

Ok, say I want to know how many users log in / submit their homework / visit the doctors office around midnight. A sensible time range might then be, 23:55 - 00:05? Or even 23:50 - 00:10 or 23:45 - 00:15?

yoid2000 commented 8 years ago

Ok, say I want to know how many users log in / submit their homework / visit the doctors office around midnight. A sensible time range might then be, 23:55 - 00:05? Or even 23:50 - 00:10 or 23:45 - 00:15?

Yes, but you can take this argument ad infinitum. If you need offsetting of the range by 1/2, then why not by 1/4, or 1/3, or whatever?

If the analyst really wants 23:55 - 00:05, then he would have to make two queries and add the results.

All I'm saying is that snapping to 1,2,5,10,15,30 is a reasonable compromise between simplicity for the analyst and adequate granularity. If you think we need more granularity, then go ahead. Keep in mind though that the shifting and/or finer granularity increases the number of checks that shrink-and-drop has to make...

sebastian commented 8 years ago

Yes, but you can take this argument ad infinitum. If you need offsetting of the range by 1/2, then why not by 1/4, or 1/3, or whatever?

That is missing the point. The point is that constructing a system where it is impossible to cross the minute/hour/day/year boundary is horribly restrictive. Allowing shifting by 1/2 makes it bearable. Certainly not nice, but bearable.

If we don't, then only way the range 2016-12-31 23:45:00 to 2017-01-01 00:15:00 for calls happening around new years could be automatically aligned by our system would be to the nearest full year?

If the analyst really wants 23:55 - 00:05, then he would have to make two queries and add the results.

@yoid2000, this can never be a valid solution/answer. We should make our analyst lives easier, not a misery.

yoid2000 commented 8 years ago

ok

sebastian commented 8 years ago

Ready for implementation, but please think through alignments before starting implementation

sebastian commented 8 years ago

Also note that the examples at the top, are no longer accurate, given different alignment schemes.

yoid2000 commented 8 years ago

One thing to keep in mind. If and when the Cloak implements now, this will impact the datetime fixed alignment. In particular, you can't specify a time later than now, because repeated queries effectively cause the right-hand time boundary to be incrementally increased. So you need to pick a datetime earlier than now.

sebastian commented 8 years ago

This has become quite complex. Here is an attempt at simplifying the scheme somewhat. It has the problem of being somewhat harder to reason about as an analyst. But since we are internally aligning in the cloak, it might be a decent tradeof, since the cloak implementation at least should be doable.

Here is the concept, for your feedback:

Align on largest interval component

Approach:

@yoid2000, @obrok thoughts?

yoid2000 commented 8 years ago

COuld you give a couple examples to clarify what is going on here?

obrok commented 8 years ago

Do I understand correctly that this would ban intervals that cross a higher-level boundary? So 2015-12-01 - 2016-01-31 would get aligned to 2015-01-01 - 2017-01-01 or something?

sebastian commented 8 years ago

COuld you give a couple examples to clarify what is going on here?

Heheheh – I hoped you wouldn't ask.

Ok, here are two examples. One is slightly more straight forward, and the other more complex.

Do I understand correctly that this would ban intervals that cross a higher-level boundary? So 2015-12-01 - 2016-01-31 would get aligned to 2015-01-01 - 2017-01-01 or something?

No, that wasn't my intention. In your case the highest interval component with a change is month, and the interval spans 2 months. 2015-12 is month 552 since epoch, so we have to range adjust 552 - 554 (again cheating and doing 1-2-5) which already is aligned. Hence the the original range is fine.

yoid2000 commented 8 years ago

Oh, you really meant unix epochs. I really think we should avoid epochs. It is completely confusing for analysts.

I guess what you are trying to do here is to be able to assign reasonable alternate ranges for those cases where an analyst crosses a natural time-unit boundary.

I wonder if the "making change" idea of #766 doesn't offer an alternative. For example, suppose that when an analyst gives a range, if it isn't fixed-aligned, then the cloak finds the smallest set of fixed-aligned ranges that compose, or nearly compose, the requested range.

So, for @obrok 's example, if the analyst says 2015-12-01 - 2016-01-31, the cloak sees that as two fixed-aligned one-month ranges, and then makes one bucket per range.

sebastian commented 8 years ago

Oh, you really meant unix epochs. I really think we should avoid epochs. It is completely confusing for analysts.

unix epochs is just in order to give some sort of fixed grid that can be aligned against. It can be any grid, doesn't have to be unix epoch, but I think it is as good as any? My hope is that the analyst will hardly ever have to think about this. Most sensible ranges will work, and others will be adjusted well enough.

Since we are aligning on the largest component that changed, the actual range being adjusted tends to be quite small (as in the examples) and hence aligns quite neatly onto a grid. As you see, @obrok's example even fit perfectly 👍

Even if we implement #766, it seems we need an internal grid to align to first (despite this being something that the analyst never needs to care about), right? So #766 would in any case be a second step, rather than a first one. That leaves us with the question of how the initial solution should look.

yoid2000 commented 8 years ago

COuld you give a couple examples to clarify what is going on here?

On 10/21/2016 1:42 PM, Sebastian Probst Eide wrote:

This has become quite complex. Here is an attempt at simplifying the scheme somewhat. It has the problem of being somewhat harder to reason about as an analyst. But since we are internally aligning in the cloak, it might be a decent tradeof, since the cloak implementation at least should be doable.

Here is the concept, for your feedback:

Align on largest interval component

Approach:

  • always work on the interval between the two dates
  • take largest component containing a change (be it year, month, day, etc...)
  • fix align that component on a grid of that unit since epoch, using the schema: o year: 1-2-5 o month: 1-2-6 o day: 1-2-5 o hour: 1-2-6-12 o minute: 1-2-5-15-30 o second: 1-2-5-15-30 o less than a second: 1-2-5
  • convert back to dates

@yoid2000 https://github.com/yoid2000, @obrok https://github.com/obrok thoughts?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Aircloak/aircloak/issues/681#issuecomment-255359905, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qTZ_EdhA53pP6SkuaDk97YXiI8E9ks5q2KUkgaJpZM4KIcZi.

sebastian commented 8 years ago

@yoid2000 is this an unintentional repost? Do you need examples going beyond those posted above?

yoid2000 commented 8 years ago

@sebastian yes, unintentional...