GlobalFishingWatch / vessel-scoring

Apache License 2.0

Specify fishing in terms of time ranges #63

Closed by bitsofbits 7 years ago

bitsofbits commented 7 years ago

@redhog, @seacourtaw, and I were talking about ways to make training data more flexible, and we decided to extract the fishing classification information we have in terms of time ranges: specifically, CSV files with mmsi, start_time, end_time, and is_fishing fields. For example:

mmsi,start_time,end_time,is_fishing
224068000,2012-06-01T00:16:37+00:00,2012-06-09T03:36:35+00:00,1
224068000,2012-06-09T05:58:55+00:00,2012-07-08T14:42:56+00:00,0
224068000,2012-07-08T15:49:55+00:00,2012-07-13T16:10:37+00:00,1
224068000,2012-07-14T00:21:05+00:00,2012-07-14T12:02:37+00:00,0
224068000,2012-07-14T13:39:57+00:00,2012-07-15T15:59:35+00:00,1
224068000,2012-07-16T01:35:37+00:00,2012-07-16T15:12:17+00:00,0
224068000,2012-07-17T00:58:13+00:00,2012-08-20T13:12:26+00:00,1
224068000,2012-08-20T13:13:56+00:00,2012-09-06T04:18:44+00:00,1
224068000,2012-09-06T12:49:44+00:00,2012-09-10T09:32:45+00:00,0

I started roughing out the implementation of this in https://github.com/GlobalFishingWatch/vessel-scoring/tree/convert-fishing-data-to-ranges, although eventually the code will likely live somewhere else.
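As a rough illustration only (this is not the code on that branch), collapsing per-point labels into rows like the ones above could look something like the sketch below. It assumes the input is a list of `(mmsi, timestamp, is_fishing)` tuples with timezone-aware timestamps; the function names `points_to_ranges` and `ranges_to_csv` are hypothetical. Each run of consecutive points with the same label becomes one range; the real conversion may split ranges on other criteria as well.

```python
import csv
import io
from datetime import datetime, timezone
from itertools import groupby


def points_to_ranges(points):
    """Collapse (mmsi, timestamp, is_fishing) points into time ranges.

    Hypothetical helper: each run of consecutive points for one vessel
    that share the same label becomes (mmsi, start, end, is_fishing).
    """
    # Sort by vessel, then time, so that runs are contiguous in time.
    ordered = sorted(points, key=lambda p: (p[0], p[1]))
    ranges = []
    for (mmsi, is_fishing), run in groupby(ordered, key=lambda p: (p[0], p[2])):
        run = list(run)
        ranges.append((mmsi, run[0][1], run[-1][1], is_fishing))
    return ranges


def ranges_to_csv(ranges):
    """Serialize ranges using the column layout shown in this issue."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["mmsi", "start_time", "end_time", "is_fishing"])
    for mmsi, start, end, is_fishing in ranges:
        writer.writerow([mmsi, start.isoformat(), end.isoformat(), is_fishing])
    return out.getvalue()
```

Sorting before grouping means the caller does not need to pre-sort the points, at the cost of holding them all in memory.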

This format has several advantages:

redhog commented 7 years ago

If we publish this, we can't publish anonymized classified data afterwards, as the two would be trivial to correlate...

bitsofbits commented 7 years ago

@redhog, that's a good point. ☹️ I guess we need to anonymize this as well before we publish it publicly.

bitsofbits commented 7 years ago

@redhog, @seacourtaw: thinking about this a bit more, I think it should be sufficient to just fuzz the ends of the ranges. What I've implemented puts the starting edge of each range between the last point before the range and the first point of the range (but never more than 10 minutes from the start of the range). The end of each range is handled similarly. I think that this fuzzing, plus the fact that the ranges lump all contiguous points together, should make correlating the ranges to points hard enough that we shouldn't have to worry about it.

I have an implementation sketched out here: https://github.com/GlobalFishingWatch/mussidae/pull/4
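For readers who want a concrete picture, the fuzzing described above might look roughly like the sketch below. This is not the code in the linked pull request. The names `fuzz_start`, `fuzz_end`, and `MAX_FUZZ` are hypothetical, and drawing the edge uniformly at random within the allowed window is an assumption, since the comment does not say how the position inside the window is chosen.

```python
import random
from datetime import datetime, timedelta, timezone

# The 10-minute cap comes from the comment above.
MAX_FUZZ = timedelta(minutes=10)


def fuzz_start(prev_point, first_point, rng=random):
    """Move a range's start earlier, toward the previous point.

    prev_point is the timestamp of the last point before the range,
    or None if there is none. The result is never more than MAX_FUZZ
    before first_point. Uniform sampling is an assumption.
    """
    earliest = first_point - MAX_FUZZ
    if prev_point is not None:
        earliest = max(prev_point, earliest)
    window = (first_point - earliest).total_seconds()
    return first_point - timedelta(seconds=rng.uniform(0, window))


def fuzz_end(last_point, next_point, rng=random):
    """Move a range's end later, toward the next point (same rules)."""
    latest = last_point + MAX_FUZZ
    if next_point is not None:
        latest = min(next_point, latest)
    window = (latest - last_point).total_seconds()
    return last_point + timedelta(seconds=rng.uniform(0, window))
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which is convenient for testing but would be undesirable if the seed were published alongside the data.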