MobilityData / gbfs

Documentation for the General Bikeshare Feed Specification, a standardized data feed for shared mobility system availability. Maintained by MobilityData
https://gbfs.org

Suggestion regarding free_bike_status and bike_id #131

Closed idoco closed 4 years ago

idoco commented 5 years ago

The current GBFS spec states that free_bike_status/bike_id is the "Unique identifier of a bike."

IMHO, it makes sense to define the minimum duration for which these identifiers must remain stable (e.g. one service day), so that they can still be considered unique identifiers.

Rotating bike_ids every couple of minutes, as I noticed some providers do, makes it harder to track bikes and develop meaningful integrations.

Would love to hear your thoughts about this 🤔

mplsmitch commented 5 years ago

I expect that, inasmuch as providers are changing bike_ids, it's to prevent the tracking of bikes, keep the data from becoming personally identifiable, and protect user privacy.

idoco commented 5 years ago

How can this information be used to track users?

How can a malicious actor know which user reserved the bike?

jcn commented 5 years ago

@idoco Which providers are doing this now? This is technically not-to-spec so I'd love to hear from any provider that is actually doing this to try to understand the use case here (and why this outweighs the intent of the spec).

mplsmitch commented 5 years ago

@idoco short answer is that with as few as 4 geolocations, users can be identified with >90% accuracy. More on that here: http://news.mit.edu/2013/how-hard-it-de-anonymize-cellphone-data

jcn commented 5 years ago

@mplsmitch But since riders aren't associated with specific bikes at any given time, we're talking about the movement of the bike, not of an actual human, right? The only way I could see this being tied to a person is if ridership and quantity of bikes were so low that the same bike was moving back and forth between locations, being ridden only by a few people.

It's entirely possible I'm forgetting something though.

idoco commented 5 years ago

Thanks, @jcn I noticed this with Bird - https://mds.bird.co/gbfs/louisville/free_bikes (Though I don't know their official status with gbfs).

After some further research, I learned that this has something to do with charger abuse issues. In Bird's case, if you know the code of the bike, you will be able to "capture" it without physically obtaining it, by spoofing your GPS location.

If bike_ids are unique forever, then over time you can assemble a dictionary of id-to-code, and later identify a remote Bird code just by looking at its bike_id.

Bird have resorted to rotating the bike_id in very short intervals (after every rental), but I would imagine that rotating them once a day should be enough.

Would love to hear the company's feedback about this, and if you have any follow-up question, I'd be happy to research more.


Thanks @mplsmitch, I'm with @jcn on this. This doesn't look like a substantial risk to me. IMHO, access to things like cell tower data opens up many easier options to track users.

mplsmitch commented 5 years ago

@jcn in the example I linked to, they were able to use unidentified cell data along with public records and connect them to individuals. Whether or not that's possible with dockless vehicle coords remains to be seen. All I can tell you is that there's been some industry concern and discussion on the topic. Nobody seems to know where it will go, but they're generally cautious after seeing Zuckerberg talk to Congress.

barbeau commented 5 years ago

> If bike_ids are unique forever, then over time you can assemble a dictionary of id-to-code, and later identify a remote Bird code just by looking at its bike_id.

The right way to fix this would be to rotate the codes after every rental, but I'm guessing it's a physical lock and they can't remotely do this?

fruminator commented 5 years ago

If the vehicle ID is consistent over time, then by watching GBFS you can find the exact lat/lon start and end of each ride. So, what happens when the ride that ended at the destination of interest (e.g. mosque, Planned Parenthood, competing employer, etc.) started in front of your house?

I agree with @barbeau's approach: rotate the code after the ride starts, so that by the time the vehicle shows up again in the feed after the ride ends, it has a different ID.

idoco commented 5 years ago

@fruminator, you raise an interesting point, but rotating bike_ids won't ensure user privacy.

Because bike_ids are rotating after every ride, I found a simple way to track bike trips:

  1. Collect bike locations at a 1-minute interval and track bike_id changes (I call each collection a step).
  2. At step 1 you have coordinates for bike_ids [a, b, c].
  3. At step 2 you see locations only for bike_ids [a, b]. This means that bike c is on the move.
  4. At step 3 you see locations for bike_ids [a, b, d]. This means that bike c's id was rotated to d and we know its new location.

(There are more tricks involved in the actual implementation, but this is it at a high level.)
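The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation mentioned in the comment: the names `diff_snapshots` and `infer_trips` are hypothetical, each snapshot is assumed to be a plain mapping of bike_id to (lat, lon) taken from one poll of free_bike_status, and only the unambiguous case (exactly one bike in flight and exactly one new id) is paired.

```python
from typing import Dict, List, Tuple

Position = Tuple[float, float]   # (lat, lon)
Snapshot = Dict[str, Position]   # bike_id -> position at one poll ("step")


def diff_snapshots(prev: Snapshot, curr: Snapshot) -> Tuple[List[str], List[str]]:
    """Return (disappeared, appeared) bike_ids between two consecutive polls."""
    disappeared = sorted(set(prev) - set(curr))
    appeared = sorted(set(curr) - set(prev))
    return disappeared, appeared


def infer_trips(snapshots: List[Snapshot]) -> List[Tuple[str, str, Position, Position]]:
    """Pair a vanished id with a newly seen id when the pairing is unambiguous.

    Returns tuples of (old_id, new_id, start_position, end_position).
    """
    pending: Dict[str, Position] = {}  # ids that vanished -> last known position
    trips = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        disappeared, appeared = diff_snapshots(prev, curr)
        for bike_id in disappeared:
            pending[bike_id] = prev[bike_id]
        if not appeared:
            continue
        if len(pending) == 1 and len(appeared) == 1:
            # Exactly one bike in flight and one new id: assume a rotation.
            old_id, start = pending.popitem()
            new_id = appeared[0]
            trips.append((old_id, new_id, start, curr[new_id]))
        else:
            # Ambiguous (overlapping rides or newly deployed bikes): give up
            # on the pending ids rather than guess.
            pending.clear()
    return trips


steps = [
    {"a": (1.0, 1.0), "b": (2.0, 2.0), "c": (3.0, 3.0)},
    {"a": (1.0, 1.0), "b": (2.0, 2.0)},
    {"a": (1.0, 1.0), "b": (2.0, 2.0), "d": (4.0, 4.0)},
]
trips = infer_trips(steps)
```

With the three steps from the list, `trips` holds a single entry linking c to d. Overlapping rides are deliberately dropped here, which is the case discussed further down the thread.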

I think that the only real way to protect user privacy is to trim the lat/lng data to some extent. Maybe the best privacy we can achieve would be comparable to watching someone get on a bus that stops near a Planned Parenthood clinic, without knowing whether they actually go there.
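As a rough sketch of what trimming could look like on the producer side: the helper name `coarsen_bike` and the choice of 3 decimal places are assumptions for illustration, not something the spec prescribes. Rounding to 3 decimals corresponds to a grid of roughly 100 m in latitude.

```python
def coarsen_bike(bike: dict, decimals: int = 3) -> dict:
    """Return a copy of a free_bike_status entry with lat/lon rounded.

    `decimals` is an illustrative parameter; 3 decimals is about 100 m
    of latitude, so the exact parking spot is no longer published.
    """
    coarse = dict(bike)  # shallow copy, the input entry is left untouched
    coarse["lat"] = round(bike["lat"], decimals)
    coarse["lon"] = round(bike["lon"], decimals)
    return coarse


entry = {
    "bike_id": "abc123",
    "lat": 38.25266,
    "lon": -85.75845,
    "is_reserved": 0,
    "is_disabled": 0,
}
published = coarsen_bike(entry)
```

The trade-off is that riders looking for a vehicle get a less precise location too, so the precision would need to be chosen with that in mind.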

Would love to hear your thoughts about this.


@barbeau, yes that would make sense, but Bird codes are physically printed on the scooter handle for users to scan them, so they can't be rotated.


fruminator commented 5 years ago

I don't think what you described is representative of what will be happening most of the time, since in most cases a given provider in a given city will have more than one rental happening at a given time. But your point does expose probabilistic situations (looking over space and time) where what I described wouldn't protect all info. Good point.

idoco commented 5 years ago

Thanks @fruminator, I find this to be a very interesting problem.

At the end/start of the day, there are many rides that do not overlap with other rides. If the problem we are talking about is a real concern, riders going to sensitive places should be warned not to ride during off-peak hours.

In addition, there are many ways to greatly improve the probability of identifying parallel rides by combining metadata such as the distance between the start/end points, the duration of the ride, and the battery level at the end of the ride.
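A hedged sketch of how such metadata could narrow down candidates when rides overlap. Everything here is assumed for illustration: the `Event` record, the `battery` field (not every feed exposes one), and the 7 m/s speed cap are hypothetical choices, and a real analysis would score candidates rather than just filter them.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    bike_id: str
    lat: float
    lon: float
    t: float        # seconds since some reference time
    battery: float  # percent, hypothetical field


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def plausible(start: Event, end: Event, max_speed_mps: float = 7.0) -> bool:
    """True if `end` could be the same vehicle as `start` after one ride."""
    dt = end.t - start.t
    if dt <= 0:
        return False  # the ride cannot end before it starts
    if end.battery > start.battery:
        return False  # assume no charging happens during a ride
    speed = haversine_m(start.lat, start.lon, end.lat, end.lon) / dt
    return speed <= max_speed_mps


def candidate_matches(starts: List[Event], ends: List[Event]) -> Dict[str, List[str]]:
    """For each reappearing id, list the vanished ids it could correspond to."""
    return {e.bike_id: [s.bike_id for s in starts if plausible(s, e)] for e in ends}


starts = [
    Event("s1", 0.0, 0.0, t=0.0, battery=80.0),
    Event("s2", 0.0, 1.0, t=0.0, battery=50.0),
]
ends = [Event("e", 0.0, 0.01, t=600.0, battery=70.0)]
matches = candidate_matches(starts, ends)
```

In this toy input only `s1` remains a candidate for `e`, since `s2` is too far away to have covered the distance in ten minutes and has a lower battery level than the reappearing vehicle.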

It might be really interesting to get some tagged ride data, strip the bike codes from it, and run this analysis to try to recreate the rides.

morganherlocker commented 5 years ago

Crossposting for visibility, I have opened a related proposal to remove bike_id, along with a demonstration of risk. Happy to have 👀 and feedback on the new issue. 🙇

https://github.com/NABSA/gbfs/issues/146

heidiguenin commented 4 years ago

After discussing bike_id use cases and concerns at a GBFS developers' workshop last week (see a summary here), the consensus of consumers and producers in the room was to support PR #147 alongside the future provision of best practices around bike_id rotation. We're hoping we can solidify support around it as the next step forward. It would be great to get all of your input over there on the solution that has emerged so far as a "minimum viable proposal".

That said, I'd like to propose we close this issue and move relevant conversation there.

@idoco @mplsmitch @jcn @morganherlocker @sven4all @fruminator @barbeau