Closed morganherlocker closed 5 years ago
That this kind of analysis can be made based on this data is not new to me (I did a similar analysis last year on comparable data: https://www.youtube.com/watch?v=MVqJtJA6_wg). Can you please clarify why you think this is a big vulnerability, and for whom? I can come up with some scenarios, but in my opinion the risk is minor. If you indicate the scenarios you are worried about, we can have a more in-depth discussion.
Please note that this change is breaking implementations based on previous versions of GBFS. Another problem is that the GPS location sometimes drifts, so without bike_id it is no longer possible to determine how long bikes / scooters are parked at a certain location (a use case relied on by multiple municipalities around the world). Also, the OD matrix you derive is not 100% accurate; relocations are also counted as trips, for example.
GBFS is an excellent dataset for municipalities to monitor and control micromobility operators (and this time municipalities are early in the game compared with the earlier disruptions started by Uber and AirBnB for example). I am a bit worried that privacy is used as an easy argument to not share the data, even if the risks are relatively small (the operators themselves have much more privacy sensitive data).
Thanks for the questions and feedback @sven4all. 🙇
GBFS is an excellent dataset for municipalities to monitor and control micromobility operators (and this time municipalities are early in the game compared with the earlier disruptions started by Uber and AirBnB for example). I am a bit worried that privacy is used as an easy argument to not share the data, even if the risks are relatively small (the operators themselves have much more privacy sensitive data).
100% with you on this. CITIES NEED THIS DATA! Legitimate researchers and planners need this data. Regulators need this data. These institutions have legal and ethical guidelines for handling sensitive information designed to prevent abuse. The issue is when individual behavior is exposed to the entire internet, in which case the PII should be obscured or the feed should require authentication. MDS and other feeds containing sensitive data address this using authentication, and it works great. Using authentication, MDS is able to send even more granular telemetry to cities, allowing them to regulate more effectively, and I am in full support of this practice.
Similar historic datasets have been used to track celebrities, and the real time nature of this feed enables more dangerous stalking scenarios. For everyday people, the possibilities are similar for domestic abuse, evidenced by the alarming ecosystem of covert apps designed to track the movements of domestic partners. This is a big enough issue that national organizations have emerged to combat domestic abuse & stalking using location and other forms of telemetry through advocacy and education, including the National Network to End Domestic Violence.
Vulnerable populations can be targeted by focusing on specific points of interest, as well as other patterns in the data. Muslim drivers were identified en masse when a large taxi dataset that tried to obscure the vehicle id accidentally made the hash reversible. Using points of interest, an anti-abortion group could generate lists of residential addresses that linked to a Planned Parenthood, then cheaply target these addresses with harassment (automated mailers, door to door, etc.). Similar types of harassment could be used against people attending immigration lawyers, medical facilities, political gatherings, methadone clinics, protests, synagogues, or mosques. Researchers have successfully re-identified individuals traveling to strip clubs, opening up the possibility for extortion across wide populations, not to mention the obvious risks to employees leaving the establishment with their destination revealed to patrons in real time. Perhaps the most likely scenario, businesses could use this information to send marketing materials to addresses of vulnerable people, such as loan ads & scams to people visiting payday lenders. Origin destination data is already used by numerous tech marketing firms for ad targeting and psychographic modeling. This data is in such high demand, that a secondary market of fraudulent trip data has even emerged. Adtech companies want this data and are scrambling to acquire it just about any way they can.
I'll note that I intentionally did not attempt to demonstrate any re-id exploits in this issue, since these techniques are well established elsewhere. What I did was show that trip reconstruction creates the same type of data that can be used for the purposes I have referenced, all of which have already been demonstrated with similar exposed datasets. I have personally seen non-disclosed successful re-ids with similar datasets, and it is particularly easy in places like Beverly Hills, full of spread out houses, interesting people, and public websites with name-address-linked data. A lot of the existing re-id research has focused on NYC, largely because it was the simplest open trip dataset to acquire as of 2014-2016 (it was also patched summer of '16), but NYC is more difficult than most places for re-id. Obviously, a few insiders (anyone reading this) know about the GBFS issue already, but it is likely to attract substantial interest over time if left unpatched.
Please note that this change is breaking implementations based on previous versions of GBFS.
I have seen others propose a rotating ID. Some providers are already rotating IDs in practice. I will write a second PR that implements rotating IDs per status change, so that vehicles can be stably identified while stationary. This would make the change non-breaking, and could be implemented asynchronously by each provider as they are made aware of the change. Great feedback @sven4all.
As noted, I have opened an additional PR which keeps bike_id, but specifies that IDs should be rotated after each trip. This is a non-breaking change.
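To make the rotating-ID idea concrete, here is a minimal sketch of how a provider might keep a stable internal identifier while publishing a bike_id that changes after each trip. The class name RotatingIdRegistry, its methods, and the internal ID "scooter-0042" are hypothetical and not part of GBFS or any provider's implementation. The public ID is drawn at random rather than derived from the internal one, so it cannot be reversed the way a weak hash could.

```python
import secrets
from typing import Dict


class RotatingIdRegistry:
    """Hypothetical mapping from a provider's internal vehicle ID to a public bike_id."""

    def __init__(self) -> None:
        self._public: Dict[str, str] = {}

    def public_id(self, internal_id: str) -> str:
        # Stable while the vehicle is parked, so parking duration can still be measured.
        if internal_id not in self._public:
            self._public[internal_id] = secrets.token_hex(16)
        return self._public[internal_id]

    def end_trip(self, internal_id: str) -> str:
        # Rotate on trip end: the vehicle reappears in the feed under a fresh random ID,
        # so the pre-trip and post-trip records no longer share a key.
        self._public[internal_id] = secrets.token_hex(16)
        return self._public[internal_id]


registry = RotatingIdRegistry()
before = registry.public_id("scooter-0042")   # published while parked
still = registry.public_id("scooter-0042")    # unchanged on the next poll
after = registry.end_trip("scooter-0042")     # new ID once the trip ends
```

Consumers that only count vehicles or measure how long a vehicle stays parked would be unaffected under this sketch, since the ID only changes across a trip.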
Hi @morganherlocker, thanks for cross-posting this on my related issue (#131), but I have to say that I strongly disagree with trip-reconstruction being a vulnerability.
Like you, I have also played with it myself (link), and I think it will still be fairly easy to do even if we remove bike_ids - see here.
This seems to me like a very weak way of stalking individuals and vulnerable populations. In my mind, this kind of tracking is very circumstantial and parallels to knowing someone that was in a certain neighborhood took a bus that might have stopped next to an abortion clinic.
I'm with @sven4all on this, and would even argue that we should make bike_id static for at least 24 hours. In my mind, our main goal is to help individuals and govs make efficient decisions about bike sharing and help save the planet (one step at a time 😃 )
Also related to #147
Hey @idoco, thanks for the feedback. 🙇
Like you, I have also played with it myself (link), and I think it will still be fairly easy to do even if we remove bike_ids - see here.
This is a good observation. I do think the risk for this type of strategy is significantly lower than deterministic linkage via bike_id. Since we do have a deterministic linked dataset as ground truth 😄, it is possible to measure the effectiveness of this probabilistic model, if needed.
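As a rough sketch of what that measurement could look like: treat each inferred trip as a (departure event, reappearance event) pair, and compare the pairs produced by an ID-free probabilistic matcher against the pairs obtained from bike_id linkage. The function name, the event labels, and the toy data below are all assumptions for illustration; no real feed data is involved.

```python
from typing import Hashable, Set, Tuple

Link = Tuple[Hashable, Hashable]  # (departure event, reappearance event)


def precision_recall(predicted: Set[Link], truth: Set[Link]) -> Tuple[float, float]:
    """Score predicted links against ground-truth links derived from bike_id."""
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Toy example: four true trips; the ID-free matcher swaps two of the destinations.
truth = {("d1", "r1"), ("d2", "r2"), ("d3", "r3"), ("d4", "r4")}
predicted = {("d1", "r1"), ("d2", "r3"), ("d3", "r2"), ("d4", "r4")}
precision, recall = precision_recall(predicted, truth)
```

In this toy case, half of the predicted links are correct and half of the true trips are recovered. Running the same comparison on real feeds would put a number on how much removing or rotating bike_id actually reduces linkage.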
In my mind, this kind of tracking is very circumstantial and parallels to knowing someone that was in a certain neighborhood took a bus that might have stopped next to an abortion clinic.
A big difference here is that dockless vehicles do not follow a fixed route, shared by multiple people. Watching a trip start right on the steps of a school and end right on the steps of a Planned Parenthood is a totally different level of risk than a bus route that happened to go by both. Solid re-id with this type of data was already proven publicly with the 2014 NYC taxi dataset and this data is just like that but live. A factor to consider here is user consent and expectations of privacy. I don't think 99.9% of people assume that their precise travel will be live public information when they enter one of these vehicles, even if they do assume the provider and city will securely archive where they went. As noted, I agree that cities & researchers need this information, but they already will get this information over authenticated channels like MDS.
I have to say that I strongly disagree with trip-reconstruction being a vulnerability.
If full resolution trip ODs were exposed to the internet by a major mapping provider, like Google Maps, it's hard for me to imagine most users seeing that as anything but a huge breach of trust and risk to their safety. At most established tech companies, even querying this data at a trip level without good reason would be a serious breach of protocol. I can only speak for myself, but I don't want the trips from my house to my doctor to be public because I decided to use a particular app to get there. First and foremost, we should consider how a diverse population of users would like this data to be handled.
Morgan's work clearly demonstrates that monitoring GBFS feeds allows you to trivially reconstruct origin-destination pairs for a trip. If you said to an average citizen: "the origins and destinations of every bike share trip are being published, in real-time", do you think that would match their expectations?
I think this issue deserves more scrutiny. GBFS feeds are often public, which means we should have a very high design standard for privacy. We should be concerned with both real and imaginable attacks, even somewhat contrived ones. Not only will that future-proof GBFS from more and more sophisticated attacks, but it will also prevent a situation where people don't want to use bikeshare due to a perceived privacy problem.
The issue of vehicle IDs (at the moment bike_id) being in the feed does raise some questions about data privacy, due to the potential ability to reconstruct trips with the data. (Although one would have to go many steps further to re-identify individuals, and there is a significantly lower likelihood of re-identification with dockless mobility data... as opposed to with ride-hailing or taxi trip data which have a higher likelihood of starting/ending at one's home).
However, removing vehicle IDs entirely also makes the data feed far less usable for cities that are holding mobility operators accountable with dockless mobility policies that may involve vehicle counts or parking durations. One potential solution to this issue is to develop data classification policy which could outline: 1) different categories of data (and potentially combinations), and specifically, those that should be classified as sensitive; and 2) guidance on how sensitive data should be treated; e.g. perhaps it should not be available in a public data stream that literally anyone can access, and may have guidelines attached to its storage, security, and use.
In this way, cities and operators could find ways to ensure that cities continue to access data that they need to effectively work with operators, while ensuring that user privacy is protected.
Although one would have to go many steps further to re-identify individuals, and there is a significantly lower likelihood of re-identification with dockless mobility data... as opposed to with ride-hailing or taxi trip data which have a higher likelihood of starting/ending at one's home
Re-identification and sensitive trip identification occur at a lower rate with micromobility than with cars, but it is critical to note that this does not mean they are not happening. The volume of data here is enormous, and we are talking about percentages of identification per mode rather than a binary. Individual trip data should always be treated as sensitive, regardless of the context, unless that context is explicit informed consent to be public. Since posting this disclosure, I have been able to scrape roughly 170 million trips around the world, in real time. Within this data are likely hundreds of thousands of trips that reveal sensitive information about individuals that the users absolutely would not want to be public. Micromobility is only marginally safer than cars when it comes to privacy, and the difference is not significant enough to treat it differently, considering the effort to extract is identical with the use of automated analysis.
With NABSA's current project focused on clarifying and enhancing GBFS, we've found a need to emphasize in the documentation that GBFS' purpose to provide "real-time data feeds in a uniform format publicly available online" means that "information that is potentially personally identifiable is not currently and will not become part of the core specification." (See #171)
Alongside that added emphasis in the documentation, we'd like to get versioning off the ground so that we can make progress on this issue.
PR #147 addresses this issue and represents a heap of input from GBFS consumers and producers. We're hoping we can solidify support around it as the next step forward. It would be great to get all of your input over there on the solution that has emerged so far as a "minimum viable proposal".
That said, I'd like to propose we close this issue and move relevant conversation there. @morganherlocker @sven4all @taraniduncan @menottim @lobenichou @DanHanf @flibbertigibbet @idoco @wiseman @jlev @mrjameshsu @rolyatmax @karussell @quicklywilliam @mobilitygirl
GBFS Dockless Trip Reconstruction
Vulnerability
Public GBFS dockless feeds are publishing live status data that can be used to derive full resolution origin destination data in real time. This vulnerability can be fixed with a simple change to GBFS, described below.
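To illustrate the mechanism on synthetic data only, here is a minimal sketch, not the source code referenced in this issue. It assumes each poll of a feed has already been reduced to a dict mapping bike_id to a (lat, lon) pair; the names derive_trips, distance_m, and min_move_m are hypothetical. A vehicle that reappears under the same bike_id at a clearly different location yields an origin-destination pair.

```python
import math
from typing import Dict, Iterable, List, NamedTuple, Tuple

Position = Tuple[float, float]  # (lat, lon) in degrees


class Trip(NamedTuple):
    bike_id: str
    origin: Position
    destination: Position
    departed_after: int  # timestamp of the last snapshot listing the vehicle at origin
    arrived_by: int      # timestamp of the first snapshot listing it at destination


def distance_m(a: Position, b: Position) -> float:
    """Approximate ground distance in meters (equirectangular; fine for short hops)."""
    mean_lat = math.radians((a[0] + b[0]) / 2)
    dx = math.radians(b[1] - a[1]) * math.cos(mean_lat)
    dy = math.radians(b[0] - a[0])
    return 6_371_000 * math.hypot(dx, dy)


def derive_trips(snapshots: Iterable[Tuple[int, Dict[str, Position]]],
                 min_move_m: float = 100.0) -> List[Trip]:
    """Link consecutive sightings of the same bike_id into origin-destination pairs."""
    last_seen: Dict[str, Tuple[int, Position]] = {}
    trips: List[Trip] = []
    for ts, bikes in snapshots:
        for bike_id, pos in bikes.items():
            prev = last_seen.get(bike_id)
            # min_move_m filters out GPS drift; operator relocations still look like trips.
            if prev is not None and distance_m(prev[1], pos) >= min_move_m:
                trips.append(Trip(bike_id, prev[1], pos, prev[0], ts))
            last_seen[bike_id] = (ts, pos)
    return trips


# Synthetic snapshots, one per minute; "a" leaves the feed while rented, then reappears.
snapshots = [
    (0,   {"a": (38.9000, -77.0300), "b": (38.9100, -77.0400)}),
    (60,  {"b": (38.9100, -77.0400)}),
    (120, {"a": (38.9200, -77.0300), "b": (38.91001, -77.0400)}),
]
trips = derive_trips(snapshots)
```

In this toy run, only vehicle "a" produces a trip; the roughly one meter of drift on "b" stays below the threshold. The sketch depends entirely on bike_id staying constant across the trip.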
Solution
The data delivered by dockless GBFS is not intended to expose trips, but trips are trivially derived by storing state while periodically pinging for vehicle status. Removing the bike_id attribute from these endpoints makes it significantly more difficult to derive origin destination pairs. I recommend removing it entirely from free_bike_status feeds, as implemented in this PR.
Parties
The feeds affected are all commercial providers, as far as I am aware. In many cases, these providers are legally compelled by the respective regulatory authority to publish a GBFS compatible feed. A new version of GBFS that removes bike_id or makes bike_id optional for free_bike_status feeds would allow commercial providers to easily patch the vulnerability without violating city regulations.
Scope
This vulnerability affects any public GBFS feed implementing free_bike_status with dockless data. I am currently tracking live telemetry from hundreds of cities around the world from dozens of feeds. While GBFS is designed for bikes, note that GBFS dockless has been publicly implemented across bikes, scooters, and cars.
Demonstration
Linking trips for a few hours, I was able to derive full resolution data across over 50k unique vehicles and over 1 million estimated individual riders, mapped in real time. The data is live & multi-modal, with 1 minute temporal resolution and approximately 1-5 meters spatial resolution. I have provided the following static maps to demonstrate the vulnerability across a few cities, but the problem is global in nature, and not specific to a city, provider, or mode of travel. I have also included public source code for full transparency and verification.
IMPORTANT! - Location names provided for context, not as an assertion of legal jurisdiction!
Washington DC - Arlington, VA
Hollywood, CA - Downtown Los Angeles, CA - Silverlake, CA
Rockaway, Queens, NYC, NY
Seattle, WA - Bellevue, WA
Beverly Hills, CA - Santa Monica, CA
Louisville, KY
San Diego, CA