Disco API is tracking users (IP addresses)

joschi commented 3 years ago

As of 2021-03-20, the foojay Disco API is tracking the IP addresses of all users downloading a JDK: https://github.com/foojay2020/discoapi/blob/a2ba00cf9f9f44be3857e528c56de3cdadeb8453/src/main/java/io/foojay/api/DownloadManager.java#L68-L74

There is no reason to do this and it's not transparently documented that this is happening.

Additionally, there's no information about where the foojay Disco API is being hosted, who has access to the collected data, and what the collected data is being used for.

From a user's perspective, it would also be great if the download URLs for the JDKs would be directly available without another request to the Disco API with an ephemeral ID.

HanSolo commented 3 years ago

Yes, we use the IP address to geolocate the request. This is also the reason for the second call using the ephemeral ID. Because we only provide the links to the downloads, we need a way to figure out how and from where the API is used. The API is hosted on AWS, and at the moment only we have access to the data. Because we cannot track whether a JDK was really downloaded, only whether the link was requested, the question is how useful this data is at all. If we get useful data out of these statistics, we might make it available to the public; if not, we might drop it completely. For the moment we will continue geolocating the requests.

giltene commented 3 years ago

Let’s drop the IP logging, since it seems to cause concern and friction. As hosting gets spread to additional regions and clouds, we might instead watch the overall load stats in each region to see how to balance them, I guess.

HanSolo commented 3 years ago

After discussing that point internally, we decided to drop IP address tracking. The new version without tracking will go to production ASAP.

geertjanw commented 3 years ago

It's kind of a pity because now we're not able to provide the service of showing where requests are coming from per OpenJDK distribution. However, @giltene's suggestion makes sense, and that kind of data can be collected in other ways, if needed and if there's interest in it, e.g., based on hosting location rather than individual IP location.

giltene commented 3 years ago

Yeh, there is/was no nefarious intent here, for sure, and rather than argue the technical merits and community value of the information, it’s simpler to just remove the stuff that seems to worry folks in that regard. Let GitHub, Azure, and AWS keep their own IP tracking data to themselves...

giltene commented 3 years ago

@joschi the ephemeral ID step is intentionally in place to prevent caching layers from expecting distro URI lists to remain the same for any non-ephemeral length of time. This choice is based on painful and hard-earned experience with how doing otherwise (and enabling/encouraging caching at various levels) leads to nightmare scenarios when updates are rolled out and (empirically, as has happened in various distros and even to the upstream GA in the past couple of years) need to be patched/replaced/build-revved/rolled-back on some emergency basis. The ability for a distro to invalidate prior URIs is key to avoiding hours (or days) of broken infrastructure downstream when mistakes or post-publication critical bugs are found. The discipline of using ephemeral IDs and making sure they break promptly every time helps make sure that whatever changes a distro makes are promptly reflected, by preventing misguided caching of result sets. With ephemeral IDs, using cached result sets would break every time (and in typical testing or use attempts) rather than just in the rare cases where the URIs actually ended up being changed by a distro (which is when this really matters).

Of course, there is nothing to prevent an API user from drilling into each ephemeral ID separately, extracting its current mapping to a URI, and caching that result to avoid future API calls. But the complexity and the obvious counter-to-intention nature of doing that will hopefully make it clear to the user that doing so carries some “you’d better know exactly what you’re doing” responsibility, and “you should suspect that doing this may cause issues”. E.g. doing this and caching stuff for 3 minutes to reduce out-calls in a very hot loop is “maybe ok”. Doing the same and caching for 1-2 days is definitely not a good idea (e.g. those who’ve done that in the past 2 years would have experienced production breakages or security lapses with at least 3 of the quarterly updates).
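
To make the "maybe ok" case concrete, here is a minimal sketch of a client-side cache that keeps an ephemeral-ID-to-URI mapping only for a short, explicit time window. It is an illustration under assumptions, not part of Disco API: the class name `ShortLivedUriCache` and the `resolver` function (standing in for the real second API call) are hypothetical.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical client-side helper: caches ephemeral-ID lookups for a bounded time only. */
final class ShortLivedUriCache {

    /** A resolved URI together with the instant after which it must not be reused. */
    private static final class Entry {
        final String uri;
        final Instant expiresAt;

        Entry(String uri, Instant expiresAt) {
            this.uri = uri;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Function<String, String> resolver; // assumption: performs the real API call
    private final Duration ttl;                      // keep this at minutes, not days
    private final Clock clock;

    ShortLivedUriCache(Function<String, String> resolver, Duration ttl, Clock clock) {
        this.resolver = resolver;
        this.ttl = ttl;
        this.clock = clock;
    }

    /** Returns the cached URI while it is still fresh; otherwise asks the resolver again. */
    String resolve(String ephemeralId) {
        Instant now = clock.instant();
        Entry cached = entries.get(ephemeralId);
        if (cached != null && now.isBefore(cached.expiresAt)) {
            return cached.uri;
        }
        String uri = resolver.apply(ephemeralId);
        entries.put(ephemeralId, new Entry(uri, now.plus(ttl)));
        return uri;
    }
}
```

With a TTL such as `Duration.ofMinutes(3)`, a URI that a distro invalidates is picked up again after at most three minutes; the sketch deliberately has no way to keep an entry past its expiry.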

joschi commented 3 years ago

> It's kind of a pity because now we're not able to provide the service of showing where requests are coming from per OpenJDK distribution.

Why would that be important?
If this is important, why not resolve the country of the user via GeoIP and only store the country the requests originated from?
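
A minimal sketch of what country-only counting could look like, assuming some GeoIP lookup is available. The class `CountryOnlyStats` and the `countryLookup` function are hypothetical names for illustration; they are not taken from the Disco API code base or from any particular GeoIP library.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Hypothetical aggregate: counts requests per distribution and country, never per IP. */
final class CountryOnlyStats {

    // Assumption: maps an IP address to an ISO country code, empty if it cannot be resolved.
    private final Function<String, Optional<String>> countryLookup;
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    CountryOnlyStats(Function<String, Optional<String>> countryLookup) {
        this.countryLookup = countryLookup;
    }

    /** Resolves the country, bumps its counter, and does not store the IP address. */
    void record(String ipAddress, String distribution) {
        String country = countryLookup.apply(ipAddress).orElse("unknown");
        counts.computeIfAbsent(key(distribution, country), k -> new LongAdder()).increment();
    }

    /** Number of requests recorded so far for the given distribution and country. */
    long count(String distribution, String country) {
        LongAdder adder = counts.get(key(distribution, country));
        return adder == null ? 0L : adder.sum();
    }

    private static String key(String distribution, String country) {
        return distribution + "/" + country;
    }
}
```

The IP address is only passed through to the lookup; the map holds nothing but distribution/country pairs and their counters.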

joschi commented 3 years ago

> This choice is based on painful and hard-earned experience with how doing otherwise (and enabling/encouraging caching at various levels) leads to nightmare scenarios when updates are rolled out and (empirically, as has happened in various distros and even to the upstream GA in the past couple of years) need to be patched/replaced/build-revved/rolled-back on some emergency basis.

Interesting thought, although I would expect such logic to be implemented by the JDK vendor, not by a directory service such as Disco API.

Is the refresh interval of the vendor data in Disco API even short enough for this to work sensibly in the way you described?

https://github.com/foojay2020/discoapi/blob/3433cc520aae4a5a16d47c05c5ade91031072710/src/main/java/io/foojay/api/CacheManager.java#L114

geertjanw commented 3 years ago

> > It's kind of a pity because now we're not able to provide the service of showing where requests are coming from per OpenJDK distribution.
>
> Why would that be important?
>
> If this is important, why not resolve the country of the user via GeoIP and only store the country the requests originated from?

Sure, that could be the way to do it. And I'm pretty interested in knowing where in the world particular OpenJDK distributions are used, aren't you? Not sure if it is 'important', though certainly interesting, at least.

foojayio / discoapi

Disco API is tracking users (IP addresses) #20