Replace hashed email with manifest-based offline check

kzu commented 11 months ago

It was brought to my attention that this wasn't sufficiently anonymizing given that especially for corporations, the pattern for emails is not hard to probe if you have a list of emails from somewhere, but attempting to access the cloud URL for the sponsorship.

The current proposal (client-side CLI implementation PR) would work as follows:

Users build for the first time with a SponsorLink-enabled library. Analyzer looks for a manifest as a user envvar. Since it's not found (or empty), it cannot determine sponsoring status and informs (Info diagnostic) that Project X is seeking sponsorships. If you are already a sponsor, please sync your sponsorships using gh sponsors sync....
User follows the diagnostic link, which explains the following steps clearly
Install the GH CLI and the gh-sponsors CLI extension (basically running gh extension install devlooped/gh-sponsors).
Runs gh sponsors (sync being the default command). On the first run, the tool explains again what's going to happen and performs the following steps: a. Creates a user envvar with a random GUID to use for salting all hashes b. Gets the user's active sponsored accounts c. Gets the user's verified emails d. Gets the user's organizations and their verified domain(s), as well as their sponsored accounts e. Hashes each user's email c) with each sponsored account b) (salted with a)) and turns them into JWT claims ("hash"=hash) f. Hashes each verified org domain d) with sponsored accounts too (salted with a too), and turns them into JWT claims too g. POSTs this to a GitHub-authenticated SponsorLink endpoint that signs the JWT with SL private key. All the endpoint validates is that the logged in GitHub user (via Auth0) is the same as for the GH CLI. h. The backed responds with a signed JWT with an expiration that covers the current month (sponsorships expire at the end of each month). i. The token is saved to the envvar checked in 1)

On a subsequent build:

Analyzer sees both the manifest and salt envvars
If the token is expired or invalid (i.e. not signed), an Info diagnostics tells the user to run gh sponsors again.
Analyzer does hash(salt, email, sponsorable) and tries to find that claim within the JWT (all local). It also does a fallback check for hash(salt, domain(email), sponsorable), to support org sponsorships.
If the hash is found, the user (or an org) is sponsoring. Otherwise, an Info tells the user to please sponsor, with a link to do so for the given sponsorable (project/user/org).

Of note:

User email(s) are never accessible to SponsorLink, since installing the GH App (on the website) is no longer required
Organization info is also not accessible to SponsorLink (no need for GH Admin App install on the website)
It is inconvenient to work around, but obviously not impossible (but that is out of scope anyway).
The one-time install of the GH CLI and gh-sponsors extension is a barrier, but so was installing the GH SponsorLink app (on the website).

The goal is for integrators to just have a documented standard mechanism for verifying the JWT manifest token, even without any SL-provided code. But a simple loose file "helper" should be provided for simplicity.

GH CLI experience is similar to the following:

teo-tsirpanis commented 11 months ago

An idea I had is to use a CLI tool to authenticate to GitHub and prompt to install the SponsorLink app, and then get a certificate from a backend service and put it in the certificate store. This way we address two of the biggest complaints with the existing approach:

Users give explicit consent to provide their emails.
Building a project that uses SponsorLink will have less overhead. Instead of spawning a process and making a network request, verifying that you are a sponsor will be done by verifying a certificate. If that certificate does not exist or is expired (each project can choose the lifespan of its sponsorship certificates), the user will be warned accordingly.

ScarletKuro commented 11 months ago

Instead of spawning a process and making a network request, verifying that you are a sponsor will be done by verifying a certificate.

This is what I believe the author should consider as well. The validation process must work in offline mode. If I haven't forgot how asymmetric encryption works then achieving this should be possible. One potential solution involves designing a webpage that enables users to input relevant details(email address) to verify that this person is a sponsor. The server would generate a certificate. This certificate would be downloadable and placed within in the solution directory. The analyzer would verify the certificate that it was signed by a trusted source. Sadly, this could be abused if the user were to share their generated certificate. But I'm not sure if more complex solution is required, after all I don't think that author should focus in making some DRM. There is alternative approach, which can be abused less. It would require to write a cross-platform tool, the user(sponsor) would need to run this tool(it should include terms of use that user accepts to avoid any legal issues), this tool collects the HWID of this machine and generates a certificate. Then the user goes on the page, enters some information (like email) the webservice validates that this is sponsor and allows to upload the certificate, the website signs it. This signed certificate would be downloadable and placed within in the solution directory. The analyzer would verify the certificate that it was signed by trusted source and decrypt the HWID from it, then the analyzer would get the HWID from current machine and compares it with the one from the certificate, if they match then congratulation, you are the sponsor.

egil commented 11 months ago

If the premise is that users need the SponsorLink app installed on their dev boxes, perhaps that app could require the user to sign into GitHub (e.g. like the Co-pilot plugin does) and then it could query GitHub for the projects the user sponsors.

That information could then be made available locally on the machine to the analyzer running on it. That way, there is no "behind the scenes" communication going on at build time. SponsorLink app downloads and caches a list of sponsored projects to the user's machine.

It could also enable users that are part of an organization that sponsors a project to be counted as a sponsor, i.e. the organizations are the sponsors and not the individual devs working for the company. In many places, organizations are able to deduct sponsorships from taxes, whereas individuals are not, and would thus have to go through expensing this to their work if the work is to cover the sponsorship.

teo-tsirpanis commented 11 months ago

if the user were to share their generated certificate

My thought was to put the certificate in the OS's certificate store, marking it non-exportable if possible.

cmjdiff commented 11 months ago

My suggestion is to replace it with nothing, because that's the only acceptable solution under EU law, and the only solution many corporates will accept. You can't do the mapping without harvesting data to create a persistent identifier, and both halves of that are illegal in Europe under separate enactments (data protection law for the persistent identifier, computer misuse law for gathering the data you need).

teo-tsirpanis commented 11 months ago

I highly doubt that this is illegal in Europe.

And requesting the email can be skipped if the sponsoring accounts of a user can be determined from the GitHub API without it.

cmjdiff commented 11 months ago

I highly doubt that this is illegal in Europe.

You're right. It probably isn't. But that's not the issue, and I'm pretty sure you already knew that.

egil commented 10 months ago

My suggestion is to replace it with nothing, because that's the only acceptable solution under EU law, and the only solution many corporates will accept. You can't do the mapping without harvesting data to create a persistent identifier, and both halves of that are illegal in Europe under separate enactments (data protection law for the persistent identifier, computer misuse law for gathering the data you need).

This is of course not at all true. You can collect all the data you want in the EU as long as you just ask the user and the user gives consent. That is the entire problem with the current implementation. One or more emails are collected without asking permission first.

TeddyAlbina commented 10 months ago

My suggestion is to replace it with nothing, because that's the only acceptable solution under EU law, and the only solution many corporates will accept. You can't do the mapping without harvesting data to create a persistent identifier, and both halves of that are illegal in Europe under separate enactments (data protection law for the persistent identifier, computer misuse law for gathering the data you need).

Read the source code, nothing is happening if you don't install the github sponsor app and give explicit consent and also if you are not using an IDE

mattleibow commented 10 months ago

What about api keys? NuGet, Azure and all the others have api keys. Not sure how a hash of an email is somehow less private that an api key that directly maps to your email on another service.

However, just in case, may e people can just use a GitHub access token to read? So the token is the key? Then corporates can just set an envvar on ci/local and the build can do that?

we-sell-bags commented 10 months ago

What about api keys? NuGet, Azure and all the others have api keys. Not sure how a hash of an email is somehow less private that an api key that directly maps to your email on another service.

However, just in case, may e people can just use a GitHub access token to read? So the token is the key? Then corporates can just set an envvar on ci/local and the build can do that?

You make two faulty assumptions.

All development machines can see a live network
SHA 256 is no different than other API keys... a SHA 256 has has a very very low collision, that basically makes the hash unique..... for each email account & as such it is an identifier that can uniquely identify a SPECIFIC user to EXTERNAL sources.... by using some simple code, it becomes a situation where you can probe the back end database for SHA hashes... and as such identify specific users , it matters not that you do not have an "exploit", the fact is , it can uniquely identify people to EXTERNAL sources....

egil commented 10 months ago

What about api keys? NuGet, Azure and all the others have api keys. Not sure how a hash of an email is somehow less private that an api key that directly maps to your email on another service.

However, just in case, may e people can just use a GitHub access token to read? So the token is the key? Then corporates can just set an envvar on ci/local and the build can do that?

The point is that you have to get consent from a user before collecting their info. A 3rd party app or web site should not just go and collect your info without your consent, nor should it go and fish out API keys you may have access to without asking permissions.

For example, If SponsorLink uses OAuth2 to login to GitHub, it can request the users email address, and the user has to click the big green "Install and authorize" button. That is the user giving consent, and in that case, SponsorLink is not doing anything wrong.

macco3k commented 10 months ago

Read the source code, nothing is happening if you don't install the github sponsor app and give explicit consent and also if you are not using an IDE

If I read the explanation in the main readme, that does not seem to be true:

Check number 3. explicitly uses the email collected in 2. to send a remote request. All without any user consent.

kzu commented 10 months ago

@macco3k the SHA256 of the email. But point taken. That's what this issue is about.

@mattleibow I was thinking just depending on the GH CLI, so I don't have to deal with auth at all.

jozefizso commented 10 months ago

if the user were to share their generated certificate

My thought was to put the certificate in the OS's certificate store, marking it non-exportable if possible.

No licensing systems sends you certificate with a private key. They sign the license with their private key.

jozefizso commented 10 months ago

if the user were to share their generated certificate

My thought was to put the certificate in the OS's certificate store, marking it non-exportable if possible.

And obviously the private key material is stored somewhere and administrator can backup/export the private key.

https://www.yuenx.com/2022/certificate-security-export-cert-with-non-exportable-private-key-marked-as-not-exportable-windows-pki/

seanterry commented 10 months ago

SHA256 of the email. But point taken.

Hashing does not anonymize data.

A hash of PII should be treated as PII, as it effectively produces an easily-reproducible unique identifier. Email addresses are such a small domain of values that there is an infinitesimal chance of collisions and one can produce the hashes for simpler email addresses (combinations of common names @ common domains) very rapidly.

For those that require FIPS-compliance in the US, this is not an approved use of any hash algorithm. Hash algorithms are only approved for verifying message integrity, in approved digital signature algorithms, and in approved key-derivation functions.

The current implementation is not sound from a technical standpoint. A license, digitally signed by you and verifiable at runtime, would be a better implementation. Though do allow for multiple types of algorithms as nothing is universally acceptable. I'm not even going to get into the privacy laws you'll now need to be comply with, nor the legal liability you'll be taking on with it being a paid product. Those are better discussed with your attorney, and not random people on the internet.

As a developer of some FOSS libraries that next to nobody uses, this move sucks. I don't monetize those projects, and consequently won't use any that require any monetary investment from me.

As a manager of a development team, this would never pass muster, and would be an instant "pin/fork a version that doesn't require payment or else eliminate it as a dependency". And I say that not because we're cheap and don't want to pay for things; I say that because the implementation is poor, the value provided is minimal, and there is no option to sponsor at an organizational level to unburden my team. And perhaps most importantly, it makes me look like an idiot to my boss when we suddenly have an unplanned expense (no matter how small).

For example:

SponsorLink will never interfere with a CI/CLI build, neither a design-time build.

While I am certain there are some organizations that prioritize getting their builds out with no delay or interference, mine is not one of them. You add 1, 2, or even 10 seconds to my CI builds, then you're wasting an almost-free amount of compute time.

If you add a millisecond to my developers' unit test times, we're going to break out the pitchforks and torches. Moq was already painfully slow, and only gets a pass because it was historically more convenient to write tests with than arranging an integration test. But with AI making integration tests far less tedious to write, and which typically execute as fast or faster than Moq while providing much greater assurances of correctness with greater readability... Moq's current value lies almost entirely in not needing to rewrite existing tests. And while that is a tangible and measurable amount of value, it's not an intolerable one.

But the biggest issue--the underlying one that I haven't seen mentioned in this discussion--is the stain this places on the FOSS community in general. People want to monetize their work, and that is understandable. And while I don't know what the right way is, this is definitely one of the wrong ways. In this case, suddenly charging for something that was previously free is exactly what businesses that were wary of using FOSS were afraid would happen. There was so much FUD around FOSS in years past that it took ages to be allowed to use it in our code.

Once we got past that (look, even Microsoft is doing open-source now), it put developers in the driver's seat and we became able to stop re-inventing the wheel and use existing packages because they saved us time. We didn't use free ones simply because we were cheap; we used them because it was the path of least resistance. They offer value to our productivity that doesn't require justifying a purchase order. Once money is involved, things get painful.

But today, FOSS has become yet-another vector for malware. And where once we feared submarine patents or unfixed defects in abandoned projects, we're now being scalded by submarine licensing fees. We either need to admit we made a poor judgement call by asking the company to pay for it, or else pay for it out of our own pockets. It's a breach of trust that affects the whole community.

Perksey commented 10 months ago

And now I present the most overkill solution to preserving anonymity:

Use a k-anonymity model with the repo name (this prevents giving out email digests willy nilly, the digests are always scoped to the repo). Digest is calculated as H(sponsorEmail || sponsoredRepo) where H is the hash function. API exposes an endpoint that will take the first 5 characters of the digest, and return the suffixes of matching digests.

FindDigestSuffixes(digestStart)
    // Read public blob with name digestStart
StoreDigest(digest)
    // Append to a public blob with name digest[:5] a line containing digest[5:]
AcknowledgeNewSponsor(sponsorEmail, sponsorRepo)
    digest := H(sponsorEmail || sponsorRepo)
    StoreDigest(digest)

There is no solution that prevents the storage of PII given that what you want to do here is identify individual persons (sponsors), but you can at least stop users who have not agreed to share their email with you by linking their account sending their PII to you.

DavidAtImpactCubed commented 10 months ago

And now I present the most overkill solution to preserving anonymity:

Use a k-anonymity model with the repo name (this prevents giving out email digests willy nilly, the digests are always scoped to the repo). Digest is calculated as H(sponsorEmail || sponsoredRepo) where H is the hash function. API exposes an endpoint that will take the first 5 characters of the digest, and return the suffixes of matching digests.
FindDigestSuffixes(digestStart)
    // Read public blob with name digestStart
StoreDigest(digest)
    // Append to a public blob with name digest[:5] a line containing digest[5:]
AcknowledgeNewSponsor(sponsorEmail, sponsorRepo)
    digest := H(sponsorEmail || sponsorRepo)
    StoreDigest(digest)
There is no solution that prevents the storage of PII given that what you want to do here is identify individual persons (sponsors), but you can at least stop users who have not agreed to share their email with you by linking their account sending their PII to you.

IIRC, haveibeenpwned.com does something similar to this to avoid ever sending email/password hashes over the wire: https://www.troyhunt.com/understanding-have-i-been-pwneds-use-of-sha-1-and-k-anonymity/

kzu commented 10 months ago

I've been exploring creating a GH CLI extension (initial proof of concept here). It might be the easiest to integrate since it automatically conveys that you have to be signed in to the GH CLI to operate:

The idea is that the CLI tool would allow you to run gh sponsors sync [sponsorable project], which would download a SponsorLink "license" file. If you hadn't installed the GitHub App at that moment (which requires your permission to access your email), it would tell you so and offer to go to install it (just like I do via a diagnostics analyzer link right now).

At (IDE-only, like today) build time, the analyzer would look for this file and verify your (local git configured) email against it, entirely offline, and at most (say) once a day. So the email would never leave your machine if you hasn't installed the GH app at all, and neither would it if you had.

Since a calendar month is the cut-off for sponsorships (one-time or recurring), you'd need to download this file periodically, as it will therefore have an "expiration" built-in.

The file would need to contain sufficient information to:

Verify current working dir git repo email is one associated with sponsorable/account as a sponsor
Support org-wide sponsorships (so, email domain-only check)
Use a k-anonimity scheme when generating the file to avoid leaking PII

So, based on @Perksey suggestion, the algorithm that would also support org-wide sponsorships might look like the following:

# this runs locally on the dev's machine, via gh CLI extension
sync ($sponsorable: "moq")
    foreach (org in me if org.verified)
        hash = H(org.domain | $sponsorable)
        # digest fetched from blob storage
        if (digest(hash[..5]) contains hash[5..])
            append($sponsorable.txt, hash)
    foreach (email in me where email.verified)
        hash = H(email | $sponsorable)
        # digest fetched from blob storage
        if (digest(hash[..5]) contains hash[5..])
            append($sponsorable.txt, hash)

# this runs on backend *after* installing the GH App that 
# authorizes SL to read user's email(s)
sponsored($sponsor, $sponsorable)
    if ($sponsor is user)
        foreach (email in $sponsor where email.verified)
            hash = H(email | $sponsorable)
            append(diggest[hash[..5]], hash[5..])
    elseif ($sponsor is org && org.verified)
        hash = H(org.email.domain | $sponsorable)
        append(diggest[hash[..5]], hash[5..])
        # if org has a website, i.e. gh api orgs/particular
        hash = H(org.url.domain | $sponsorable)
        append(diggest[hash[..5]], hash[5..])

unsponsored($sponsor, $sponsorable)
    # same but removing hashes

# this runs locally in the editor, note: no network requests
check($sponsorable: "moq")
    if (!exists($sponsorable.txt)) return false;
    if (contains($sponsorable.txt, H(email | $sponsorable))) return true;
    if (contains($sponsorable.txt, H(email.domain | $sponsorable))) return true;

Does that sound correct? From @DavidAtImpactCubed's link to Troy's most excelent blog post, I take it that this set up would be fast, PII-preserving and accurate, even if we switched to SHA-1?

Thanks a lot for the feedback!

Perksey commented 10 months ago

I would probably switch to SHA256 or SHA3-256 given the known cryptographic problems with SHA1 to cover your backside a bit, but otherwise sounds good to me.

kzu commented 10 months ago

Troy's fairly recent post on the topic makes a strong case for actually sticking to SHA1 though: https://www.troyhunt.com/understanding-have-i-been-pwneds-use-of-sha-1-and-k-anonymity/

DavidAtImpactCubed commented 10 months ago

On reflection, I'm not sure that k-anonymity (a-la-haveibeenpwned.com) is achievable in your use case. As I understand it, it relies on being able to segment the reference data set (in your case sponsors) into groups which share a common hash prefix which are large enough that the response contains many possible matches (the "k" in k-anonymity). In the case of haveibeenpwned, the responses contain a large number of matching hashes (see an example here: api.pwnedpasswords.com/range/11111)

In your case: a) I'm not sure that the list of sponsors emails is public (that might be wrong, I'm not familiar with the domain!) b) Even if the list of sponsors is public, I'm not sure there are enough of them for k-anonymity to work.

The flip side of this is that if the number of sponsors is public, and is a short list, then why not respond with a list of hashes of all sponsors, and do all checks client side?

If your list of sponsors is not public, none of these approaches are valid.

Perksey commented 10 months ago

@kzu The use cases are different. HIBP protects known-public data therefore the SHA-1 just acts as a minor obfuscation as well as handling the compromised data in a uniform manor. I strongly discourage using a known-insecure hash for data that is not strictly public.

@DavidAtImpactCubed As discussed, the fact that sponsors will have PII stored and exposed as pseudononymised is unavoidable, k-anonymity isn't being used to prevent this. This model is only used to prevent entire hashes of emails being sent to the server by those who have not agreed to it (those who have their data stored agree to it at time of linking the GitHub app)

Perksey commented 10 months ago

Also I'm not sure adding the node ID to the hash would work (better though it might make it). The hash data needs to only include data that the analyser can obtain, which will likely be the email obtained from git and the repo name of the consuming package. I don't think the analyser can obtain the node ID, unless you mean the ID of the project (which can probably be done at pack time)?

But yes adding as many domain-specific components to the hash will make it better.

Ideally we'd have both random and non-random elements, which could be done by having H(H(sponsorEmail || repo) || additionalSalt) where the resultant outer digests (the entire outer digests given that they'd be unrelated to the inner digests) are stored as {innerDigestFirst5Chars}.txt and the salt as {innerDigestFirst5Chars}.salt.txt. The idea being:

Analyser computes inner hash
Analyser gets salt for inner hash prefix
Analyser computes outer hash
Analyser retrieves outer hashes and compares

Where the salt would be a true random string.

But at this point the security benefits do start to wear off, and this is more a peace of mind suggestion in line with some of my experiences with NIST SP 800-90.

kzu commented 10 months ago

@Perksey yeah, I think complicating the work the analyzer needs to do later on will have a non-zero impact on builds, so better not do that unless it really adds value.

So from the point of view of not sending anything anywhere before you have explicitly agreed on sharing your email, I think this approach is sounds, and the k-anonimity aspect on top makes it hard to brute-force reverse engineer the connection between emails, GitHub accounts and sponsorships, right?

Perksey commented 10 months ago

Yep, agreed.

NzKyle commented 10 months ago

I think that you're going to need to explicitly ask for user's permission to start running commands on their machines in the background at build time.

I don't think you'll ever get mass by-in for a product that does that all without explicitly being permitted to do so.

I certainly don't expect libraries to do anything other than their express purpose as used by me, in my code.

clyvari commented 10 months ago

@kzu

So basically the discussion boils down to

Installing a license file on a computer
when building the project in an IDE, the build process can lookup this file and validate it

--

For point 1: The GitHub CLI extension seems like a pretty good idea. So you can "login" to your SponsorLink backend and retrieve a "license". If I understand correctly, it's doable ?

Either use the configured GitHub CLI Identity
Or possibly use a separate identity, that will be configured specifically for the gh-sponsor extention (gh sponsors login ...).
Or a mix of the two maybe ?
- gh sponsors login --github uses Github Cli identity~~
- gh sponsors login $email and then starts a OAuth flow to login to your SponsorLink App~~
- I'd rather be able to use the last version, because I might have different profiles / emails address in different repos, and if I were to use SponsorLink I would like to re-use my Sponsor Link identity across projects, and not my corporate email for instance. Edit: I forgot the SponsorLink app is a Github app only, so a github identity is required. Still, I think It would be good to ensure it's easy to re-use the same Sponsor Link identity across project that use a different GitHub Identity

To me, this approach take care of all privacy related issues.

--

For point 2: two things are possible:

Check the validity of the license online (maximum security, but one major drawback is that obviously network connectivity is required, and "people" might not be happy with the "phones home" thing)
Check the validity offline, with validating the license certificate signature for instance.
Possibly a mix of the two, with the latest being used if network connectivity checks fail.

--

Overall, I would not complexify things too much with K-anonymity and such. Just manage the identity with a standard console login with OAuth. It's enough for pretty much ALL apps out there, I don't know why SponsorLink only would have to deal with K-anonymity.

For the license file itself, I would just go with a JWT token (possibly the same one returned after the "login" step):

You can add all the info you need
It's signed so you can validate authenticity
It has an expiration date
You can add as many claims / scopes to the token to specify what your user has access to.
You would just need to send this token and/or validate it during the build process
It's already an accepted practice to send JWT tokens alongside server requests to authenticate a user

Basically, you are developing a full fledged licensing system

psimsa commented 10 months ago

Here's a thought: The verification happens in the context of a nuget specific package (e.g. Moq). So use the package ID (or some other package-specific info) as salt for the email hash. It will increase storage requirements on the CDN for sure, but given the amount of data that needs to be actually stored when the most expensive (premium) storage runs at 15 cents per gigabyte per month I think the cost increase will be hardly noticeable. You could then also simplify the check - a 404 on a has comprised of email and package id means the user is not sponsoring the package (though we don't know whether it's the lack of app or lack of sponsorship). While it has drawbacks, it's faster.

Perksey commented 10 months ago

@prisma this is why the GitHub repo was proposed to be included (as a salt)

Perksey commented 10 months ago

Re the file-based concepts, I acknowledge that there has to be some solution for airgapped environments or anyone who doesn't want a phone-home solution, but I think that transferring a file will prove more trouble than it's worth given that sponsorships only last a month.

psimsa commented 10 months ago

@Perksey what if the code is not on github though? Or would it take the origin upstream? While package ID can be implemented as a compile-time constant, retrieving other stuff from filesystem / git repo takes additional IO/time...

Perksey commented 10 months ago

I think there is a baseline assumption that packages that ask for GitHub Sponsors are hosted on GitHub.

psimsa commented 10 months ago

AAAAAAAAAAAAAAAAAAAAAAAAA.... you meant github repo of the package, not github repo of the project that consumes the package........ sorry, it's late :-)

psimsa commented 10 months ago

But, isn't it unnecessary extra work (to get the repo ID) when you can simply burn-in an ID into the compiled DLL?

kzu commented 10 months ago

@psimsa burning in the product ID (right now, the diagnostics are grouped by product, not package, so you don't get 5 messages for using the ThisAssembly meta-package, for example) sounds reasonable IMO. Arguably, a malicious actor could grab all nugets in nuget.org, scan for this hardcoded ID, and they would know the salt. Then (supposedly) they could get the list of emails, hash both, and we'd still get people complaining about PII.

Which is why I'm feeling more and more inclined towards the k-anonimity approach (which doesn't add much additional work on the technical implementation side, TBH).

@clyvari I like the JWT idea! I was thinking the older XML signature stuff, but JWT seems like it solves the same issues and is perhaps simpler.

For a v2, I'd not complicate much the GH login story. A perhaps not well-known feature of GH accounts is that you can add multiple emails your personal one (including your work email), verify them all, and have them be 100% private and invisible to anyone. Yet that info can be used to map your sponsorship even to an org (via the email domain). So you don't really even need to have two GH accounts for the sponsorlink attribution to properly resolve.

Perksey commented 10 months ago

Arguably, a malicious actor could grab all nugets in nuget.org, scan for this hardcoded ID, and they would know the salt. Then (supposedly) they could get the list of emails, hash both, and we'd still get people complaining about PII.

The salt isn't strictly sensitive, it's just so prevent extant hashes of emails (i.e. from a breach) from being cross-referenced. If you consider a password called "password", you can run that through a hash with a salt but without knowing that the password is "password" as well as the salt, you won't know that the hash is for "password" + the salt. It's basically just protecting against compromise (which in this case, we are intentionally compromising), as it is assumed that recomputing the hashes is hard work. It is true that it could be recomputed, however you'll have to recompute every single email in a list of breached emails for each repository just to be able to match, and even if you did that what would you gain? The hash makes it more secure yes, but the main intention is secure data in transit and easy handling of the stored data.

TsengSR commented 10 months ago

This is of course not at all true. You can collect all the data you want in the EU as long as you just ask the user and the user gives consent. That is the entire problem with the current implementation. One or more emails are collected without asking permission first.

That's completely wrong to start with. GDPR requires you to collect and process only data that's absolutely necessary to offer a service. Email address isn't such one, for the sake of authentication or licensing, since technically a licence file or licence code is suitable for that.

cmjdiff commented 10 months ago

That's completely wrong to start with. GDPR requires you to collect and process only data that's absolutely necessary to offer a service. Email address isn't such one, for the sake of authentication or licensing, since technically a licence file or licence code is suitable for that.

This is correct. Anything that's strictly necessary for something to work - as in, it's literally impossible to make it function otherwise - falls under the second grounds of necessity. If that doesn't hold, that grounds can't apply. (See EDPB finding in NOYB's case against Facebook in Ireland.)

Narshe1412 commented 10 months ago

And it also establishes very strict regulations about where you can store that data, the rights to access and deletion for their owners. If that data is travelling outside the EU it is a big NO-NO.

cmjdiff commented 10 months ago

Data residency is complicated, particularly now that GDPR is being cloned. In general, the principle is that data shouldn't go anywhere where proection is substantially weaker than it provides. It would generally be a safe assumption tht anywhere that adopts this style of regulation would likely permit transfer into the EU, so that'll be the place to put things.

kzu commented 10 months ago

@TsengSR @cmjdiff since all a nuget-provided library/analyzer running in a build inside an editor has access to, is the current repository email in order to perform the lookup and mapping against an offline and 100% local file, I'd say it qualifies as necessary (to verify "ownership" of the local file, strictly). Otherwise, anyone could just share the file with the whole team and the check would become instantly ineffective.

kzu commented 10 months ago

Updated the issue title and description with details of the first pass at an improved implementation in the linked PR.

Please do let me know what you think!

TsengSR commented 10 months ago

@TsengSR @cmjdiff since all a nuget-provided library/analyzer running in a build inside an editor has access to, is the current repository email in order to perform the lookup and mapping against an offline and 100% local file, I'd say it qualifies as necessary (to verify "ownership" of the local file, strictly). Otherwise, anyone could just share the file with the whole team and the check would become instantly ineffective.

You can try and wind however you want, you won't have success with this. People and company won't get themselves be black-mailed by you, since it doesn't add business value and that's all companies care.

Most will migrate away from Moq and the rest that can't can just fork it and make their own builds w/o your malware or a group will from that maintains a malware-free fork of it.

The beauty on OSS licencing is, it works for and against you. You can't even prevent forking or people applying upstream fixes to their forks, no matter how much effort you put on it or on your malware here.

Its BSD licences, so anyone can do the fork it and apply upstream fixes and features to it, minus the malware within it. And you can't even do anything about, not even change the license, since that would require the agreement of every single contributor.

So ironically, the only option for you is, as said earlier, to offer an separate extension under commercial licence to it with enterprise features, support and SLA.

For a huge company that would have 10 or 100 thousand tests it may be to expensive to migrate away to a different framework, but still would be cheaper to hire someone to maintain a fork of it AND pay them while you wouldn't see a single cent from it. Ironic, isn't it?

mattleibow commented 10 months ago

@TsengSR I would suggest you offset your angry comment with a random GitHub sponsorship. Find a repo that you can love and send a dollar. OSS is about people sharing and working together to create something. Your comment is the opposite.

cmjdiff commented 10 months ago

OSS is about people sharing and working together to create something. Your comment is the opposite.

You're missing an important qualifier there. OSS is about creating something good. Sometimes people are pig-headedly on the wrong path, and it's both acceptable and necessary to set them right. This is the equivalent of "removed 500 LoC, most productive day ever".

It is both acceptable and necessary to point out that in order to pursue this questionable idea, @kzu is walking past multiple proven viable options for getting himself paid. And if he continues walking past them, it continues to be acceptable and necessary to point it out.

cmjdiff commented 10 months ago

@TsengSR @cmjdiff since all a nuget-provided library/analyzer running in a build inside an editor has access to, is the current repository email in order to perform the lookup and mapping against an offline and 100% local file, I'd say it qualifies as necessary (to verify "ownership" of the local file, strictly).

@kzu That's not how "necessity" works here. It doesn't qualify as necessary just because a thing you want to do won't work without it. GDPR isn't about things you might want to do, it's explicitly about protecting people against things you might want to do (it was designed with Big Tech in mind, who have a history of abusing contracts to do things they might want to do which don't necessarily align with their users' interests). Article 6(1) lays out the various grounds. To qualify as necessary, it must be that a thing the user wants to do won't work without it, and not just because you rigged it to break if they don't accept the thing you want to do.

The canonical example is that if you want a phone line and internet service, a telco has to install some equipment at your home, otherwise the service won't work. Therefore, they need to know where you live so they know where to install it. What you're arguing here is that your name and address, along with the phone number they assign you, would be necessary to compile a complete directory of their customers. But that wouldn't fall under necessity, because the directory isn't strictly necessary to providing you with internet service. Maybe selling the directory subsidises the service, but that doesn't make it contractually necessary - it's been understood this way from the beginning, and affirmed recently against Facebook by EDPB and CJEU.

To rely on the contractual necessity grounds, you would have to show that it was strictly necessary for SponsorLink to do what it proposes to do in order for whichever library the user wants to use that depends on it to work. Not simply fail to run or install because of a failed dependency, but actually not function. In much the same way that "forced consent" isn't real consent, "engineered necessity" isn't real necessity.

Unfortunately, this would be unavoidable. It's not something you can find an implementation workaround for. It's not even a design flaw. It's an inherent and unavoidable part of the very thing you're trying to do - connect a local user to remotely-held sponsorship data. At the very least, if you were challenged on it (and GDPR explicitly has extraterritorial effect), you'd have to prove that you did all of this.

Honestly, the best way forward is to just give up on tracking and linking, and stick to static messages. There simply is no way to do what you want to do here without data protection implications There's no solutionising to be done here. It's literally impossible, because GDPR in particular is explicitly designed and intended to make it impossible. It's been written in such a way that it is practically impossible to interpret it in any way other than intended. That's not to say that no other interpretations may exist, but the combined might of the Big Tech legal corps has thrown everything they can at it and can't make anything stick. Static messages would make all of this go away because there is no targeting and no linking, and therefore no PII in any form, obfuscated or otherwise, anywhere - not on the client, not on your service. If you want to put a message in front of people where they don't have a guaranteed legal right to not see it, a static message is the way to go.

TsengSR commented 10 months ago

OSS is about people sharing and working together to create something. Your comment is the opposite.

Yea, but what he is doing is exactly he opposite, just because he is too lazy to register a company and offer paid support, SLA and/or a paid extension for it he chooses to blackmail everyone as a source of money, just because he lost any connection to the reality. That's not only ruin his reputation, but also in the OSS as a whole, making donations.

What he fails to see is, that companies are not charities and hence companies don't donate. You can argue that of being a bad thing or not, but thats how the real world functions. Companies pay for value added. In corporate terms, that's SLA, priority support and value features that are not available otherwise w/o paying.

kzu commented 10 months ago

Companies don't donate

https://github.blog/2023-04-04-whats-new-with-github-sponsors/#organization-funded-sponsorships-now-generally-available

While in beta, we saw exciting growth in direct funding from 3,500 partner organizations like AWS, American Express, Shopify, and Mercedes Benz. And in 2022, nearly 40% of sponsorship funding came from organizations, with each organization-funded sponsorship worth on average nearly 15X more to maintainers than the average individual sponsorship.

Perhaps you @TsengSR just have no idea what you're talking about. That's another angle.

devlooped / SponsorLink

Replace hashed email with manifest-based offline check #31