Exodus-Privacy / etip

εxodus tracker investigation platform
https://etip.exodus-privacy.eu.org/
GNU Affero General Public License v3.0

Rewrite ETIP as a GitHub repository #133

Open pnu-s opened 2 years ago

pnu-s commented 2 years ago

After a lot of thinking, and after experiencing various pains with the current version of ETIP (as well as seeing many contributors hit the same pains), I'm wondering whether we could/should completely revamp ETIP in another way.

Hear me out: what about switching to a GitHub repository with a specific file for each tracker? I'd expect Markdown or JSON format (but preferably Markdown to ease the review).

Name of the repository could be: https://github.com/Exodus-Privacy/trackers

Advantages:

Drawbacks:

This is a major change so I would love to hear your opinions @U039b @jawz101 @eighthave @blaueente @IzzySoft (feel free to tag any potentially interested person)

jfoucry commented 2 years ago

What about the existing database of trackers in ETIP? Would we need to write a script that creates a PR for each tracker in the database?

pnu-s commented 2 years ago

Not to create a PR, but indeed we would need to create a script to migrate our existing trackers to the new format.

That goes into this point:

This requires a significant amount of work (writing scripts mostly) but I'm up for it if we go that way
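A minimal sketch of what such a migration script could look like, assuming the existing trackers are available as a dictionary keyed by tracker id. The field names, the sample records, and the file naming scheme are all assumptions made for illustration; the real ETIP data may look different.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical sample data; the real ETIP fields may differ.
SAMPLE_TRACKERS = {
    "2": {"name": "Demo Ads SDK", "code_signature": "com.demo.ads.",
          "network_signature": r"ads\.demo\.example"},
    "1": {"name": "Example Analytics", "code_signature": "com.example.analytics.",
          "network_signature": r"analytics\.example\.com"},
}


def slugify(name: str) -> str:
    """Turn a tracker name into a filesystem-friendly file stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "unnamed"


def export_trackers(trackers: dict, out_dir: Path) -> list[Path]:
    """Write one JSON file per tracker and return the created paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort by numeric id so the output order is stable between runs.
    for tracker_id, tracker in sorted(trackers.items(), key=lambda kv: int(kv[0])):
        # The id is appended so two trackers with the same name cannot collide.
        path = out_dir / f"{slugify(tracker['name'])}-{tracker_id}.json"
        text = json.dumps(tracker, indent=2, sort_keys=True, ensure_ascii=False)
        path.write_text(text + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Calling `export_trackers(SAMPLE_TRACKERS, Path("trackers"))` would create one file per tracker in a `trackers` directory, which could then be committed in one go rather than as one PR per tracker.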

eighthave commented 2 years ago

I think it's a great idea, and I can help write scripts and set up CI. I recommend using YAML for this: it is basically JSON that is meant to be edited by humans (indeed, all valid JSON is valid YAML). YAML is also widely understood since it is used in GitHub Actions, FUNDING.yml, gitlab-ci.yml, travis-ci.yml, F-Droid's metadata .yml format, and many more.

I think it should probably use StrictYAML to make human editing easier. For example, version: 1.1 vs version: 1.1.0 would always be parsed as the strings "1.1" and "1.1.0" while in plain YAML, it would be a float 1.1 and a string "1.1.0".
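To make the difference concrete, here is a small stand-in written with only the Python standard library. It does not call a real YAML parser or StrictYAML; `resolve_plain` and `resolve_strict` are hypothetical helpers that merely imitate the two behaviours described above (guessing a type from the shape of a value versus always keeping the string).

```python
import re

_INT = re.compile(r"[-+]?[0-9]+")
_FLOAT = re.compile(r"[-+]?[0-9]*\.[0-9]+")


def resolve_plain(scalar: str):
    """Imitate implicit typing: guess a type from the scalar's shape."""
    if _INT.fullmatch(scalar):
        return int(scalar)
    if _FLOAT.fullmatch(scalar):
        return float(scalar)
    return scalar


def resolve_strict(scalar: str) -> str:
    """Imitate the strict behaviour: every scalar stays a string."""
    return scalar


for raw in ("1.1", "1.1.0", "1.10"):
    # With implicit typing "1.1" and "1.10" both become the float 1.1,
    # while "1.1.0" stays a string; the strict variant keeps all three as typed.
    print(repr(raw), "->", repr(resolve_plain(raw)), "vs", repr(resolve_strict(raw)))
```

The `"1.10"` case shows why this matters for version-like fields: once it is read as a number, the trailing zero is gone.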

FestplattenSchnitzel commented 2 years ago

This sounds like a very good idea! Without much knowledge about the existing setup, I'd guess this change would make that data a lot easier for machines to read and thus enable other projects (like F-Droid) to use it as well.

Each new contribution (tracker creation, tracker modification, tracker validation) can be validated within the PR

If you use YAML or JSON you can use a JSON schema for validation.
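As a rough illustration of the kind of check that could run on each PR, here is a hand-rolled validator using only the Python standard library. It is a stand-in for a real JSON Schema, not an implementation of one, and the list of required fields is an assumption.

```python
import re

# Hypothetical set of required fields; the real format would define its own.
REQUIRED_FIELDS = ("name", "website", "code_signature", "network_signature")
REGEX_FIELDS = ("code_signature", "network_signature")


def validate_tracker(tracker: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = tracker.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty field: {field}")
    for field in REGEX_FIELDS:
        value = tracker.get(field)
        if isinstance(value, str) and value.strip():
            try:
                re.compile(value)  # only checks that the pattern compiles
            except re.error as exc:
                errors.append(f"invalid regex in {field}: {exc}")
    return errors
```

A CI job could load every tracker file, call `validate_tracker` on it, and fail the PR if any list of errors is non-empty.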

Everyone is required to have a GitHub account to contribute

It'd be great to see you on federated Gitea when it's there.

Speaking of submission: https://www.datenanfragen.de/ / https://www.datarequests.org provides a web form [0] that creates a PR on GitHub with a JSON file (e.g. [1]).

[0] - German: https://www.datenanfragen.de/suggest/#!type=new&for=cdb
[0] - English: https://www.datarequests.org/suggest/#!type=new&for=cdb
[1]: https://github.com/datenanfragen/data/pull/1646

jawz101 commented 2 years ago

I agree. A change at this point should serve multiple purposes: it should be easier to administer as well as easier to consume into Exodus. I don't know what bottleneck caused the 200 tracker signatures to pile up, but if it is a delay in moving them from ETIP into the machine-readable formats of Exodus itself, tracker signature finders can certainly step up and write things in a format that is more consumable and less hands-on. Or if we need to set up test environments and see how test APKs handle the signatures, I'm fine with that.

I just want to get the current backlog whittled down and let that process dictate how a new system could introduce improvement. My only concern with GitHub is that another bottleneck is introduced because pull requests get sat on and people get caught up in a back-and-forth discussion about a tracker signature, instead of having someone with the interest to implement new tracker definitions. I wouldn't think the implementer should immediately add every tracker as it is submitted. Rather, wait for 20 or so to pile up and then do the same operation to implement several at a time. If the problem is that submitters leave fields blank or the regex isn't correct, then we need required fields and a note that each one needs to be in a particular format for the implementers to integrate it. Just a note by the field. It doesn't need to be some fancy syntax checker.

pnu-s commented 2 years ago

Thanks for your input, @jawz101!

I share your concern about reducing the current backlog. I just added a couple of (very) minor changes to ETIP to ease the review, and we had a meeting this week within the organization to try to get more (volunteer) people involved in this task.

Actually, moving the trackers from ETIP to exodus is probably the only thing which works really well (and it's automated so requires very little human time).

What I miss the most in the current version of ETIP is:

What happens for most trackers currently in the backlog:

  1. The tracker profile is fine, it just needs some people to review it -> that does happen, but as I said we are trying to get more people involved in this review task
  2. The tracker profile does not match our definition of a tracker: it is a webchat SDK, a game development SDK, an identifier generation SDK, etc. -> we currently do not really know how to treat those
  3. The tracker profile is fine but the signature does not match any report in exodus (0 matches) -> that happens a lot, and it's very hard for us to validate the signature if there are 0 matches

I would say that case 3 is the most common, then 2, then 1.
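Case 3 can be pictured with a short sketch that counts how many reports a code signature matches. The shape of `reports` (application id mapped to a list of class names) and the use of `re.search` are assumptions for illustration; this is not how εxodus is known to store or match its data.

```python
import re

# Hypothetical reports: application id -> class names found in the APK.
SAMPLE_REPORTS = {
    "org.example.app1": ["com.example.analytics.Tracker", "org.example.app1.Main"],
    "org.example.app2": ["org.example.app2.Main"],
}


def count_matching_reports(code_signature: str, reports: dict[str, list[str]]) -> int:
    """Count the reports in which at least one class name matches the signature."""
    pattern = re.compile(code_signature)
    return sum(
        1
        for class_names in reports.values()
        if any(pattern.search(name) for name in class_names)
    )
```

A signature such as `com.unseen.sdk.` would give 0 here, which is exactly the situation where a reviewer has no report to check the signature against.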

I'm thinking that moving to a code repository would ease the discussion between submitters and reviewers, and allow us to not let a huge backlog like the current one happen. But I can be wrong, this won't solve every issue of ours.

And yes, we probably need to tackle the backlog before moving to a new system.

jawz101 commented 2 years ago

Perhaps something that indicates "needs more information" if there is a question about whether it is indeed a tracker or not. I still like seeing that a signature is in there even if it does not fit the definition of a tracker, because it would likely come up again. I mainly look for publicly accessible technical documentation, which tells me that the SDK must be in some application somewhere, at least at one point. Since there is no convenient way to upload unknown APKs directly from the phone, a cumbersome part of the submission process is having to go to the Exodus site with a package name in mind and upload it. And with the library only representing 80,000 or so apps, that leaves a large chunk unchecked.

But yeah, the ETIP website did seem like a lot of effort to invest rather than using something like GitHub. Though if it functions on the backend with a database, that has its own conveniences.

eighthave commented 2 years ago

I have some time to work on this, so I started sketching it out. Here is the first stab at a YAML conversion. It definitely needs work, but it is a good place to continue the conversation: https://github.com/eighthave/etip/tree/yaml-conversion/trackers

@pnu-s did you have time to work on this at all? If you have code for getting the data out of the database, I'm happy to work on getting it nicely outputted to YAML. I've been working from the JSON from https://reports.exodus-privacy.eu.org/api/trackers

eighthave commented 2 years ago

I was just working with @Miriam-cpu / mobilsicher.de and we thought that we could standardize on a data format here that would work for:

I think we can clearly use the same core data fields and structures, and additional project-specific fields can be added as needed without conflicting with these core fields. This works well when the base data structure is a dictionary. The only notable difference I can think of between these lists is that Exodus and F-Droid's network_signatures lists mark the problematic domains, while Mobil Sicher's third-party networks list needs to list the "good" domains, so that any other domain found is considered "third party".
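A small sketch of both ideas, under stated assumptions: the field names, the helper names, and the exact-match comparison of domains are all hypothetical simplifications, not an agreed format.

```python
# Hypothetical core profile shared by all projects.
CORE_PROFILE = {
    "name": "Example Analytics",
    "code_signatures": ["com.example.analytics"],
    "network_signatures": ["analytics.example.com"],
}


def with_project_fields(core: dict, extra: dict) -> dict:
    """Add project-specific fields, refusing to overwrite a core field."""
    clash = sorted(core.keys() & extra.keys())
    if clash:
        raise ValueError(f"project fields clash with core fields: {clash}")
    return {**core, **extra}


def flagged_domains(seen: list[str], problematic: set[str]) -> list[str]:
    """List of problematic domains: report only the domains that are listed."""
    return sorted(d for d in set(seen) if d in problematic)


def third_party_domains(seen: list[str], good: set[str]) -> list[str]:
    """List of good domains: report every domain that is NOT listed."""
    return sorted(set(seen) - good)
```

With the same observed traffic, the two list styles answer different questions: `flagged_domains` reports what is known to be bad, while `third_party_domains` reports everything that is not known to be good.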

pnu-s commented 2 years ago

@eighthave Thanks for the work you've put into this!

To be honest, we put our recent ETIP efforts into adding new features to its current form, for instance to make it more explicit why some trackers are not accepted into εxodus yet (which is our main problem at the moment).

Rewriting ETIP would cost us, and I'm not entirely convinced that we would gain more than we would lose in terms of ease of use and features. That can obviously still be discussed and is not a final decision, but we decided to keep investing in ETIP's current form.

That being said, we are obviously open to discussing the data format for trackers, as well as changes to the ETIP UI, the JSON export format, or the εxodus JSON API response.

eighthave commented 2 years ago

Can you point me to the new ETIP work? I couldn't find anything.

I'm still convinced that managing the ETIP/Exodus process via files and pull requests will make it easier to follow the work and to contribute to it. Millions of people are familiar with the git workflow at this point, so that alone means it is easier for people to follow. I have time to work on building this out, and we're going to do it anyway for the F-Droid.org proprietary libs list, and probably also the mobilsicher.de third-party list.

pnu-s commented 2 years ago

Can you point me to the new ETIP work? I couldn't find anything.

What I meant is that we added a couple of new features, such as the number of matches in exodus and the new badge for each tracker, which easily show why a tracker is not added to εxodus yet

I'm still convinced that managing the ETIP/Exodus process via files and pull requests will make it easier to follow the work

I have mixed feelings about this, mostly because we would lose all the effort we have put into the current form of ETIP (such as the automated integration of trackers from ETIP to εxodus, which would need to be rewritten).

But I obviously see some benefits (otherwise I would not have created this issue in the first place :smile:)

we're going to do it anyway for the F-Droid.org proprietary libs list, and probably also the mobilsicher.de third-party list

What do you imagine here? Do you think we could have a single repository managed by multiple organizations?

eighthave commented 2 years ago

Can you point me to the new ETIP work? I couldn't find anything.

What I meant is that we added a couple of new features, such as the number of matches in exodus and the new badge for each tracker, which easily show why a tracker is not added to εxodus yet

Where can I see that?

I'm still convinced that managing the ETIP/Exodus process via files and pull requests will make it easier to follow the work

I have mixed feelings about this, mostly because we would lose all the effort we have put into the current form of ETIP (such as the automated integration of trackers from ETIP to εxodus, which would need to be rewritten).

But I obviously see some benefits (otherwise I would not have created this issue in the first place :smile:)

If you point me to the code that does that integration, I can look and see if I can handle the porting.

we're going to do it anyway for the F-Droid.org proprietary libs list, and probably also the mobilsicher.de third-party list

What do you imagine here? Do you think we could have a single repository managed by multiple organizations?

I think it is possible, as long as we can find agreement on how it should be maintained. I'm talking with mobilsicher.de and @izzysoft about how to make this happen. mobilsicher.de currently maintains their own list, and @IzzySoft's library list is here in JSON Lines format: https://gitlab.com/IzzyOnDroid/repo/-/blob/master/lib/libinfo.txt

eighthave commented 2 years ago

I just put together some examples to start thinking about this more: https://gitlab.com/eighthave/proprietary-libs-list/-/tree/main/profiles

I don't yet see a clear logic to how the libraries are grouped. I think ETIP groups them more or less by "product" as defined by the companies that release it. Now that I've gone through this more, I think fdroid scanner would need things to be grouped by Anti-Feature. So basically, each profile would include:

anti_features:
  - NonFreeDep
  - Tracking

Then all of the code_signatures: entries should mean that something that contains all of those Anti-Features was found. Otherwise, we'd need some mapping of signature to Anti-Feature.

eighthave commented 2 years ago

After sleeping on this, I think we can actually leave the grouping pretty open because it should be fine if multiple profiles match a given library every now and then. These profiles are ultimately about showing info to a human, so multiple hits for a single library should be fine.
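Here is a rough sketch of that open grouping, assuming each profile carries `anti_features` and `code_signatures` keys as in the snippet above. The profile names, the sample signatures, and the plain prefix matching are assumptions for illustration, not how the fdroid scanner works.

```python
# Hypothetical profiles; a narrow one and a broader one that overlap.
PROFILES = {
    "example-analytics": {
        "anti_features": ["NonFreeDep", "Tracking"],
        "code_signatures": ["com.example.analytics"],
    },
    "example-sdk": {
        "anti_features": ["NonFreeDep"],
        "code_signatures": ["com.example"],
    },
}


def matching_profiles(class_name: str, profiles: dict) -> list[str]:
    """Return the names of every profile with a signature matching the class."""
    return [
        name
        for name, profile in profiles.items()
        if any(class_name.startswith(sig) for sig in profile["code_signatures"])
    ]


def anti_features_for(class_name: str, profiles: dict) -> list[str]:
    """Union of the Anti-Features of all matching profiles, sorted."""
    found = set()
    for name in matching_profiles(class_name, profiles):
        found.update(profiles[name]["anti_features"])
    return sorted(found)
```

A class such as `com.example.analytics.Tracker` hits both profiles, and the reader simply sees both; the Anti-Features are merged rather than treated as a conflict.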

eighthave commented 1 year ago

You can now see the first version of F-Droid rewriting its signature profiles as a git repo of YAML files. We call it "suss": https://gitlab.com/fdroid/fdroid-suss

eighthave commented 1 year ago

Here's more on F-Droid's work on a YAML/git setup for signature profiles: