Telemetry pros/cons quality/security implications.

didactic-drunk commented 5 years ago

This issue continues part of the discussion in #283.

Telemetry is a useful metric on the overall health/utility of the project. If it's being downloaded a lot then it's probably useful to someone.

Author feedback

Author feedback is incredibly important. Here's a set of questions I'm currently dealing with where telemetry would give me immediate answers.

How do I know if **shameless plug** https://github.com/didactic-drunk/cox is useful to anyone else?
Should I provide a full featured interface because it's useful to others or stop when my needs are met?
Do I need to worry about breaking API compatibility for lots of users, or will only 2 users be inconvenienced?
Right now I have 2 stars. Does that mean 2 people like what they saw and I can freely fix the API before anyone starts using it, or are there 100 silent users for every star?
I can guess the answer because my contributions are new.
But what about the base library I based it off? It seems abandoned for over a year and has 14 stars.
How many of those stars are actual users?
How many users are using it that didn't star it?
There are no public dependencies or reverse dependencies on github. Does that mean there are no users and I can break his API without worry, or are there 14*n users of private or non-github applications that depend on his library?
Several people forked the project and contributed which means it may have active users. Or not. "That wasn't a question." I didn't sign up for a game show.

This is a real example I'm dealing with right now and my conclusion is IDFK without telemetry.

Forks

Sometimes shards are abandoned by the author. Forks occasionally pop up that surpass the original. Right now all github/crystalshards.xyz/shards.info searches exclude forks, but with a small development community it's fairly normal for a fork to be more up to date than the original. Telemetry could tell you that at least the author of a fork is using the shard, as opposed to someone who came by wanting to learn crystal or implement something in crystal but never used it.

Telemetry could tell you "Hey that fork has users. Maybe I should send my contributions there or at least see why a particular fork is preferred over the others."

Publishing libraries for quality and hiding "only shards"

There was some talk about only listing published libraries. That's a terrible idea that will create more problems than it solves. If publishing is a listing requirement then new projects, regardless of quality or status, will follow the publishing requirements just like they do for academic journals. Wait, wat? Some Chinese Universities have a doctorate requirement of publishing a scientific study in English in a well known journal. Its real-world effect was an increase in junk science publications. I expect a similar effect with any additional shard publishing or listing requirements. There are plenty of other examples with academic journals and people gaming the system. I'm not sure the incentives you attempt to put in place will do what you think they will.

Example:

Alice: A senior software engineer creates a crystal shard for libfoo. She only implements half of libfoo as the library is large, fulfilling the needs of her individual application. It's used in production, reliable and well tested. She puts her shard on github but never publishes a library, thinking that it isn't 1.0 as only 50% of the methods are implemented.
Bob: A student learning programming searches for libfoo. Either:
1. The package shows up on the official registry and maybe he continues contributing or maybe thinks "it's not a published library" and being overoptimistic of his abilities starts his own project.
2. Doesn't see Alice's project because the search refuses to show "only shards" and starts his own project unaware of Alice's work. Bob follows the publishing requirements barely implementing 10% of the API, shoddily.

What will users think when they search for libfoo and see either only Bob's work or Bob's work ahead of Alice's? With telemetry Alice's package would probably pick up traction by experienced developers, possibly getting forks with additional metrics to add weight. Without telemetry or reading the actual code Bob's looks better even though it's shit.

It gets worse with bad actors:

Mallory: A malware writer takes Alice's unlisted project and bits of Bob's, repackages it as "LibFooV2" fulfilling the publishing requirements, adds a fancy graphic and sits waiting for users to start using his package. After a year Mallory adds a backdoor and pulls something like this.

With download numbers you could immediately see Bob's or Alice's software is more popular and do a more thorough investigation or do what most people do and use the popular package.

Without download numbers, shiny graphics and a v2 can quickly shift new users to malware.

Privacy

Privacy issues are easy to work around and no ip addresses need to be logged or traceable. Here are just a few options.

Open source the telemetry server and don't log ip addresses. Allow auditing by trusted maintainers.
Background connect to one of the various anonymous network relays or chat services to send telemetry.
Use DNS queries with a custom DNS server or log parser. Only the ip address of the caching dns resolver will show, for example one belonging to a public resolver such as Cloudflare's 1.1.1.1 or Google's 8.8.8.8.
Post as a gist with a searchable string <- allows tracing back to the user, but not ip address.
Send via an anonymous remailer. No ip addresses are traceable through a remailer and it's simple.
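
As a rough illustration of the DNS option above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the zone `telemetry.example.invalid`, the function names, and the choice to encode the shard name and version as labels. Nothing here is an existing shards feature. The idea is that the client only performs a lookup, and a custom authoritative server (or a log parser) for the zone counts the query names it sees.

```python
import re
import socket

# Hypothetical zone served by a custom DNS server that counts queries.
TELEMETRY_ZONE = "telemetry.example.invalid"


def _label(text):
    """Turn arbitrary text into a single valid DNS label (max 63 chars)."""
    label = re.sub(r"[^a-z0-9-]", "-", text.lower())[:63].strip("-")
    return label or "unknown"


def telemetry_query_name(shard, version, zone=TELEMETRY_ZONE):
    """Build the name to look up, e.g. 0-3-1.cox.<zone>."""
    return f"{_label(version)}.{_label(shard)}.{zone}"


def report_install(shard, version):
    """Trigger the lookup and ignore every failure; the answer is irrelevant."""
    try:
        socket.getaddrinfo(telemetry_query_name(shard, version), None)
    except OSError:
        pass  # no network, NXDOMAIN, etc.: telemetry is best effort
```

One caveat with this approach: caching resolvers answer repeated lookups of the same name from cache, so the server would see a lower bound on installs, not an exact count.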

The anonymization of telemetry requests gives an edge over other package managers that have full tracking such as rubygems and node and still provides useful feedback.

The security paranoid or corporate installations can easily opt out.

The average user's experience.

Can you provide a single metric that most users will use to make safer decisions regardless of their knowledge level other than a download number?

Telemetry is often provided by package managers because it provides useful information for a variety of porpoises. They're often better than humans at coming up with new ideas. Maybe the data will be used for decision making by individuals that benefit the community but not in a way that is normally thought of. See Security Auditing below.

Network delays

Telemetry can easily be delay free. Fork a background process for telemetry requests redirected to /dev/null. If it fails so what. The user experience isn't hindered. I don't remember which of Node.js, Atom or both do something similar.
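
A minimal Python sketch of that fire-and-forget idea, assuming a POSIX system. The function name and the reporter command are hypothetical; the point is only that the parent starts a detached child with its output discarded and never waits for it.

```python
import subprocess
import sys


def send_telemetry_in_background(argv):
    """Start argv as a detached child process and return immediately.

    Returns the Popen handle, or None if the process could not be started.
    """
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,   # output goes to /dev/null
            stderr=subprocess.DEVNULL,
            start_new_session=True,      # POSIX: detach from the caller's session
        )
    except OSError:
        return None  # reporter missing or not executable: so what


if __name__ == "__main__":
    # Hypothetical reporter: here just a child that sleeps briefly.
    send_telemetry_in_background(
        [sys.executable, "-c", "import time; time.sleep(0.2)"]
    )
    print("install continues without waiting")
```

The install path never blocks on the network. If the child fails or hangs, the user sees nothing, which is the behaviour described above.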

Security auditing

Download numbers were the most important detail when I identified a malware package in Firefox for a download manager gaining popularity. It had all the shine of a fancy download manager but the code contained a Base64 blob to send every link clicked through a remote server and eval()'d the response.

If there were metrics, security professionals (like myself) could choose which shards to audit, enhancing community security. Without the metrics I'm less likely to audit software as I don't know if my time is wasted on 0 download shards.

@watzon you're not Bob. You're not Alice either. If Chris is short for Christine then we can resolve our differences over an extra salty dinner and a finger

watzon commented 5 years ago

That last part confuses me :flushed:

didactic-drunk commented 5 years ago

Referencing https://github.com/watzon/nacl/issues/1.

ysbaddaden commented 5 years ago

There are much better ways to deal with that:

Lookup the Git history.

was the project created yesterday?
was the project abandoned (no Git activity) then suddenly a release by a new author (fishy);
no activity for months, years? either it's very stable or it's abandoned;

Lookup dependents: is the library used by many other libraries?

Downloads are just a number, they tell nothing. A project source history is much more interesting.
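
For reference, the history checks listed above could be automated roughly as follows. This is a hedged Python sketch: the function name, the flag names and the thresholds (30 days for "new", 365 days for "inactive") are assumptions, and the input is a plain list of `(timestamp, author)` pairs rather than a real Git API.

```python
from datetime import datetime, timedelta


def history_flags(commits, now, new_days=30, stale_days=365):
    """Flag suspicious patterns in a commit history.

    commits: list of (timestamp, author) tuples, sorted oldest first.
    """
    if not commits:
        return {"no-history"}
    flags = set()
    first_time, _ = commits[0]
    last_time, _ = commits[-1]
    if now - first_time < timedelta(days=new_days):
        flags.add("very-new")
    if now - last_time > timedelta(days=stale_days):
        # Very stable or abandoned: the history alone cannot tell which.
        flags.add("inactive")
    seen_authors = set()
    for (prev_time, prev_author), (when, author) in zip(commits, commits[1:]):
        seen_authors.add(prev_author)
        long_gap = when - prev_time > timedelta(days=stale_days)
        if long_gap and author not in seen_authors:
            flags.add("revived-by-new-author")
    return flags


if __name__ == "__main__":
    history = [
        (datetime(2017, 1, 1), "alice"),
        (datetime(2017, 6, 1), "alice"),
        (datetime(2019, 12, 1), "mallory"),
    ]
    print(history_flags(history, now=datetime(2020, 1, 1)))
```

Note that a legitimate fork of a dead project and a hostile takeover would both be flagged as `revived-by-new-author`; the sketch only surfaces the pattern, it does not judge it.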

didactic-drunk commented 5 years ago

> There are much better ways to deal with that:
> 1. Lookup the Git history.

Does the average user do that? Often not. I don't even do that every time.

> * was the project created yesterday?

That doesn't handle the Mallory case where he/she sits and waits for users. After a year the project will be well aged and look "stable".

> * was the project abandoned (no Git activity) then suddenly a release by a new author (fishy);

How does that look different from someone forking a dead project?

In my example with download numbers the original project has a much greater number of users, which makes the Mallory project look fishy regardless of age.

In your example without download numbers they look the same except for age. The Mallory example above would start switching users after the "age" where it doesn't look fishy which is up to the individual. Some would be fooled on day 1.

> * no activity for months, years? either it's very stable or it's abandoned;

Downloads allow you to differentiate between the 2.

With downloads it's stable and everyone can see it.

Without downloads who knows. Maybe stable or abandoned. That was part of my set of questions outlined in Author feedback.

> 2. Lookup dependents: is the library used by many other libraries?

What if it's an app? What if it's an end dependency used by apps?

I have the stats for dependents. Most libraries have 0 dependents, but I know some of those are actively maintained and in use by private repos. I know because I'm using them, and they show up with 0 dependents across the > 900 crystal projects on github.

> Downloads are just a number, they tell nothing. A project source history is much more interesting.

Age is just a number but you implied it tells something. Downloads can tell a lot as I've carefully outlined. Age and downloads often go together, especially with a graph over time.

didactic-drunk commented 5 years ago

@ysbaddaden I understand you don't like download statistics. I'm outlining where, how and why it's useful.

It's useful to authors of shards (See Author feedback).
It's useful to less knowledgeable developers to avoid malicious software (It's also useful in general for this purpose).
It's useful to quickly see what other people already found useful rather than conducting your own thorough investigation of every possible package.
It's useful to see how well supported a package is likely to be as a single number. Lots of users means lots of people who don't want to see it break. Age doesn't tell you that. Neither does git history without a lot of scrutiny and it's still not the same metric.
It's useful to security auditors (myself) to see where I should put my time.

didactic-drunk commented 5 years ago

@ysbaddaden You didn't address a single issue of Author feedback and how you could possibly get any of that information from age or git history. My questions were designed to show that no metric except downloads tells you if the package is in use or not. I'm aware of corner cases where downloads won't tell you how many users there are, but they still give a minimum.

Right now my number of known users is 1 (myself). How do I identify other potential users of the package I'm attempting to maintain without download statistics? Dependencies won't help, they don't show private repo usage.

crystal-lang / shards