crystal-lang / shards

Dependency manager for the Crystal language

Telemetry pros/cons quality/security implications. #284

Closed didactic-drunk closed 5 years ago

didactic-drunk commented 5 years ago

This issue continues part of the discussion in #283.

Telemetry is a useful metric on the overall health/utility of the project. If it's being downloaded a lot then it's probably useful to someone.

Author feedback

Author feedback is incredibly important. Here's a set of questions I'm currently dealing with where telemetry would give me immediate answers.

This is a real example I'm dealing with right now and my conclusion is IDFK without telemetry.

Forks

Sometimes shards are abandoned by the author. Forks occasionally pop up that surpass the original. Right now all GitHub/crystalshards.xyz/shards.info searches exclude forks, but with a small development community, forks that are more up to date than the original are fairly normal. Telemetry could tell you that at least the author of a fork is using the shard, as opposed to someone who came by and wanted to learn Crystal or implement something in Crystal but never used it.

Telemetry could tell you "Hey that fork has users. Maybe I should send my contributions there or at least see why a particular fork is preferred over the others."

Publishing libraries for quality and hiding "only shards"

There was some talk about only listing published libraries. That's a terrible idea that will create more problems than it solves. If publishing is a listing requirement then new projects, regardless of quality or status, will follow the publishing requirements just like they do for academic journals. Wait, wat? Some Chinese universities have a doctorate requirement of publishing a scientific study in English in a well-known journal. Its real-world effect was an increase in junk science publications. I expect a similar effect with any additional shard publishing or listing requirements. There are plenty of other examples of people gaming the system with academic journals. I'm not sure the incentives you attempt to put in place will do what you think they will.

Example:

What will users think when they search for libfoo and see either only Bob's work or Bob's work ahead of Alice's? With telemetry, Alice's package would probably pick up traction with experienced developers, possibly getting forks with additional metrics to add weight. Without telemetry or reading the actual code, Bob's looks better even though it's shit.

It gets worse with bad actors:

With download numbers you could immediately see whether Bob's or Alice's software is more popular and do a more thorough investigation, or do what most people do and use the popular package.

Without download numbers, shiny graphics and a v2 can quickly shift new users to malware.

Privacy

Privacy issues are easy to work around, and no IP addresses need to be logged or traceable. Here are just a few options.

The anonymization of telemetry requests gives an edge over other package managers that have full tracking such as rubygems and node and still provides useful feedback.
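As a rough illustration of what "no IP addresses logged" could mean on the receiving side, here is a minimal Python sketch. It is not taken from shards or any existing registry: the field names (`shard`, `version`, `ip`, `user_agent`) and the helper names are hypothetical. The idea is to keep only an allowlist of fields, truncate the timestamp to a day, and store aggregate counts.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical allowlist: only these fields are ever stored.
ALLOWED_FIELDS = ("shard", "version")


def anonymize(raw_event, received_at):
    """Keep allowlisted fields only and reduce the timestamp to a UTC day.

    Anything else in raw_event (IP address, user agent, ...) is dropped
    and never reaches storage.
    """
    record = {key: raw_event[key] for key in ALLOWED_FIELDS}
    record["day"] = received_at.astimezone(timezone.utc).date().isoformat()
    return record


def count_downloads(records):
    """Aggregate anonymized records into per-shard, per-day counts."""
    return Counter((r["shard"], r["day"]) for r in records)
```

With this shape, the download number per shard survives while nothing in the stored record can be traced back to a machine.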

The security paranoid or corporate installations can easily opt out.
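An opt-out could be as small as an environment variable check. The sketch below is in Python and the variable name `EXAMPLE_NO_TELEMETRY` is made up for illustration; shards does not define such a setting as far as this thread shows.

```python
import os

# Hypothetical variable name, used only for this sketch.
OPT_OUT_VAR = "EXAMPLE_NO_TELEMETRY"


def telemetry_enabled(environ=None):
    """Return False when the user or site has opted out via the environment."""
    environ = os.environ if environ is None else environ
    value = environ.get(OPT_OUT_VAR, "").strip().lower()
    return value not in ("1", "true", "yes")
```

A corporate installation would set the variable once in its build environment and no request would ever be attempted.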

The average user's experience.

Can you provide a single metric that most users will use to make safer decisions regardless of their knowledge level other than a download number?

Telemetry is often provided by package managers because it provides useful information for a variety of porpoises. They're often better than humans at coming up with new ideas. Maybe the data will be used for decision making by individuals that benefit the community but not in a way that is normally thought of. See Security Auditing below.

Network delays

Telemetry can easily be delay free. Fork a background process for telemetry requests, with its output redirected to /dev/null. If it fails, so what? The user experience isn't hindered. I don't remember which of Node.js, Atom or both do something similar.
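The fire-and-forget idea can be sketched as follows. This is an illustrative Python version under stated assumptions, not how shards, Node.js or Atom actually do it: `spawn_detached` and `report_install` are hypothetical helpers, the endpoint is whatever URL the caller passes in, and `start_new_session=True` assumes a POSIX system.

```python
import subprocess
import sys


def spawn_detached(argv):
    """Start argv in the background and return immediately.

    All standard streams point at the null device, so the child can
    neither block on input nor print anything to the user's terminal.
    """
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: detach from the caller's session
    )


# Child script: POST two fields and swallow every error.
_REPORT_SCRIPT = (
    "import sys, urllib.parse, urllib.request\n"
    "endpoint, shard, version = sys.argv[1:4]\n"
    "data = urllib.parse.urlencode({'shard': shard, 'version': version}).encode()\n"
    "try:\n"
    "    urllib.request.urlopen(endpoint, data=data, timeout=5)\n"
    "except Exception:\n"
    "    pass  # if it fails, so what\n"
)


def report_install(shard, version, endpoint):
    """Send a telemetry ping from a separate process (hypothetical helper)."""
    return spawn_detached([sys.executable, "-c", _REPORT_SCRIPT, endpoint, shard, version])
```

The install command would call `report_install(...)` and carry on without waiting on the returned process, so a slow or unreachable telemetry server costs the user nothing.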

Security auditing

Download numbers were the most important detail when I identified a malware package in Firefox for a download manager gaining popularity. It had all the shine of a fancy download manager but the code contained a Base64 blob to send every link clicked through a remote server and eval()'d the response.

If there were metrics, security professionals (like myself) could choose shards to audit, enhancing community security. Without the metrics I'm less likely to audit software, as I don't know if my time is wasted on 0 download shards.

@watzon you're not Bob. You're not Alice either. If Chris is short for Christine then we can resolve our differences over an extra salty dinner and a finger

watzon commented 5 years ago

That last part confuses me :flushed:

didactic-drunk commented 5 years ago

Referencing https://github.com/watzon/nacl/issues/1.

ysbaddaden commented 5 years ago

There are much better ways to deal with that:

  1. Lookup the Git history.
  2. Lookup dependents: is the library used by many other libraries?

Downloads are just a number, they tell nothing. A project source history is much more interesting.

didactic-drunk commented 5 years ago

There are much better ways to deal with that:

1. Lookup the Git history.

Does the average user do that? Often not. I don't even do that every time.

* was the project created yesterday?

That doesn't handle the Mallory case where he/she sits and waits for users. After a year the project will be well aged and look "stable".

* was the project abandoned (no Git activity) then suddenly a release by a new author (fishy);

How does that look different from someone forking a dead project?

In my example with download numbers, the original project has a much greater number of users, which makes the Mallory project look fishy regardless of age.

In your example without download numbers they look the same except for age. The Mallory example above would start switching users after the "age" where it doesn't look fishy which is up to the individual. Some would be fooled on day 1.

* no activity for months, years? either it's very stable or its abandoned;

Downloads allow you to differentiate between the two.

With downloads it's stable and everyone can see it.

Without downloads, who knows? Maybe stable or abandoned. That was part of my set of questions outlined in Author feedback.

2. Lookup dependents: is the library used by many other libraries?

What if it's an app? What if it's an end dependency used by apps?

I have the stats for dependents. Most libraries have 0 dependents, but I know some of those are actively maintained and in use by private repos. I know because I'm using them, and they show up with 0 dependents across the > 900 Crystal projects on GitHub.
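The blind spot described above can be shown with a small Python sketch. The project names and dependency lists are invented for illustration; the function simply counts, for each library, how many visible projects list it as a dependency.

```python
def count_dependents(projects):
    """Map each project name to the number of other projects depending on it.

    `projects` maps a project name to the list of names it depends on.
    Only projects present in the mapping can be counted, which is the
    limitation: private repos never appear in it.
    """
    counts = {name: 0 for name in projects}
    for name, deps in projects.items():
        for dep in set(deps):
            if dep != name:
                counts[dep] = counts.get(dep, 0) + 1
    return counts


# Invented data: what a public index can see...
public = {"webapp": ["libfoo"], "libfoo": [], "libbar": []}
# ...and what it cannot.
private = {"internal-tool": ["libbar"]}

visible = count_dependents(public)
actual = count_dependents({**public, **private})
```

Here `visible["libbar"]` is 0 while `actual["libbar"]` is 1, so a dependents count alone cannot separate an unused library from one used only in private code.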

Downloads are just a number, they tell nothing. A project source history is much more interesting.

Age is just a number but you implied it tells something. Downloads can tell a lot as I've carefully outlined. Age and downloads often go together, especially with a graph over time.

didactic-drunk commented 5 years ago

@ysbaddaden I understand you don't like download statistics. I'm outlining where, how and why it's useful.

didactic-drunk commented 5 years ago

@ysbaddaden You didn't address a single issue of Author feedback and how you could possibly get any of that information from age or git history. My questions were designed to show that no metric except downloads tells you if the package is in use or not. I'm aware of corner cases where downloads won't tell you how many users there are, but they still give a minimum.

Right now my number of known users is 1 (myself). How do I identify other potential users of the package I'm attempting to maintain without download statistics? Dependency listings won't help; they don't show private repo usage.