crflynn / pypistats.org

PyPI downloads analytics dashboard
https://pypistats.org/

On November 26 a lot more `null` Entries appeared #13

Closed cooperlees closed 4 years ago

cooperlees commented 5 years ago

What can cause null entries on the:

What is the root cause of this on bandersnatch specifically? https://pypistats.org/packages/bandersnatch

Is it bad data to Google?

crflynn commented 5 years ago

In my experience, null fields in the download aggregations mostly come from downloads made directly with the requests library, rather than with pip, bandersnatch, or another dependency management client that wraps pip or otherwise sets a parsable user-agent header.

I did a similar investigation on two small utility packages I maintain, databricks-dbapi and databricks-api, which saw a few thousand downloads per day for several days, all made by the requests client.

A basic query on BigQuery shows that significant requests installs started on Nov 26 for bandersnatch, so it appears to be the culprit here, as well. What I also noticed is that these jumps correspond with the latest releases of bandersnatch. New releases usually correspond with jumps in mirror downloads as you know, but I still don't know why it would cause requests-based installs to jump also.

That being said, I'm not really sure why someone would download these packages with the requests library rather than something like pip or bandersnatch. My best guess is that something is scraping PyPI at regular intervals, such as a software archive or some security research download automation (?). I honestly don't know.

As an aside, I did some digging about how exactly the records are generated. Working with bandersnatch, you might know these details already but I'm going to put my findings here (for my own recollection at least):

Linehaul populates record fields by parsing the user-agent header from the PyPI logs. Pip sets its own user-agent, which is what makes these aggregations possible.

It looks like pipenv wraps pip for installs. Poetry also uses the venv's pip for installs under the hood. Similarly, bandersnatch sets its own user-agent.
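
To make the mechanism concrete, here is a minimal sketch of how a user-agent header could be turned into record fields, and why a bare requests header leaves some of them null. This is not Linehaul's actual parser: the `DownloadRecord` type, the `parse_user_agent` function, and the regular expression are all hypothetical, and the sample headers are simplified.

```python
import json
import re
from typing import NamedTuple, Optional


class DownloadRecord(NamedTuple):
    """Hypothetical subset of the fields a download record might carry."""
    installer_name: Optional[str]
    installer_version: Optional[str]
    python_version: Optional[str]


# Assumed header shape: "<name>/<version>" optionally followed by extra detail.
_PRODUCT = re.compile(
    r"^(?P<name>[A-Za-z][\w.-]*)/(?P<version>\S+)\s*(?P<rest>.*)$", re.DOTALL
)


def parse_user_agent(user_agent: str) -> DownloadRecord:
    """Return whatever fields can be recovered; anything missing stays None."""
    match = _PRODUCT.match(user_agent.strip())
    if match is None:
        # Nothing parsable at all: every field ends up null.
        return DownloadRecord(None, None, None)

    python_version = None
    rest = match.group("rest")
    if rest.startswith("{"):
        # pip appends a JSON blob with runtime details after its version.
        try:
            python_version = json.loads(rest).get("python")
        except json.JSONDecodeError:
            pass  # malformed detail: keep the installer, leave the rest null

    return DownloadRecord(match.group("name"), match.group("version"), python_version)


print(parse_user_agent('pip/18.1 {"python": "3.7.1"}'))
print(parse_user_agent("python-requests/2.19.1"))
```

The second call recovers a name and version but no Python version, since the default requests header carries no runtime detail to parse.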

I guess some follow-up questions here might be

crflynn commented 5 years ago

SELECT
  details.installer.name,
  details.installer.version,
  COUNT(*)
FROM
  `the-psf.pypi.downloads20181125`
WHERE
  file.project = 'bandersnatch'
GROUP BY
  1, 2
ORDER BY
  3 DESC

11/25

| name | version | count |
| --- | --- | --- |
| pip | 1.5.4 | 48 |
| null | null | 27 |
| bandersnatch | 2.2.1 | 4 |
| pip | 18.1 | 1 |

11/26

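| name | version | count |
| --- | --- | --- |
| bandersnatch | 2.0.0 | 274 |
| bandersnatch | 1.11 | 124 |
| bandersnatch | 2.2.1 | 100 |
| requests | 2.19.1 | 83 |
| pip | 1.5.4 | 50 |
| bandersnatch | 3.0.1 | 44 |
| bandersnatch | 2.2.0 | 32 |
| pip | 8.1.2 | 31 |
| pip | 18.1 | 24 |
| Browser | null | 18 |
| bandersnatch | 3.0.0.dev0 | 16 |
| bandersnatch | 2.1.3 | 8 |
| pip | 9.0.1 | 5 |
| bandersnatch | 3.1.0 | 4 |
| bandersnatch | 3.1.1 | 4 |
| bandersnatch | 1.1 | 4 |
| null | null | 4 |
| bandersnatch | 3.1.0.dev1 | 4 |
| bandersnatch | 1.4 | 4 |
| pip | 10.0.1 | 3 |
| requests | 2.6.0 | 2 |
| pip | 9.0.3 | 2 |
| requests | 2.13.0 | 1 |
| pip | 18 | 1 |

cooperlees commented 5 years ago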

Thanks for the detailed reply! Should we do a PR to requests and see if they'll accept adding the Python runtime into the default User Agent? I could try if you wish.

From:

To:

Via:

import sys

# Yields e.g. "cpython 3.7.1-final0": implementation name plus the five version_info fields.
python = sys.implementation.name
python += " {}.{}.{}-{}{}".format(*sys.version_info)
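
For any client that does not want to wait on a change in requests, the same string can be attached by the caller. The sketch below uses only the standard library so it runs without extra dependencies; the `runtime_user_agent` helper, the `example-mirror` client name, and the `0.1` version are hypothetical, and no request is actually sent.

```python
import sys
import urllib.request


def runtime_user_agent(client: str, version: str) -> str:
    """Build "<client>/<version> (<implementation> <python version>)"."""
    runtime = "{} {}.{}.{}-{}{}".format(sys.implementation.name, *sys.version_info)
    return "{}/{} ({})".format(client, version, runtime)


# Hypothetical client name and version, for illustration only.
request = urllib.request.Request(
    "https://pypi.org/simple/",
    headers={"User-Agent": runtime_user_agent("example-mirror", "0.1")},
)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

With requests itself, the equivalent would be passing the same string in the `headers` argument of a call or setting it on a session.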

Thoughts?

hugovk commented 5 years ago

There are also archivers such as https://www.softwareheritage.org/2018/10/10/pypi-available-on-software-heritage/ (I've not checked this one in detail).

cooperlees commented 5 years ago

Well, requests is not going to add the Python version. I understand kernel being risky, but I don't see the Python version as risky (which is why I removed kernel from bandersnatch). We're going to be left shooting up a dark tunnel here with requests. Oh well, we'll need to try to hunt down major PyPI users and see if they'll set a nice User Agent.

crflynn commented 5 years ago

I think you're right. It's difficult to tell what the purpose of the requests-based downloads is. If the downloads are related to software archives, I would prefer to filter them out of the pypistats aggregations. On the other hand, if they are part of a newer package management tool, then I would encourage that tool to either wrap pip or use a more detailed user-agent, since those downloads should be counted as user downloads. This discussion more or less prompts the question of whether to restrict the segmented aggregations to pip only. I'll have to do some research on what proportion of the downloads are actually null-valued due to requests as the agent.
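
As a rough first pass at that proportion, the per-installer counts from the 11/26 query result above can be summed to see what share came from requests. The rows below are copied from that result; the variable and function names are illustrative only, and a single day for a single package says nothing about the overall picture.

```python
from collections import Counter

# (installer name, version, downloads) rows copied from the 11/26 result above.
# None marks a field that was empty in the query output.
ROWS_1126 = [
    ("bandersnatch", "2.0.0", 274), ("bandersnatch", "1.11", 124),
    ("bandersnatch", "2.2.1", 100), ("requests", "2.19.1", 83),
    ("pip", "1.5.4", 50), ("bandersnatch", "3.0.1", 44),
    ("bandersnatch", "2.2.0", 32), ("pip", "8.1.2", 31),
    ("pip", "18.1", 24), ("Browser", None, 18),
    ("bandersnatch", "3.0.0.dev0", 16), ("bandersnatch", "2.1.3", 8),
    ("pip", "9.0.1", 5), ("bandersnatch", "3.1.0", 4),
    ("bandersnatch", "3.1.1", 4), ("bandersnatch", "1.1", 4),
    (None, None, 4), ("bandersnatch", "3.1.0.dev1", 4),
    ("bandersnatch", "1.4", 4), ("pip", "10.0.1", 3),
    ("requests", "2.6.0", 2), ("pip", "9.0.3", 2),
    ("requests", "2.13.0", 1), ("pip", "18", 1),
]


def downloads_by_installer(rows):
    """Sum download counts per installer name, ignoring the version."""
    totals = Counter()
    for name, _version, count in rows:
        totals[name] += count
    return totals


totals = downloads_by_installer(ROWS_1126)
grand_total = sum(totals.values())
requests_share = totals["requests"] / grand_total

print(f"requests: {totals['requests']} of {grand_total} downloads ({requests_share:.1%})")
```

Running the same aggregation across all packages and a longer date range in BigQuery would be needed before deciding whether to restrict the segmented aggregations to pip.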