crflynn / pypistats.org

PyPI downloads analytics dashboard
https://pypistats.org/

Ignore Metadata Files From Download Stats #66

Open jonathan343 opened 1 month ago

jonathan343 commented 1 month ago

Issue

While investigating the boto3 package's download stats for an unrelated reason, I noticed that a significant portion of the recorded downloads were for files like boto3-1.xx.xx-py3-none-any.whl.metadata. Using the publicly available BigQuery dataset, I ran a query and found that these metadata files accounted for ~18.55% of our pip "downloads" over the last 30 days (query and results provided below).

Request

Ignore files like *.whl.metadata since including them results in metrics that do not accurately reflect end-user downloads.
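As a rough illustration of the requested behavior, here is a minimal Python sketch that drops metadata files before counting. It is not pypistats.org's actual ingestion code: the function name, the IGNORED_PATTERNS tuple, and the example_pkg filenames are all hypothetical, and only the *.whl.metadata pattern comes from this request.

```python
from fnmatch import fnmatchcase

# Hypothetical ignore list; this issue only asks for '*.whl.metadata'.
IGNORED_PATTERNS = ("*.whl.metadata",)


def is_end_user_download(filename: str) -> bool:
    """Return False when the filename matches an ignored pattern."""
    return not any(fnmatchcase(filename, pattern) for pattern in IGNORED_PATTERNS)


# Made-up filenames, used only to exercise the filter.
filenames = [
    "example_pkg-1.0.0-py3-none-any.whl",
    "example_pkg-1.0.0-py3-none-any.whl.metadata",
    "example_pkg-1.0.0.tar.gz",
]

# Keep only the rows that should count toward download totals.
kept = [name for name in filenames if is_end_user_download(name)]
print(kept)
```

fnmatchcase is used instead of fnmatch so that matching is case-sensitive on every platform.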

SQL Query

#standardSQL
SELECT
  COUNT(CASE WHEN file.filename LIKE '%.whl.metadata' THEN 1 END) AS whl_metadata_downloads,
  COUNT(CASE WHEN file.filename LIKE '%.whl' THEN 1 END) AS whl_downloads,
  COUNT(CASE WHEN file.filename LIKE '%.tar.gz' THEN 1 END) AS source_downloads,
  COUNT(*) AS total_downloads,
FROM
  `bigquery-public-data.pypi.file_downloads`
WHERE
  -- Query information for the boto3 project
  file.project = 'boto3'

  -- Only query the last 30 days of history
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    AND CURRENT_DATE()

  -- Only consider downloads using pip
  AND details.installer.name = 'pip'

Results

whl_metadata_downloads  whl_downloads  source_downloads  total_downloads
263675949               1157843064     56343             1421575356

263675949 / 1421575356 * 100 = ~18.55%
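The same arithmetic can be re-checked in a few lines of Python. This sketch uses only the four counts reported in the results above. The variable names are mine, and the check that the three buckets add up to the total is an extra sanity check, not part of the original query.

```python
# Counts copied from the query results above.
whl_metadata_downloads = 263_675_949
whl_downloads = 1_157_843_064
source_downloads = 56_343
total_downloads = 1_421_575_356

# Share of recorded "downloads" that are really metadata fetches.
metadata_share_pct = whl_metadata_downloads / total_downloads * 100

# Sanity check: do the three filename buckets account for every row?
buckets_cover_total = (
    whl_metadata_downloads + whl_downloads + source_downloads == total_downloads
)

print(f"metadata share: {metadata_share_pct:.2f}%")
print(f"buckets cover total: {buckets_cover_total}")
```

If the buckets do cover the total, then no other file types (for example, .zip source distributions) appear in this result set.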

Additional Information

The *.whl.metadata files were introduced in PEP 658 as a way for package managers to “inspect distribution metadata without intending to install the distribution”. Support for this was added to pip in version 22.3.