crflynn / pypistats.org

PyPI downloads analytics dashboard
https://pypistats.org/

Download stats dropped close to 0 since 2021-11-24? #44

Closed: xflr6 closed this issue 2 years ago

xflr6 commented 2 years ago

See e.g. https://pypistats.org/packages/pip

crflynn commented 2 years ago

This looks like an upstream issue. I'll note a few things.

  1. Download counts start to look abnormal on 2021-11-22/2021-11-23.
  2. Since around that time, the the-psf table has been recording roughly twice as many records as the bigquery-public-data table.
  3. Significant data loss starts on 2021-11-26.

Note that even though the source code refers to the bigquery-public-data table, I forgot to deploy these changes so the website is still pulling from the-psf table.

Some queries:

Old table (in use)

SELECT
  DATE(timestamp) AS dt,
  COUNT(*) AS ct
FROM
  `the-psf.pypi.file_downloads`
WHERE
  DATE(timestamp) > '2021-11-20'
GROUP BY
  dt
ORDER BY
  dt ASC

returns

Row  dt          ct
1    2021-11-21  277375804
2    2021-11-22  460365153
3    2021-11-23  1262828731
4    2021-11-24  979330195
5    2021-11-25  667038871
6    2021-11-26  297595782
7    2021-11-27  1663453
8    2021-11-28  683178

New table (not in use)

SELECT
  DATE(timestamp) AS dt,
  COUNT(*) AS ct
FROM
  `bigquery-public-data.pypi.file_downloads`
WHERE
  DATE(timestamp) > '2021-11-20'
GROUP BY
  dt
ORDER BY
  dt ASC

returns

Row  dt          ct
1    2021-11-21  277321963
2    2021-11-22  441768868
3    2021-11-23  467591439
4    2021-11-24  447787347
5    2021-11-25  400089780
6    2021-11-26  163602965
7    2021-11-27  780878
8    2021-11-28  374384
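
To make the gap between the two tables easier to see, here is a minimal Python sketch that divides the the-psf daily counts by the bigquery-public-data daily counts. The numbers are copied from the two query results above; the names `THE_PSF`, `PUBLIC_DATA`, and `count_ratios` are made up for this illustration and are not part of pypistats.org.

```python
# Daily row counts copied from the two query results in this comment.
THE_PSF = {
    "2021-11-21": 277375804,
    "2021-11-22": 460365153,
    "2021-11-23": 1262828731,
    "2021-11-24": 979330195,
    "2021-11-25": 667038871,
    "2021-11-26": 297595782,
    "2021-11-27": 1663453,
    "2021-11-28": 683178,
}
PUBLIC_DATA = {
    "2021-11-21": 277321963,
    "2021-11-22": 441768868,
    "2021-11-23": 467591439,
    "2021-11-24": 447787347,
    "2021-11-25": 400089780,
    "2021-11-26": 163602965,
    "2021-11-27": 780878,
    "2021-11-28": 374384,
}


def count_ratios(old, new):
    """Return {day: old_count / new_count} for days present in both tables."""
    return {day: old[day] / new[day] for day in sorted(old) if day in new}


if __name__ == "__main__":
    for day, ratio in count_ratios(THE_PSF, PUBLIC_DATA).items():
        print(f"{day}  the-psf={THE_PSF[day]:>13,}  "
              f"public={PUBLIC_DATA[day]:>13,}  ratio={ratio:.2f}")
```

On 2021-11-21 the two tables agree almost exactly. From 2021-11-23 onward the the-psf counts are between roughly 1.7 and 2.7 times the bigquery-public-data counts, which matches the "roughly 2X" observation, and both tables collapse after 2021-11-26.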

xflr6 commented 2 years ago

Thanks for the quick response and analysis.

> Note that even though the source code refers to the bigquery-public-data table, I forgot to deploy these changes so the website is still pulling from the-psf table.

Would it be worth checking whether the bigquery-public-data table is free of the issue? (I switched to it in this notebook a while ago.)

P.S.: ignore me, now I see that both have an issue :)

crflynn commented 2 years ago

Related: https://status.python.org/incidents/2jj696st6yn5

ewdurbin commented 2 years ago

Just finished resolving our data pipeline issues. Crucially, you should migrate to consuming from the bigquery-public-data dataset rather than the-psf, and reprocess 2021-11-23 as well as all days since to get the most accurate data. We will likely not be backfilling the the-psf dataset.
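
For anyone scripting the reprocessing, a small sketch of listing the affected days might look like this. It only enumerates dates starting at 2021-11-23; how pypistats.org actually re-ingests a day is not described in this thread, and `days_to_reprocess` is a hypothetical helper, not an existing function in the project.

```python
from datetime import date, timedelta

# First day that should be reprocessed, per the comment above.
REPROCESS_START = date(2021, 11, 23)


def days_to_reprocess(through, start=REPROCESS_START):
    """Return every date from `start` through `through`, inclusive."""
    if through < start:
        return []
    span = (through - start).days
    return [start + timedelta(days=offset) for offset in range(span + 1)]


if __name__ == "__main__":
    # The end date here is an arbitrary example, not taken from the thread.
    for day in days_to_reprocess(date(2021, 11, 30)):
        print(day.isoformat())
```

Each date could then be fed to whatever per-day job reads from `bigquery-public-data.pypi.file_downloads`.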

hugovk commented 2 years ago

Thank you!

So when https://github.com/crflynn/pypistats.org/pull/39 is deployed, pypistats.org should be all set.