Closed xflr6 closed 2 years ago
This looks like an upstream issue. I'll note a few things.

- The `the-psf` table has been recording roughly 2x as many records as the `bigquery-public-data` table since around that time.
- Note that even though the source code refers to the `bigquery-public-data` table, I forgot to deploy these changes, so the website is still pulling from the `the-psf` table.

Some queries:
Old table (in use):

    SELECT
      DATE(timestamp) AS dt,
      COUNT(*) AS ct
    FROM
      `the-psf.pypi.file_downloads`
    WHERE
      DATE(timestamp) > '2021-11-20'
    GROUP BY
      dt
    ORDER BY
      dt ASC

returns:

Row | dt | ct
---|---|---
1 | 2021-11-21 | 277375804
2 | 2021-11-22 | 460365153
3 | 2021-11-23 | 1262828731
4 | 2021-11-24 | 979330195
5 | 2021-11-25 | 667038871
6 | 2021-11-26 | 297595782
7 | 2021-11-27 | 1663453
8 | 2021-11-28 | 683178

New table (not in use):

    SELECT
      DATE(timestamp) AS dt,
      COUNT(*) AS ct
    FROM
      `bigquery-public-data.pypi.file_downloads`
    WHERE
      DATE(timestamp) > '2021-11-20'
    GROUP BY
      dt
    ORDER BY
      dt ASC

returns:

Row | dt | ct
---|---|---
1 | 2021-11-21 | 277321963
2 | 2021-11-22 | 441768868
3 | 2021-11-23 | 467591439
4 | 2021-11-24 | 447787347
5 | 2021-11-25 | 400089780
6 | 2021-11-26 | 163602965
7 | 2021-11-27 | 780878
8 | 2021-11-28 | 374384

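
To put a number on the "roughly 2X" observation, the daily counts from the two result sets above can be divided day by day. Here is a minimal sketch in Python, with the counts copied from the query results; the variable names are illustrative only and are not taken from the pypistats code:

```python
# Daily row counts copied from the two query results above.
the_psf = {
    "2021-11-21": 277375804,
    "2021-11-22": 460365153,
    "2021-11-23": 1262828731,
    "2021-11-24": 979330195,
    "2021-11-25": 667038871,
    "2021-11-26": 297595782,
    "2021-11-27": 1663453,
    "2021-11-28": 683178,
}
public = {
    "2021-11-21": 277321963,
    "2021-11-22": 441768868,
    "2021-11-23": 467591439,
    "2021-11-24": 447787347,
    "2021-11-25": 400089780,
    "2021-11-26": 163602965,
    "2021-11-27": 780878,
    "2021-11-28": 374384,
}

# Ratio of the-psf rows to bigquery-public-data rows for each day.
ratios = {day: the_psf[day] / public[day] for day in the_psf}

for day, ratio in ratios.items():
    print(f"{day}  {ratio:.2f}x")
```

By this calculation the two tables agree to within about 5% through 2021-11-22 and diverge from 2021-11-23 onward, where the ratio ranges from roughly 1.7x to 2.7x.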
Thanks for the quick response and analysis.

> Note that even though the source code refers to the `bigquery-public-data` table, I forgot to deploy these changes so the website is still pulling from `the-psf` table.

Would it be worth a try to check if the former does not have the issue (I switched to it in this notebook a while ago)?
P.S.: ignore me, now I see that both have an issue :)
Just finished resolving our data pipeline issues, but crucially you should migrate to consuming from the `bigquery-public-data` dataset rather than `the-psf`, and reprocess 2021-11-23 as well as all days since to get the most accurate data. We will likely not be backfilling the `the-psf` dataset.
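
For a consumer of this data, the reprocessing advice above amounts to re-running the daily aggregation for every date from 2021-11-23 onward against the `bigquery-public-data` table. A rough sketch of enumerating those days follows. `days_to_reprocess` and `build_query` are hypothetical helpers for illustration, not part of pypistats or of any BigQuery client, and the query text simply mirrors the ones earlier in this thread:

```python
from datetime import date, timedelta

# Table recommended in this thread; the first affected day is 2021-11-23.
TABLE = "bigquery-public-data.pypi.file_downloads"
FIRST_BAD_DAY = date(2021, 11, 23)


def days_to_reprocess(end, start=FIRST_BAD_DAY):
    """Return every date from start through end, inclusive."""
    return [start + timedelta(days=n) for n in range((end - start).days + 1)]


def build_query(day):
    """Build a per-day row-count query string (hypothetical helper)."""
    return (
        "SELECT COUNT(*) AS ct "
        f"FROM `{TABLE}` "
        f"WHERE DATE(timestamp) = '{day.isoformat()}'"
    )


# Example: the affected days visible in the query results above.
for day in days_to_reprocess(end=date(2021, 11, 28)):
    print(build_query(day))
```

In practice `end` would be the current date, and each query would be run through whatever client the consuming project already uses.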
Thank you!
So when https://github.com/crflynn/pypistats.org/pull/39 is deployed, pypistats.org should be all set.
See e.g. https://pypistats.org/packages/pip