ContinuumIO / anaconda-package-data

Conda package download data
Creative Commons Attribution 4.0 International

Include `.conda` packages #45

Open jakirkham opened 1 year ago

jakirkham commented 1 year ago

It would be helpful to include both .conda & .tar.bz2 packages, particularly as more of the former and fewer of the latter are produced. It may also help to track these separately, to follow the transition to the newer format.

jakirkham commented 1 year ago

cc @beckermr @wolfv

jezdez commented 11 months ago

Looking into this with @cappadona

dopplershift commented 10 months ago

@jezdez Did that go anywhere? I was working on collecting some download numbers for my library and right now 2023 shows minimal downloads due to the transition to .conda.

jakirkham commented 10 months ago

@jezdez did this issue get solved more broadly?

Saw the python packages were fixed recently: https://github.com/ContinuumIO/anaconda-package-data/issues/41

Is there a path for fixing the other packages? Or did this already happen?

cappadona commented 10 months ago

@jakirkham @dopplershift. Apologies for the delay.

We have not yet addressed .conda packages missing from this data set. This work is on our backlog, and we should be able to get this done in November. We will provide updates here, but please don't hesitate to reach out with questions.

jakirkham commented 10 months ago

Thanks Nick! 🙏

cappadona commented 7 months ago

Hi @jakirkham @dopplershift. Quick update on the status of this issue.

We're working on finalizing a new pipeline that will source this public data set and include .conda packages moving forward. We expect to have it ready by the end of March 2024 and will post an update here when it is available.

leofang commented 7 months ago

Hi @cappadona Thanks for the update! Q: Would it be possible to also update the past statistics when the new pipeline is up?

cappadona commented 7 months ago

@leofang At the moment we're not planning to replace any existing files in the bucket and only implement the fix for future data.

jakirkham commented 7 months ago

cc @aterrel @chenghlee (as we discussed this earlier)

leofang commented 5 months ago

Hi @cappadona @jezdez Friendly nudge for updates 🙂 This has impacted several statistics tracking tools and caused confusion. I've heard jabbering about "no one is using conda" as they looked at the download counts from, say, condastats, but it is simply not true.

cappadona commented 5 months ago

Hi @leofang. Thanks for checking in. We are on track to include .conda packages in the dataset by the end of the month.

jakirkham commented 5 months ago

Just wanted to check in, @cappadona how are things looking here?

wolfv commented 5 months ago

Still looks reaaaally flat: https://prefix.dev/channels/conda-forge/packages/aesara (picked a random package)

jakirkham commented 5 months ago

To be fair, Nick said end of the month originally. So end of next week

Though would be good to learn if that is still true or if this is likely to slip

jakirkham commented 4 months ago

@cappadona how are things looking?

cappadona commented 4 months ago

@jakirkham Sorry I missed your earlier message. Thanks for checking in. We're looking good and the March 2024 data published to the s3 bucket later this week will include .conda packages.

I will post an update to this thread once the March data is available.

jakirkham commented 4 months ago

Thanks Nick! 🙏

cappadona commented 4 months ago

Hi all. Quick update. We're just about there. Finalizing QA with the rest of the team, including a colleague who returns next week. Here are a couple examples for March 2024.

Screenshot 2024-04-05 at 5 17 12 PM Screenshot 2024-04-05 at 5 20 53 PM

jakirkham commented 4 months ago

Thanks Nick! 🙏

With numpy this includes some older versions like 1.9.2, are these coming from defaults? Asking as conda-forge jumped to numpy version 1.9.3 (in the 1.9 series). Or is this an amalgamation of different channel statistics?

aesara is only in conda-forge AFAIK. So am guessing the top sheet is based on conda-forge data. Is that right?

cappadona commented 4 months ago

Hi @jakirkham. The screenshot is an aggregation of multiple channels, which are usually identified in the final dataset via the data_source column. I did confirm that conda-forge is the only data source for aesara.
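
For anyone who wants to check the per-channel split themselves, here is a minimal sketch. It assumes a monthly parquet file has already been loaded into a pandas DataFrame with the `pkg_name`, `data_source`, and `counts` columns mentioned in this thread; the helper name `counts_by_source` is hypothetical.

```python
import pandas as pd


def counts_by_source(df: pd.DataFrame, pkg_name: str) -> pd.Series:
    """Total download counts for one package, split by data_source."""
    # Keep only the rows for the requested package
    pkg = df[df["pkg_name"] == pkg_name]
    # Sum counts per channel, largest channel first
    return pkg.groupby("data_source")["counts"].sum().sort_values(ascending=False)
```

A package that only exists on conda-forge, such as aesara, should then come back with a single `conda-forge` entry.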

jakirkham commented 4 months ago

How are things looking @cappadona ?

jakirkham commented 3 months ago

@cappadona are there any updates here?

Also as a side note, users are also asking about March data in this issue: https://github.com/ContinuumIO/anaconda-package-data/issues/51

cappadona commented 3 months ago

Hi @jakirkham. Monthly and hourly data for March and April 2024, which includes .conda packages, are now available in the bucket.

Thank you all for your patience.

jezdez commented 3 months ago

@cappadona Do you think we could update the old files as well, since .conda files had been hosted for a while? Should we keep this ticket open until we fix that?

wolfv commented 3 months ago

So just to get it right, the format of the parquet files changed?

wolfv commented 3 months ago

Neither this command:

condastats overall pandas --start_month 2019-01 --end_month 2019-03 --monthly

Nor

condastats overall pandas --start_month 2024-01 --end_month 2024-03 --monthly

seem to work (both fail with FileNotFoundError: anaconda-package-data/conda/monthly/2024/2024-01.parquet). Did something else change? Note that these were taken from the official anaconda blog: https://www.anaconda.com/blog/get-python-package-download-statistics-with-condastats

I also tried to get the parquet file locally:

import pandas as pd

year = 2024
month = 4

s = f's3://anaconda-package-data/conda/monthly/{year}/{month:02}/{year}-{month:02}.parquet'

pd.read_parquet(s)

But it also fails because it can't find the file.

On our server, it seems to have downloaded the file at least at some point, but the download counts were not updated (maybe because there is a new column that we don't take into account).

jakirkham commented 3 months ago

Thanks Nick! 🙏

So I tried condastats overall pandas --start_month 2024-03 --end_month 2024-04 --monthly

Though got an error from condastats: https://github.com/sophiamyang/condastats/issues/20

Maybe this is due to the same issue Wolf pointed out above?

jezdez commented 3 months ago

I'll ask Sophia to move the project into the conda-incubator, so we can fix it

Edit: https://github.com/sophiamyang/condastats/issues/21

jezdez commented 3 months ago

Nick is out currently and will pick the topic back up when he's back.

wolfv commented 3 months ago

And should the following work?

aws s3 cp s3://anaconda-package-data/conda/monthly/2024/04/2024-04.parquet ./

?

wolfv commented 3 months ago

I just tried to run aws s3 ls s3://anaconda-package-data/conda/monthly/ and that also doesn't work. Can you help me @cappadona? Am I doing something wrong? Is this supposed to work? Or not supported anymore?

wolfv commented 3 months ago

I double checked with the conda-forge scripts and none of the historic download counts seem to be publicly available:

Screenshot 2024-05-15 at 19 34 58

Error:

ClientConnectorError: Cannot connect to host anaconda-package-data.s3.weur.amazonaws.com:443 ssl:default [nodename nor servname provided, or not known]

Running this Notebook: https://github.com/conda-forge/by-the-numbers/blob/main/total%20downloads.ipynb

cappadona commented 3 months ago

Hi. I'm back and catching up...I think there are a couple different issues at play here.

> I just tried to run aws s3 ls s3://anaconda-package-data/conda/monthly/ and that also doesn't work. Can you help me @cappadona? Am I doing something wrong? Is this supposed to work? Or not supported anymore?

@wolfv Can you give this another try with the --no-sign-request option?

aws s3 ls s3://anaconda-package-data/conda/monthly/ --no-sign-request

> Thanks Nick! 🙏
>
> So I tried condastats overall pandas --start_month 2024-03 --end_month 2024-04 --monthly
>
> Though got an error from condastats: sophiamyang/condastats#20
>
> Maybe this is due to the same issue Wolf pointed out above?

@jakirkham I'm able to reproduce this and it looks like the new parquet files for March and April 2024 are missing some pandas specific properties in the file metadata that are expected by condastats.
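
For Python users, the counterpart of `--no-sign-request` is an anonymous S3 connection. A minimal sketch, assuming `pandas` and `s3fs` are installed and that monthly files follow the `conda/monthly/<year>/<year>-<month>.parquet` layout seen in the `FileNotFoundError` message earlier in this thread. The helper names are hypothetical and the layout is an assumption, not something confirmed by the maintainers here.

```python
BUCKET = "anaconda-package-data"


def monthly_parquet_url(year: int, month: int) -> str:
    """Build the S3 URL of one monthly file (the key layout is an assumption)."""
    return f"s3://{BUCKET}/conda/monthly/{year}/{year}-{month:02d}.parquet"


def read_monthly(year: int, month: int):
    """Read one monthly file without AWS credentials (needs network access)."""
    # Imported lazily so the URL helper above works without pandas installed
    import pandas as pd

    # {"anon": True} is handed to s3fs and skips request signing,
    # much like --no-sign-request does for the AWS CLI
    return pd.read_parquet(
        monthly_parquet_url(year, month), storage_options={"anon": True}
    )
```

Calling `read_monthly(2024, 4)` would then attempt to fetch the April 2024 file; whether pandas-specific metadata is present in that file is a separate question, as noted above.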

aterrel commented 3 months ago

confirmed the data is available:

Screenshot 2024-05-20 at 10 24 14 AM

jakirkham commented 3 months ago

@aterrel that appears to be 2023 data. Were you able to load 2024 data from April or March?

aterrel commented 3 months ago

I do see data for 2024-04

Screenshot 2024-05-21 at 10 33 04 AM

cappadona commented 2 months ago

Hi @wolfv. Any luck on your end?

> I just tried to run aws s3 ls s3://anaconda-package-data/conda/monthly/ and that also doesn't work. Can you help me @cappadona? Am I doing something wrong? Is this supposed to work? Or not supported anymore?

@wolfv Can you give this another try with the --no-sign-request option?

aws s3 ls s3://anaconda-package-data/conda/monthly/ --no-sign-request

wolfv commented 2 months ago

Yep, the data is back:

Screenshot 2024-05-29 at 19 01 41

jakirkham commented 2 months ago

Looking @aterrel 's plot above am curious why 12.3 doesn't show up. Is this an issue in the data or the code for the plot?

jakirkham commented 2 months ago

Tried generating my own script to parse through the data. Am seeing the following download counts for cudatoolkit (legacy package for CUDA 11 and earlier) and cuda-version (used in CUDA 12 and later)

```python
#!/usr/bin/env python

import packaging
import sys

from packaging.version import InvalidVersion, Version

import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams["figure.figsize"] = (22, 5)


def main(*argv):
    pkgs = [
        ("cudatoolkit", lambda v: Version("11.2") <= v < Version("12")),
        ("cuda-version", lambda v: Version("12") <= v and str(v) != "12.0.0"),
    ]

    for each_pkg, keep_filter in pkgs:
        year = "2024"
        month = "04"

        df = pd.read_parquet(f"{year}-{month}.parquet")
        df_pkg = df[df["pkg_name"] == each_pkg]

        pkg_vers = []
        for v in df_pkg["pkg_version"].unique():
            try:
                v = Version(v)
            except InvalidVersion:
                # Skip invalid version formats
                continue
            pkg_vers.append(v)
        pkg_vers = sorted(pkg_vers)
        pkg_vers_filt = list(filter(keep_filter, pkg_vers))

        df_pkg_sorted = pd.concat(
            [df_pkg[df_pkg["pkg_version"] == str(v)] for v in pkg_vers_filt]
        )
        df_pkg_plot = df_pkg_sorted[["pkg_version", "counts"]]
        df_pkg_plot["counts"] = df_pkg_plot["counts"] / 1e6

        plt.clf()
        plt.bar(df_pkg_plot["pkg_version"], df_pkg_plot["counts"])
        plt.title(f"{each_pkg} versions vs. Downloads (millions) for {year}-{month}")
        plt.xlabel(f"{each_pkg} versions")
        plt.ylabel("Downloads (millions)")
        plt.savefig(f"{each_pkg}_download_count.svg")

    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv))
```

Here are the results it shows (note values are in millions):

cudatoolkit_download_count

cuda-version_download_count

Admittedly this is only one month

Plus some packages built with CUDA support link to the driver (like Arrow); so, may not pull in either of these packages at install time (despite building with CUDA support)

Also it would be better to group the cudatoolkit patch versions together like how cuda-version is handled

Nevertheless this is a good rough test of the data. It does seem to be picking up download counts for these packages that were missed in prior months (which had been off by a couple orders of magnitude in the worst case)

Edit: Fix issue where 12.0 got cutoff
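
The grouping of cudatoolkit patch versions suggested above could be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, versions are matched with a simple regular expression rather than `packaging`, and any version string that does not start with `X.Y` (including a bare `12`) is skipped.

```python
import re
from collections import defaultdict

# Matches the leading "major.minor" part of a version string
_MAJOR_MINOR = re.compile(r"^(\d+)\.(\d+)")


def major_minor(version: str):
    """Return "X.Y" for versions like "11.8.0", or None if there is no match."""
    m = _MAJOR_MINOR.match(version)
    return f"{m.group(1)}.{m.group(2)}" if m else None


def group_by_major_minor(pairs):
    """Sum counts per "X.Y" series from (version, count) pairs."""
    totals = defaultdict(int)
    for version, count in pairs:
        series = major_minor(version)
        if series is None:
            # Skip version strings that do not start with "X.Y"
            continue
        totals[series] += count
    return dict(totals)
```

With the DataFrame from the script above, this could be called as `group_by_major_minor(zip(df_pkg["pkg_version"], df_pkg["counts"]))` before plotting.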

cappadona commented 2 months ago

Thanks @jakirkham. May 2024 data was made available this past Saturday, June 1st.

As of today, the .conda packages are included in the data for the following months:

Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

cc @jezdez

jakirkham commented 2 months ago

> Based on the community feedback thus far, we're considering replacing data for additional prior months, and updating them to also include .conda packages. Stay tuned.

Thanks Nick! 🙏

This would be incredibly helpful 🙂

h-vetinari commented 2 months ago

It would be amazing to pull these updates back to the introduction of .conda artefacts, both for having a correct history and an accurate total number of downloads. The conda-forge landing page currently prominently displays the latter, and I think we're still not counting over a year of .conda downloads.

If one executes the by-the-numbers notebook linked from the conda-forge landing page (with some minor adaptations to update the loop over the years we're interested in), we get the following for 2021-2023:

Untitled

While there's undoubtedly some variability in the monthly data, to my understanding that sharp drop-off is related to the introduction of .conda around November 2022.

jezdez commented 2 months ago

I agree with @h-vetinari, let's make this available for the whole time period, doesn't make sense otherwise IMO.

wolfv commented 2 months ago

Did something happen with the timestamps? For some reason, we seem to have some new entries at "epoch 0" (i.e. somewhere in 1970)

Screenshot 2024-06-09 at 09 33 05

I'll delete/filter them from our data but just wanted to check if anyone knows what's up?
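
A minimal sketch of such a filter, assuming the data is loaded into a pandas DataFrame and the timestamp column is called `time` (the column name and the helper name are assumptions, and the cause of the 1970 entries is not established in this thread):

```python
import pandas as pd


def drop_epoch_rows(df: pd.DataFrame, time_col: str = "time") -> pd.DataFrame:
    """Drop rows whose timestamp falls in 1970 or cannot be parsed as a date."""
    times = pd.to_datetime(df[time_col])
    # Rows at "epoch 0" have year 1970; missing timestamps (NaT) compare
    # as False here and are dropped as well
    return df[times.dt.year > 1970]
```

This only hides the symptom on the consumer side; the affected rows still exist in the published files.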

jakirkham commented 1 month ago

@cappadona , hope you had a good weekend! 😀

Do you have thoughts on the questions above? To summarize...

  1. Can we backport the .conda count fix to earlier dates?
  2. Do we know how the timestamps are being created (seeing 1970 references)?

Also would add one more...

  3. How are the anaconda.org numbers generated relative to these? Seeing some differences ( https://github.com/conda-incubator/condastats/issues/18 )

Thanks for your help! 🙏

modouldemba commented 2 weeks ago

June and July data is still not available. Is there an issue?

wolfv commented 2 weeks ago

We are removing download counts from our website for now since they don't seem to be very reliable and just look bad :(

jakirkham commented 1 week ago

@cappadona could you please help us with the issues posted above?

Notably: