berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Python Popularity Dashboard Updates! #3571

Open balajialg opened 1 year ago

balajialg commented 1 year ago

Summary

Following our last sprint planning meeting, I spent some time figuring out how to use the Python popularity dashboard to make recommendations for Docker image updates. I accessed the dashboard and filtered the data for the last 6 months, with the intention of identifying the packages least used across all the hubs during the Summer and Spring semesters.

I realized that we have amazing package installation data for the following hubs - i) Datahub, ii) Data 8, iii) Data 100, iv) Data 102, v) Biology, vi) Julia, vii) D-Lab, and viii) Prob 140 hubs. However, the following hubs do not have their package installation data displayed in the dashboard - i) Astro, ii) EECS, iii) High School, iv) ISchool, v) Stat 159, and vi) Stat 20 hubs. One recommendation is to fix the dashboard to reflect the data associated with these hubs.

I wanted to generate a list of packages that had fewer than 5 installations during the past 6 months, meaning they were rarely used as part of any assignment. Below are the least-used packages across the hubs for which I had access to the data.

Edit: The focus will be on identifying packages that are not listed in the Python Popularity Dashboard at all, as they would have had 0 installs!

Biology Hub

| Packages Installed | Number of times installed in the past 6 months |
| --- | --- |
| requests | 2 |
| protobuf | 2 |
| opt-einsum | 2 |
| keras | 2 |
| httplib2 | 2 |
| h5py | 2 |
| google-auth-httplib2 | 2 |
| google-auth | 2 |
| google-api-python-client | 2 |
| google-api-core | 2 |
| gast | 2 |
| flatbuffers | 2 |
| charset-normalizer | 2 |
| astunparse | 2 |
| absl-py | 2 |
| PySocks | 2 |
| Keras-Preprocessing | 2 |
| screed | 2 |
| bz2file | 2 |
| cryptography | 2 |
| certifi | 2 |
| cachetools | 2 |

Data 100 Hub

| Packages Installed | Number of times installed in the past 6 months |
| --- | --- |
| Pint | 2 |
| Babel | 2 |
| lyricsgenius | 2 |
| opencv-python | 2 |
| opencv-contrib-python | 2 |
| conda | 2 |
| prob140 | 2 |
| XlsxWriter | 2 |
| imbalanced-learn | 1 |
| featurewiz | 1 |
| category-encoders | 1 |
| HeapDict | 1 |
| pyarrow | 1 |
| uncertainties | 1 |
| Pint | 1 |
| Babel | 1 |
| lyricsgenius | 1 |
| dm-tree | 1 |
| langcodes | 1 |
| zict | 1 |
| lightgbm | 1 |

Data 102 Hub

| Packages Installed | Number of times installed in the past 6 months |
| --- | --- |
| chart-studio | 1 |
| retrying | 1 |
| tika | 1 |
| pdfplumber | 1 |
| wordcloud | 1 |
| psycopg2 | 1 |
| psycopg2-binary | 1 |
| timer | 1 |
| causalgraphicalmodels | 1 |
| plot-utils | 1 |
| huggingface-hub | 1 |
| tokenizers | 1 |
| transformers | 1 |
| ImageHash | 1 |
| htmlmin | 1 |
| missingno | 1 |
| multimethod | 1 |
| pandas-profiling | 1 |
| phik | 1 |

Data 8 Hub

| Packages Installed | Number of times installed in the past 6 months |
| --- | --- |
| networkx | 1 |
| PyYAML | 1 |
| TPOT | 1 |
| bokeh | 1 |
| deap | 1 |
| scikit-optimize | 1 |
| stopit | 1 |
| update-checker | 2 |
| wget | 2 |
| conda | 2 |
| gdflib | 2 |
| treelib | 2 |
| websockets | 2 |
| selenium | 2 |
| monty | 3 |
| pydantic | 3 |
| pymatgen | 3 |
| spglib | 3 |
| uncertainties | 3 |

Datahub

| Packages Installed | Number of times installed in the past 6 months |
| --- | --- |
| natsort | 1 |
| pandana | 1 |
| control | 1 |
| RateMyProfessorAPI | 1 |
| rectpack | 1 |
| pyxdf | 1 |
| jupyterthemes | 1 |
| lesscpy | 1 |
| en-core-web-lg | 1 |
| layoutparser | 1 |
| aspose-words | 1 |
| cpi | 1 |
| forestci | 1 |
| pyspark | 1 |
| databricks-connect | 1 |
| casadi | 1 |
| pg-ethics | 1 |
| googlemaps | 1 |
| rpy2 | 1 |

Tasks to complete

yuvipanda commented 1 year ago

Thanks for working on this, @balajialg!

I think these lists actually illustrate why using this data to prune libraries is work that is possible but needs careful consideration: transitive dependencies. For example, if we take the Biology hub and remove requests, a lot of other packages will just stop working, because they use it transitively for making HTTP requests. If you uninstall requests, okpy stops working and grading that relies on it comes to a halt. I only know that okpy uses requests because I was trying to fix something related (https://github.com/okpy/ok-client/pull/473), but I am sure there are a lot of other packages we don't know about. The same applies to at least the cryptography, certifi, and cachetools packages, and I'm sure more.
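To make the transitive dependency problem concrete, here is a minimal sketch (illustrative only, not part of our tooling) of one way to surface declared reverse dependencies inside an image; the target package name is just an example. Note it only catches dependencies that are declared in package metadata, so a package importing requests without declaring it would still be missed:

```python
# Illustrative sketch: list installed distributions that declare a
# dependency on a given package, using only the standard library.
import re
from importlib.metadata import distributions

target = "requests"
dependents = set()
for dist in distributions():
    for req in dist.requires or []:
        # Requirement strings look like "requests>=2.25; extra == 'socks'";
        # keep just the leading project name.
        name = re.split(r"[\s;<>=!~\[(]", req, maxsplit=1)[0].lower()
        if name == target:
            dependents.add(dist.metadata["Name"])

print(sorted(dependents))
```

`pip show requests` prints the same information for a single package on its `Required-by:` line.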

I think the easy use of the dashboard is to answer "can I remove this one specific package that I want to remove for other reasons?". It can definitely be used to prune the list of installed packages, but that will need more work to identify which packages can be removed and which can't.

felder commented 1 year ago

@balajialg What yuvi is expressing was my concern as well. Just because a package isn't popular doesn't mean it isn't required by something else that is. For instance, I see conda listed for Data 100; I'd question removing that one as well!

ryanlovett commented 1 year ago

@yuvipanda Does popcon track just the explicit imports in user code, or does it also track the imports that get pulled in under the hood? I thought it was hooking into the import mechanism at a low level and thus catching everything, but given that okpy uses requests, it sounds like I've got that wrong.

balajialg commented 1 year ago

@yuvipanda I did not realize that transitive dependencies were going to be such a huge challenge for this effort. Thanks for pointing it out! With the context you shared, I have a few questions:

  1. Is there a way to track the packages "installed under the hood" (thanks @rylo for coining this term) as part of the Python Popularity Dashboard? If not, what other ways exist to find transitive dependencies? Is it trial and error, or is there a method to the madness?
  2. Considering the trade-off involved in accounting for transitive dependencies, should we prune the image based on our exploration, or just let the Docker image grow large (to the point where we realize that pruning is the only way forward, or have we reached that stage already)?
yuvipanda commented 1 year ago

@ryanlovett it pulls in all imports, transitive or otherwise.
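For anyone curious how catching every import, transitive or otherwise, is even possible: a rough illustration of the general mechanism (a sketch only, not the actual popularity-contest code) is to record what is in `sys.modules` when the interpreter exits, since that includes modules pulled in indirectly:

```python
# Rough illustration of the mechanism, not the actual popcon code:
# at interpreter exit, sys.modules contains every module imported
# during the session, including ones pulled in transitively.
import atexit
import sys

def report_used_packages():
    # Reduce "pkg.sub.mod" entries to top-level package names.
    top_level = {name.split(".")[0] for name in sys.modules
                 if not name.startswith("_")}
    print(sorted(top_level))

atexit.register(report_used_packages)
```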

@balajialg Are you talking about 'installation' which happens only once in the image when we build it, or 'use' which is what this dashboard is tracking?

I think the way to use this dashboard is:

  1. Look at a package we are explicitly installing in our image, and decide we want to remove it for some reason (I removed allensdk in https://github.com/berkeley-dsep-infra/datahub/pull/3608, for example, because it required a very specific, pretty old version of pandas).
  2. Use the dashboard to determine if anyone is actually using it, or if it can be removed.

So the dashboard can be used to determine whether a package we want to remove can be removed, but it cannot be used to "generate a list of packages to remove".

yuvipanda commented 1 year ago

A pruning process would look like:

  1. Look at https://github.com/berkeley-dsep-infra/datahub/blob/staging/deployments/datahub/images/default/requirements.txt
  2. Consider bunches of packages installed for specific classes
  3. Investigate whether they have been used at all; a non-zero use count already complicates things
  4. But if the use count is 0, the package can be removed (see the sketch after this list)
  5. Repeat and see how much smaller our image gets!
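
A hypothetical sketch of steps 3 and 4, diffing the image's requirements against a usage export; the CSV file name and its "package" column are assumptions, and the real dashboard export format may differ:

```python
# Hypothetical sketch of steps 3-4: packages pinned in requirements.txt
# that never appear in the popularity dashboard are removal candidates.
# "popularity_export.csv" and its "package" column are assumptions.
import csv
import re

with open("requirements.txt") as f:
    pinned = {
        re.split(r"[<>=!~\[;]", line.strip(), maxsplit=1)[0].lower()
        for line in f
        if line.strip() and not line.lstrip().startswith("#")
    }

with open("popularity_export.csv") as f:
    used = {row["package"].lower() for row in csv.DictReader(f)}

print(sorted(pinned - used))
```

One wrinkle: the dashboard records import names while requirements.txt uses PyPI distribution names (e.g. opencv-python vs cv2), so a real diff would need a name mapping between the two.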
balajialg commented 1 year ago

@yuvipanda Thanks for the detailed pruning process! I was referring to "use": how many times Python libraries were imported in notebooks on a specific hub. Given the context that almost all the installed packages were used at least once in the past 6 months across all the hubs, what specific non-zero usage count can we safely assume (safely, as in not confounded by transitive dependencies) as the threshold for removal? If we strongly believe it should be 0, I am not sure this process will have any impact on the Docker image size.

yuvipanda commented 1 year ago

> Given the context that almost all the installed packages were used at least once in the past 6 months across all the hubs

I don't think this is true - I found no uses of allensdk before I removed it, for example. Packages that aren't used just don't show up in the popularity dashboard.

balajialg commented 1 year ago

@yuvipanda oh wow, that changes our approach drastically. So this list should be compiled from the difference between the packages in our Docker image and those that show up in the popularity dashboard, right?