Open balajialg opened 2 years ago
Thanks for working on this, @balajialg!
I think these lists actually illustrate why using this data to prune libraries is possible, but is a bunch of work that needs to be carefully considered: transitive dependencies. For example, if we take the biology hub and remove `requests`, a lot of other packages will just stop working, as they use it transitively for making HTTP requests. If you uninstall `requests`, `okpy` stops working and grading using that comes to a halt. I only know that okpy is using requests because I was trying to fix something related (https://github.com/okpy/ok-client/pull/473), but I am sure there are a lot of other packages that we don't know about. The same applies for at least the `cryptography`, `certifi` and `cachetools` packages, and I'm sure more.
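To make the transitive dependency concern concrete, here is a minimal sketch of how one might list the installed packages that declare a dependency on a given package before removing it. It uses only the standard library's `importlib.metadata`; the helper names (`reverse_dependencies`, `installed_requirements`) and the sample data are made up for illustration and are not part of the dashboard or the hub images.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a distribution name as in PEP 503 (lowercase, runs of -_. become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the project name from a requirement string such as 'requests>=2.0'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalize(match.group(1)) if match else None


def reverse_dependencies(target, requires_by_package):
    """Return the packages whose declared requirements mention `target`."""
    target = normalize(target)
    return sorted(
        package
        for package, requirements in requires_by_package.items()
        if target in {requirement_name(r) for r in requirements}
    )


def installed_requirements():
    """Map each installed distribution to its declared requirement strings."""
    result = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            result[normalize(name)] = dist.requires or []
    return result


# Illustrative, made-up data; real data would come from installed_requirements().
sample = {
    "grader-client": ["requests>=2.0", "coverage"],
    "plot-helper": ["numpy"],
}
print(reverse_dependencies("requests", sample))
```

Inside a hub image, `reverse_dependencies("requests", installed_requirements())` would list the direct dependents only. It does not follow multi-level chains, and it counts requirements that apply only to optional extras, so the result is a starting point for review rather than a verdict.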
I think the easy use of the dashboard is to ask 'can I remove this one specific package that I want to remove for other reasons?'. It can definitely be used to prune the list of installed packages, but it will need more work to identify which can be removed and which can't be.
@balajialg What yuvi is expressing was my concern as well. Just because a package isn't popular doesn't mean it isn't required by something else that is. Like I see conda for data100, I'd question removing that one as well!
@yuvipanda Does the popcon support track just the explicit imports by user code, or does it also track the imports that get pulled in under the hood? I thought it was hooking into the import mechanism at a low level so was catching everything, but given that okpy uses requests, it sounds like I've got that wrong.
@yuvipanda I did not realize that transitive dependency is going to be a huge challenge with regard to this effort. Thanks for pointing it out! With the context you shared, a few questions I have are,
@ryanlovett it pulls in all imports, transitive or otherwise.
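As an illustration of why an import-level hook sees transitive imports too, here is a small sketch that wraps `builtins.__import__` and records every absolute import triggered while loading one module. This is not how the popularity tracking is actually implemented; the function name `record_imports` and the two throwaway modules (`fake_grader`, `fake_http`) are hypothetical and exist only for the demonstration.

```python
import builtins
import importlib
import sys
import tempfile
from pathlib import Path


def record_imports(module_name):
    """Import `module_name` and return the top-level name of every absolute import made along the way."""
    seen = []
    original_import = builtins.__import__

    def recording_import(name, globals=None, locals=None, fromlist=(), level=0):
        if level == 0:  # relative imports are skipped to keep the sketch short
            seen.append(name.split(".")[0])
        return original_import(name, globals, locals, fromlist, level)

    builtins.__import__ = recording_import
    try:
        recording_import(module_name)
    finally:
        # Always restore the real import function, even if the import fails.
        builtins.__import__ = original_import
    return seen


# Hypothetical modules: fake_grader imports fake_http, like okpy importing requests.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "fake_http.py").write_text("VALUE = 1\n")
    Path(tmp, "fake_grader.py").write_text("import fake_http\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        seen = record_imports("fake_grader")
    finally:
        sys.path.remove(tmp)

print(seen)
```

The user code only asks for `fake_grader`, yet the recorded names should include `fake_http` as well, which is the same reason a rarely "used" package can still show up through another package's imports.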
@balajialg Are you talking about 'installation' which happens only once in the image when we build it, or 'use' which is what this dashboard is tracking?
I think the way to use this dashboard is:
`allensdk` in https://github.com/berkeley-dsep-infra/datahub/pull/3608 for example, because it required a very specific version of pandas that was pretty old).
So the dashboard can be used to determine whether a package we want to remove can be removed, but it cannot be used to 'generate a list of packages to be removed'.
A pruning process would look like:
@yuvipanda Thanks for the detailed pruning process! By "use" I meant how many times Python libraries were imported in notebooks on a given hub. Given the context that almost all the installed packages were used at least once in the past 6 months across all the hubs, what is the specific non-zero usage number that we can safely assume (safely meaning devoid of transitive dependencies) as the threshold for removal? If we strongly believe that it should be 0, I am not sure whether this process will have any impact on the Docker image size.
> Given the context that almost all the installed packages were used at least once in the past 6 months across all the hubs
I don't think this is true - I found no uses of `allensdk` before I removed it, for example. Packages that aren't used just don't show up in the popularity dashboard.
@yuvipanda oh wow, that changes our approach drastically. Now, this list should be compiled based on the differences between our docker image and the popularity dashboard, right?
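A minimal sketch of that comparison, assuming we can get one list of package names installed in the image (for example from `pip list`) and one list of names that appear on the dashboard. The function name `never_imported` and the sample lists are illustrative only, and the sketch assumes both sources report distribution names rather than import names.

```python
import re


def normalize(name):
    """Normalize a distribution name as in PEP 503 so 'Keras_Preprocessing' matches 'keras-preprocessing'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def never_imported(installed, seen_on_dashboard):
    """Return installed packages that do not appear on the dashboard at all."""
    seen = {normalize(name) for name in seen_on_dashboard}
    return sorted({normalize(name) for name in installed} - seen)


# Illustrative, made-up inputs; real inputs would come from the image and the dashboard.
installed = ["allensdk", "requests", "pandas", "Keras-Preprocessing"]
dashboard = ["requests", "pandas", "keras_preprocessing"]
candidates = never_imported(installed, dashboard)
print(candidates)
```

The result is only a list of candidates. Each one still needs the check described earlier in the thread, since a package that is never imported directly may be required for other reasons.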
Summary
Thanks to our last sprint planning meeting, I spent some time trying to figure out the ways to use the python popularity dashboard to make recommendations for the docker image updates. I accessed the python popularity dashboard and filtered the data for the last 6 months with the intention of identifying packages that are least used across all the hubs during the Summer and Spring semesters.
I realized that we have amazing package installation data for the following hubs - i) Datahub, ii) Data 8, iii) Data 100, iv) Data 102, v) Biology, vi) Julia, vii) D-Lab, and viii) Prob 140 hubs. However, the following hubs do not have their package installation data displayed in the dashboard - i) Astro, ii) EECS, iii) High School, iv) ISchool, v) Stat 159, and vi) Stat 20 hubs. One recommendation is to fix the dashboard to reflect the data associated with these hubs.
I wanted to generate a list of packages that had fewer than 5 installations during the past 6 months, meaning they were rarely used as part of any of the assignments. Below are the least used packages for the hubs for which I had access to the data.
Edit: Focus will be on identifying packages that are not listed as part of the Python Popularity Dashboard as they would have had 0 installs!
Biology Hub:
Packages Installed | Number of times installed in the past 6 months
-- | --
requests | 2
protobuf | 2
opt-einsum | 2
keras | 2
httplib2 | 2
h5py | 2
google-auth-httplib2 | 2
google-auth | 2
google-api-python-client | 2
google-api-core | 2
gast | 2
flatbuffers | 2
charset-normalizer | 2
astunparse | 2
absl-py | 2
PySocks | 2
Keras-Preprocessing | 2
screed | 2
bz2file | 2
cryptography | 2
certifi | 2
cachetools | 2

Data 100 Hub
Packages Installed | Number of times installed in the past 6 months
-- | --
Pint | 2
Babel | 2
lyricsgenius | 2
opencv-python | 2
opencv-contrib-python | 2
conda | 2
prob140 | 2
XlsxWriter | 2
imbalanced-learn | 1
featurewiz | 1
category-encoders | 1
HeapDict | 1
pyarrow | 1
uncertainties | 1
Pint | 1
Babel | 1
lyricsgenius | 1
dm-tree | 1
langcodes | 1
zict | 1
lightgbm | 1
imbalanced-learn | 1
featurewiz | 1
category-encoders | 1

Data 102 Hub
Packages Installed | Number of times installed in the past 6 months
-- | --
chart-studio | 1
retrying | 1
tika | 1
pdfplumber | 1
wordcloud | 1
psycopg2 | 1
psycopg2-binary | 1
timer | 1
causalgraphicalmodels | 1
plot-utils | 1
huggingface-hub | 1
tokenizers | 1
transformers | 1
ImageHash | 1
timer | 1
htmlmin | 1
missingno | 1
multimethod | 1
pandas-profiling | 1
phik | 1
huggingface-hub | 1
htmlmin | 1
missingno | 1
multimethod | 1

Data 8 Hub
Packages Installed | Number of times installed in the past 6 months
-- | --
networkx | 1
PyYAML | 1
TPOT | 1
bokeh | 1
deap | 1
scikit-optimize | 1
stopit | 1
update-checker | 2
wget | 2
conda | 2
gdflib | 2
treelib | 2
websockets | 2
selenium | 2
monty | 3
pydantic | 3
pymatgen | 3
spglib | 3
uncertainties | 3

Datahub
Packages Installed | Number of times installed in the past 6 months
-- | --
natsort | 1
pandana | 1
control | 1
RateMyProfessorAPI | 1
rectpack | 1
pyxdf | 1
jupyterthemes | 1
rectpack | 1
pyxdf | 1
jupyterthemes | 1
pyxdf | 1
jupyterthemes | 1
lesscpy | 1
en-core-web-lg | 1
layoutparser | 1
aspose-words | 1
en-core-web-lg | 1
layoutparser | 1
aspose-words | 1
cpi | 1
forestci | 1
pyspark | 1
databricks-connect | 1
casadi | 1
pg-ethics | 1
googlemaps | 1
pyspark | 1
databricks-connect | 1
casadi | 1
pg-ethics | 1
googlemaps | 1
rpy2 | 1

Tasks to complete