berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Analyze files stored in Per Course/Department Filestores #4414

Open balajialg opened 1 year ago

balajialg commented 1 year ago

Summary

Thanks to @shaneknapp's awesome work labeling GCP billing reports (#4381), we can calculate per-hub costs with reasonable accuracy. Along with @ericvd-ucb, I looked at the billing data for the hubs used by some of our big classes such as Data 8, Data 100, Stat 159, etc. We realized that Data 100 cloud spend was north of $200 per day (based on the limited data we had), and that the majority of the cost came from Google Filestore instances, which amounted to almost 11 TB.

Our initial hypotheses were that a) many of the files are old student files that have not yet been archived, and/or b) large datasets are stored in users' home directories. We analyzed a few students' home directories to see whether we could make recommendations to the instructors. We could use this information to move files, or to recommend that instructors use leaner datasets, which could let us downsize the filestore for this class and save $$$. This will be an ongoing project to document findings about the nature of files stored in student home directories.

We tried to understand what kinds of files are stored in the accounts of students who last accessed the hub 4 months ago (meaning these students completed Data 100 during Fall 22). We ran the following commands in student terminals (accessed through the admin interface, filtering for the "last 4 months"):

`find . -type f -name '*.csv' -exec du -ch {} + | grep total$` to find the total size of datasets (`.csv` files) stored in a particular student's home directory

`find . -type f -name '*.ipynb' -exec du -ch {} + | grep total$` to find the total size of notebook files (`.ipynb`) stored in a particular student's home directory
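Running one `find` per extension by hand gets tedious, so the per-extension checks above can be wrapped in a small loop that summarizes a whole home directory in one pass. This is a sketch, assuming GNU coreutils (for `du --files0-from`) and an extension list of our choosing; feeding all the file names to a single `du` invocation also avoids the case where `find -exec … +` batches the files into several `du` calls, each printing its own `total` line:

```shell
#!/usr/bin/env bash
# Sum the on-disk size of files per extension under the current directory.
# The extension list is an assumption; extend it as needed.
for ext in ipynb csv zip nc png py; do
  # Pipe all matching names to one du invocation so there is exactly one
  # grand total; `awk 'END …'` keeps only the size from that final line.
  total=$(find . -type f -name "*.${ext}" -print0 \
          | du -ch --files0-from=- 2>/dev/null \
          | awk 'END {print $1}')
  echo "${ext}: ${total:-0}"
done
```

Run from inside a student's home directory, it prints one `ext: size` line per extension.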

From our exploration, we found that students store many different kinds of files: `.ipynb`, `.csv`, `.zip`, `.nc`, `.txt`, `.png`, etc.

For one particular student, the total size of all their files summed to 2.6G, and the distribution by file type looked like this:

```
(notebook) jovyan@jupyter-xxx:~$ find . -type f -name '*.nc' -exec du -ch {} + | grep total$
530M	total
(notebook) jovyan@jupyter-xxx:~$ find . -type f -name '*.zip' -exec du -ch {} + | grep total$
451M	total
(notebook) jovyan@jupyter-xxx:~$ find . -type f -name '*.csv' -exec du -ch {} + | grep total$
517M	total
(notebook) jovyan@jupyter-xxx:~$ find . -type f -name '*.png' -exec du -ch {} + | grep total$
12M	total
(notebook) jovyan@jupyter-xxx:~$ find . -type f -name '*.ipynb' -exec du -ch {} + | grep total$
127M	total
(notebook) jovyan@jupyter-xxx:~$ find . -type f -name '*.py' -exec du -ch {} + | grep total$
964K	total
```

Representing the student data from Data 100 as a table:

| User | .ipynb | .csv | .zip | .nc | .png | .py | Total size | Total size without shared directory |
|------|--------|------|------|-----|------|-----|------------|-------------------------------------|
| User 1 | 127M | 517M | 451M | 530M | 12M | 964K | 2.6G | |
| User 2 | 59M | 483M | 263M | 530M | 5.6M | 120K | 1.8G | |
| User 3 | 69M | 327M | 308M | 530M | 7.2M | 104K | 1.9G | |
| User 4 | 199M | 616M | 956M | 561M | 23M | 3M | 6.8G | |

I ran `du -shc --exclude=shared` to get the data stored in user directories without counting the shared directory.
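The with/without comparison can be printed side by side for a given user. A minimal sketch, assuming GNU `du` (which provides `--exclude`) and that the directory is named `shared`, as on our hubs:

```shell
# Totals for the current directory, with and without the shared directory.
# "shared" is the directory name used on our hubs; adjust for other layouts.
with_shared=$(du -sh . | cut -f1)
without_shared=$(du -sh --exclude=shared . | cut -f1)
echo "total: ${with_shared}, excluding shared/: ${without_shared}"
```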

In this example, the `.nc` and `.csv` files correspond to datasets and together come to almost ~1GB (about 50% of the total file storage). Notebook files (`.ipynb`) amounted to almost 127MB, and autograder-related files (`.zip`, `.py`) amounted to ~500 MB.

I have been spending some time analyzing the kinds of files used in the Biology hub. Almost ~40G worth of files show up for a single user, of which almost ~33G are `.fastq` files (a format for storing biological sequences), stored mostly in the shared directory.

Representing student data from the Biology hubs as a table (blank cells were not measured for that user):

| User | .fastq (mostly in shared dir) | .ipynb | .csv | .zip | .nc | .png | .py | .gz (mostly in shared dir) | Total size without shared directory | Total size |
|------|-------------------------------|--------|------|------|-----|------|-----|----------------------------|-------------------------------------|------------|
| User 1 | 33G | 816K | 4.8M | 147M | 0M | 1.3M | 0M | 1.3G | 6.8 GB | 40G |
| User 2 | 33G | 432K | 9M | | | | 0M | 1.2G | 3.8 GB | |
| User 3 | 33G | 45M | 3M | | | | 0M | 1.2G | 123M | |
| User 4 | 33G | 3.3M | 5M | | | | 0M | 1.2G | 12M | |
| User 5 | 316G | 2.4M | 392K | | | | 0M | 1.4G | 578G | 612G |
| User 6 | 35G | 2.3M | 5M | | | | 0M | 1.2G | 21G | 54G |

Almost 33G worth of data is stored in the shared directory of the Biology hub, and most of those files are in `.fastq` format.

We will keep adding information about students' home directories here so that we can build a comprehensive summary of what is going on. We hope to use it to make dataset recommendations to instructors and reduce the size of the filestores. In addition, we can use this data to formulate our storage policy for per-hub filestores like Data 100's.
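To make this repeatable, the per-user exploration could be scripted from one place instead of being run by hand in each student terminal. A minimal sketch, assuming user homes are mounted under `/home` and the shared dataset directory is named `shared` (both assumptions; the real mount layout varies per hub), with GNU findutils/coreutils:

```shell
#!/usr/bin/env bash
# Emit a TSV of per-user, per-extension disk usage, skipping the shared dir.
# /home and the directory name "shared" are assumptions about the hub layout.
printf 'user\text\tsize\n'
for home in /home/*/; do
  user=$(basename "$home")
  for ext in ipynb csv zip nc fastq png py gz; do
    # -prune skips any "shared" directory; the rest goes to one du call.
    size=$(find "$home" -type d -name shared -prune -o \
                -type f -name "*.${ext}" -print0 \
           | du -ch --files0-from=- 2>/dev/null \
           | awk 'END {print $1}')
    printf '%s\t%s\t%s\n' "$user" "$ext" "${size:-0}"
  done
done
```

The TSV output can then be pasted into (or pivoted into) tables like the ones above without touching individual student sessions.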

User Stories

As an infra admin, I need to understand the amount and type of files stored in Data 100 hub student home directories to provide pedagogical recommendations to instructors

As an infra admin, I want to define storage policy for per-hub courses

Important information

Tasks to complete

balajialg commented 1 year ago

A few questions I would like the Data 100 staff to clarify after analyzing a few users' home directories:

  1. What is the pedagogical use case for .nc and .zip files?
  2. Can `.nc`, `.zip`, and `.csv` files be moved to the shared read-write directories instead of individual users' home directories? What are the trade-offs of making this decision?
balajialg commented 1 year ago

@ryanlovett How trustworthy is the last-activity column in the Biology hub admin dashboard? Can we use that data to make informed decisions about our users' activity in the Biology hub (like when they last accessed it)? Also, is it possible to sort the entire dataset (not just the 50 entries shown on each page) by last activity? This would help with my exploration here.

balajialg commented 1 year ago

Never mind, sorting by last activity is an open bug in JupyterHub: https://github.com/jupyterhub/jupyterhub/issues/3816