berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Analyze logs data to list course enrollments! #2949

Open balajialg opened 2 years ago

balajialg commented 2 years ago

Analyze the existing log data to calculate usage metrics that can be used to evangelize the service to leadership!

Tasks to be done

balajialg commented 2 years ago

@yuvipanda Any luck with this request?

ericvd-ucb commented 2 years ago

@balajialg I am also interested in seeing whether we can discover the long tail of courses. There are probably a bunch of courses using unique repos, which could lead us to a longer list of courses than we already know about.

balajialg commented 2 years ago

@ericvd-ucb Yes, I am also trying to find the 3k delta between the raw data analysis and the course enrollment information.

Initial analysis of the data shows that the following repos are being used fairly regularly (read the data below as repo name followed by number of hits to the repo). I will spend some time mapping the repos to specific courses and will probably reach out to you to clarify the repos that don't have a clear mapping to any course:

materials-fa21 56754
fa21 31842
PS3-FA21-Public 11598
public-notebooks 11204
ph142-fa21 5288
eecs16a-lab-fa21 4088
stat20-fall21 3745
(no repo name) 3309
public-fa21-hw-notebooks 3306
content-fa21 2980
Bio-1B 1948
88e-fa21.git 1886
Physics-88-Fa21 1835
ECON-140-FA21.git 1544
MCB-32 1451
sp21 1418
d 941
demog180-fa2021 910
PHW251_Fall2021 888
materials-fa20 863
materials-sp20 849
88e-fa21 821
fa20 739
EPS-88-FA21 701
textbook 656
materials-sp21 617
PolSci-88-FA21 604
ER131_2021_public 597
materials-su21 578
fa21_public_hw_notebooks 571
W142Fall2021 557
250-Intro-to-E 549
PH252_F 503
su21 462
ps3-dis-nonfork 453
ENVECON-118-FA21 417
fall2021 406
public_hw_notebooks 372
delong-python-problem-sets 346
su20 340
phy129_fall_2021 338
materials-fa17 300
% 292
materials-fa18 287
materials-su20 270
120_labs_fa21 266
t 256
IB-105-ESPM-125-F 254
public-fa21-hw-solutions 244
materials-fa19 234

yuvipanda commented 2 years ago

@balajialg Can you post the full GitHub URLs? All the materials-* repos are Data 8, for example: https://github.com/data-8/

balajialg commented 2 years ago

@yuvipanda Stored all the URLs in this CSV file for your reference. Let me know if it suffices?

yuvipanda commented 2 years ago

@balajialg that link is not publicly accessible, I think that's private to just your hub.

balajialg commented 2 years ago

@yuvipanda Hope you can access the CSV here! clicks_latest.csv

balajialg commented 2 years ago

@ericvd-ucb @yuvipanda Based on the top 100 repo data from the nbgitpuller analysis, I was able to narrow the difference between the log data and the course enrolment data from 3000 to around 500 users. The current total amounts to 9915 users based on FA 21 enrolment data from https://classes.berkeley.edu/. Can you quickly check whether the courses added to this list are accurate based on prior context?

In particular, I would like to know whether both EECS 16A and 16B use Datahub. I could see logs for 16B but could not find logs specific to 16A for some reason. If both courses are using it, together they contribute 2000+ users (similar to Data 8 and Data 100).

ericvd-ucb commented 2 years ago

Check https://eecs16a.org/ (looks like yes!) and https://eecs16b.org/ (looks like yes).

yuvipanda commented 2 years ago

https://console.cloud.google.com/storage/browser/ucb-datahub-hub-logs/old%20logs?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=ucb-datahub-2018&prefix=&forceOnObjectsSortingFiltering=false has historical logs that @felder uploaded, and can be used to get this information for the last few years. You can get the same raw logs I provided you by:

  1. Downloading these files (I recommend using gsutil cp to download the entire folder).
  2. Searching the files for lines containing "302 GET /hub/user-redirect/", via grep or similar. This was how I produced the logs I gave you, and you can use that to determine which nbgitpuller links were clicked. Alternatively, you can grep for the regex "seconds to|server stopped" to get start / stop times for users as well; that was used in other analysis.
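To make step 2 concrete, here is a minimal sketch of turning the matched lines into per-repo hit counts. It assumes each matching line contains the requested URL with a URL-encoded repo= query parameter (the usual nbgitpuller link shape); the function name count_repo_hits is hypothetical, and only the %3A and %2F escapes are decoded, so treat it as a starting point rather than the method used for the numbers in this thread.

```shell
#!/bin/bash
set -eu

# Read raw hub log lines on stdin and print "<repo URL> <hits>",
# most-clicked repo first.
# ASSUMPTION: the requested URL carries a URL-encoded "repo=" parameter.
count_repo_hits() {
    # Keep only the repo= value of the *requested* URL. [^ ]* cannot cross
    # a space, so the redirect target later on the line is not counted.
    sed -nE 's#.*302 GET /hub/user-redirect/[^ ]*[?&]repo=([^& ]+).*#\1#p' \
        | sed -e 's#%3[Aa]#:#g' -e 's#%2[Ff]#/#g' \
        | LC_ALL=C sort | uniq -c \
        | awk '{ print $2, $1 }' \
        | LC_ALL=C sort -k2,2nr -k1,1
}
```

After sourcing the file, something like count_repo_hits < raw-hub.log prints one repo per line. Counting on the full URL rather than the last path segment keeps repos with the same name in different organizations apart.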

https://console.cloud.google.com/storage/browser/ucb-datahub-hub-logs/stderr;tab=objects?project=ucb-datahub-2018&prefix=&forceOnObjectsSortingFiltering=false has new logs, uploaded every hour. These are in JSON format, one JSON object per line, and you can get the raw logs by looking at the textPayload field in the JSON. You can also get which hub the log line is for (they're all put in together) by looking at the resource.labels.namespace_name field in the JSON. You can use jq to get raw logs similar to the logs in our historical records by doing something like:

cat <json-log> | jq -r '.resource.labels.namespace_name,.textPayload'

Then you can get either nbgitpuller logs or start / stop logs by following the same process as for the historical raw logs.
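Since all hubs are mixed together in the JSON logs, it can help to pull out one hub at a time before grepping. The sketch below does that with jq; the function name hub_lines and the namespace "datahub-prod" are hypothetical placeholders, and it assumes each JSON object also has a top-level timestamp field (the same field the gcloud --format string in this thread uses).

```shell
#!/bin/bash
set -eu

# Print "timestamp<TAB>textPayload" for a single hub.
# stdin: JSON-lines log file; $1: namespace name (e.g. the hypothetical "datahub-prod").
hub_lines() {
    jq -r --arg ns "$1" '
        # Keep only entries for the requested hub that carry a text payload.
        select(.resource.labels.namespace_name == $ns and .textPayload != null)
        # Drop one trailing newline so each entry stays on one line.
        | [.timestamp, (.textPayload | rtrimstr("\n"))]
        | @tsv'
}
```

For example, hub_lines datahub-prod < json-log | grep -E 'seconds to|server stopped' would give the start / stop lines for that one hub.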

Finally, if you just want the last 30 days of logs, you can run the following script:

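#!/bin/bash
# Fetch the last 30 days of server start / stop log lines for every prod hub.
# Needs kubectl, gcloud (authenticated), rg (ripgrep) and choose on the PATH.
set -euo pipefail

# Output file; abort with a usage message if it is not given.
LOG_FILE="${1:?usage: ${0} <output-file-name>}"

# Every namespace whose name contains "prod"; `choose 0` keeps the first column.
for NS in $(kubectl get ns | rg prod | choose 0); do
    echo "Fetching usage logs for ${NS}"
    gcloud logging read \
        "resource.type=\"k8s_container\" \
         resource.labels.cluster_name=\"fall-2019\" \
         resource.labels.container_name=\"hub\" \
         resource.labels.namespace_name=\"${NS}\" \
         textPayload=~\"seconds to|server stopped\"" \
         --freshness=30d  --format="value(timestamp,resource.labels.namespace_name,textPayload)" >> "${LOG_FILE}"
done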

Save this in a script called fetch-logs.bash, make it executable with chmod +x fetch-logs.bash, and run ./fetch-logs.bash <output-file-name>.

Similarly, for the nbgitpuller raw data:

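#!/bin/bash
# Fetch the last 30 days of nbgitpuller link clicks for every prod hub.
# Needs kubectl, gcloud (authenticated), rg (ripgrep) and choose on the PATH.
set -euo pipefail

# Output file; abort with a usage message if it is not given.
LOG_FILE="${1:?usage: ${0} <output-file-name>}"

# Every namespace whose name contains "prod"; `choose 0` keeps the first column.
for NS in $(kubectl get ns | rg prod | choose 0); do
    echo "Fetching nbgitpuller logs for ${NS}"
    gcloud logging read \
        "resource.type=\"k8s_container\" \
         resource.labels.cluster_name=\"fall-2019\" \
         resource.labels.container_name=\"hub\" \
         resource.labels.namespace_name=\"${NS}\" \
         textPayload=~\"302 GET /hub/user-redirect/\"" \
         --freshness=30d  --format="value(timestamp,resource.labels.namespace_name,textPayload)" >> "${LOG_FILE}"
done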

Note that the only difference here is I'm passing a different regex to textPayload. You can play with that as desired. For these scripts to work, you need gcloud installed and authenticated, kubectl pointed at the cluster, and the rg (ripgrep) and choose command-line tools.

Finally, all these logs contain private student data, so treat with extreme caution.

balajialg commented 2 years ago

@yuvipanda This is awesome documentation! Thank you so much. I downloaded the relevant logs from the Google Cloud buckets. Will analyze them based on your inputs. Thanks again.

balajialg commented 2 years ago

I did some analysis on the historical log data over the past 2 days. The historical log data amounted to almost 10 GB and was available up to October 27th, 2021, for all the different hubs. I wrote an R script to determine how much time users spend on the platform as reported by the logs. Total usage comes to a whopping 149.22 years (i.e., almost 150 student-years) across all hubs. It is interesting to see how much time students have spent using Datahub.

I will ensure that these metrics are accurate by further verifying this data. Here is the link to the R notebook that I had worked on - https://notebooksharing.space/view/adcf5283a71a10bc2c2f237868f0f71b9e77291c817e87ea52b41905bfd8c7fe#displayOptions=

Breakdown of usage across major hubs:
Datahub - 87 years
Data 100 - 25 years
Prob 140 - 11.51 years
EECS Hub - 8.98 years
R Hub - 7.26 years
Data 8 hub - 7 years
Public Health and Julia Hub (combined) - 1.16 years
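For anyone who wants to reproduce the aggregation step without R, here is a rough sketch. It is not the R notebook's code: it assumes session durations have already been derived from the start / stop lines and written as "<hub> <seconds>" per session, and it assumes a 365-day year. The function name usage_years is hypothetical.

```shell
#!/bin/bash
set -eu

# stdin: one "<hub> <session seconds>" pair per line.
# stdout: "<hub> <years>" per hub plus an "ALL" row, largest first.
# ASSUMPTION: one year = 365 days = 31536000 seconds.
usage_years() {
    awk -v year_seconds=31536000 '
        { seconds[$1] += $2; total += $2 }
        END {
            for (hub in seconds)
                printf "%s %.2f\n", hub, seconds[hub] / year_seconds
            printf "ALL %.2f\n", total / year_seconds
        }' | LC_ALL=C sort -k2,2nr
}
```

Something like usage_years < sessions.txt would then print the per-hub totals; the hard part, pairing each start with its stop to get the session length, is not covered here.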

balajialg commented 2 years ago

This is what the student enrollment count looks like for the courses across different departments/divisions/colleges as of Fall 2021!

[Image: Usage across departments]

ericvd-ucb commented 2 years ago

I think the unit here is actually where the class is listed, correct? You have done this analysis by which repo they are pulling from and what department it is? So it's not really where the students are registered. And unfortunately CDSS and EECS are overlapping, so maybe you should say DSUS.

balajialg commented 2 years ago

Yes, this analysis is based on where the class is listed and not where the students are registered. What would the discrepancy between the two scenarios be? I can try analyzing with student data if you think that will be more meaningful.