berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Feature audit based on logs history #3267

Open balajialg opened 2 years ago

balajialg commented 2 years ago

Summary

Thanks to @yuvipanda's nudge, I started working on the feature matrix document to segment our instructors based on the type of features they use. My initial understanding is that we can classify instructors into three archetypes, based on whether they use foundational, intermediate, or complex features. I also spent some time mapping features to user archetypes in the doc. Open to the team's input on whether the classification makes sense.

I would also like the team's input on whether it is possible to retrieve usage metrics for a particular feature (thanks @ryanlovett for nudging me to think in this direction). Just like the Python popularity dashboard, which tracks the usage of Python libraries, is it possible for us to track the features most used by our instructors? We could use this information during semester onboarding to tailor the feature demo based on their prior usage.

Feature List

User Stories

Tasks

ryanlovett commented 2 years ago

I think the logs within the Google Cloud console should show the frequency of URLs like /tree, /rstudio, /lab, etc. They would also show nbgitpuller, syncthing, and desktop URLs. Basically, any feature that appears in the URL can be tracked.

However, course info is not in the URL, except in cases where it happens to appear in a git repo name. The latter doesn't follow any strict format, though; instructors can name their repos however they like.
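As a rough sketch of the idea above: once log lines are exported (for example with `gcloud logging read`), feature usage can be tallied by matching the URL paths mentioned. The sample log lines and the exact path list below are illustrative assumptions, not the actual DataHub log format.

```python
import re
from collections import Counter

# Hypothetical feature paths to look for in request URLs; the real list
# would come from the hub's actual endpoints.
FEATURE_PATTERNS = {
    "jupyterlab": re.compile(r"/lab\b"),
    "classic-notebook": re.compile(r"/tree\b"),
    "rstudio": re.compile(r"/rstudio\b"),
    "desktop": re.compile(r"/desktop\b"),
    "nbgitpuller": re.compile(r"/hub/user-redirect/git-pull\b"),
}

def count_features(log_lines):
    """Tally how often each feature's URL appears in the given log lines."""
    counts = Counter()
    for line in log_lines:
        for feature, pattern in FEATURE_PATTERNS.items():
            if pattern.search(line):
                counts[feature] += 1
    return counts

# Illustrative request lines (not real Datahub logs):
sample = [
    "GET /user/alice/lab 200",
    "GET /user/bob/rstudio 302",
    "GET /hub/user-redirect/git-pull?repo=... 302",
    "GET /user/carol/lab 200",
]
print(count_features(sample))
```

The same counting could feed a dashboard like the Python popularity one, just keyed on feature paths instead of library names.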

balajialg commented 2 years ago

@ryanlovett Amazing to know that feature-related metrics can be tracked. I can understand the complexity of retrieving course-related information. I will follow up on ways to retrieve feature-related information in the Google Cloud console.

balajialg commented 2 years ago

Next Steps from March Sprint Planning Meeting:

balajialg commented 2 years ago

@felder - When you have some free time, can you please let me know whether we can get analytics data to answer the questions below? It would help build a narrative around Datahub's value proposition.

If instructor-level data is not available, how would you like these questions to be framed so that we can get data that is closely relevant to what is being asked?

balajialg commented 2 years ago

@felder Sharing the context from our Slack conversation: instructor- and course-specific data cannot be retrieved from the GCP logs that are currently stored. It would require us to figure out another mechanism to fetch the data (possibly nbgitpuller links).

I will set up some time with you the week after to figure out the near-term scope of the data to be retrieved and options we can explore to answer the highlighted questions in the longer run.

balajialg commented 2 years ago

Qualitative Insights based on preliminary log analysis from the last 30 days:

ryanlovett commented 2 years ago

@balajialg Interesting, thanks! Do you know what admin is being used for? Is it just to view the list of people, or is it being used to stop/start servers too?

balajialg commented 2 years ago

@ryanlovett I searched

```
resource.type="k8s_container"
resource.labels.cluster_name="fall-2019"
resource.labels.namespace_name=~"-prod"
resource.labels.container_name="notebook"
textPayload=~"oauth2/authorize?"
```

in the Log Explorer to see how many users actively click the "access server" option to access other users' hub instances. It appears that almost all hub users noted in the above comment access other users' instances. Let me know whether querying for "oauth2/authorize?" in the text payload is the right way to identify users clicking the access-server option.

Here is the link to the GCP Log Explorer with the search query.

yuvipanda commented 2 years ago

@balajialg I'm not sure, but oauth2/authorize may also be used each time a user logs in, regardless of whether it's done with admin access or not.

Another way is to look at just the hub logs and look for uses of the admin panel there by URL, with `resource.labels.container_name="hub"`.
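A minimal sketch of that filtering idea, assuming plain-text request lines (the actual hub log format may differ): keep lines hitting /hub/admin while dropping the oauth2/authorize lines that, per the caveat above, may appear on every login.

```python
import re

# /hub/admin requests suggest admin-panel use; oauth2/authorize shows up
# on ordinary logins too, so it is excluded (assumption from the thread).
ADMIN_RE = re.compile(r"\s/hub/admin\b")
LOGIN_RE = re.compile(r"oauth2/authorize")

def admin_panel_lines(log_lines):
    """Return only the lines that look like admin-panel requests."""
    return [
        line for line in log_lines
        if ADMIN_RE.search(line) and not LOGIN_RE.search(line)
    ]

sample = [
    "GET /hub/admin 200",
    "GET /hub/api/oauth2/authorize?client_id=... 302",
    "GET /hub/home 200",
]
print(admin_panel_lines(sample))  # only the /hub/admin line survives
```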

yuvipanda commented 2 years ago

The other point is that these are logs from only the last 30 days, so we can't make inferences about longer-term usage patterns. We could start saving the nginx logs too, though, and make that happen.

balajialg commented 2 years ago

@yuvipanda Completely agree with you! I am treating the above points as hypotheses to explore against the long-term log data. My other hypothesis is that, apart from a few variations, this data should correlate highly with the long-term data (considering this is a snapshot from mid-semester). But I could be completely wrong about this.

Searching the hub logs, I see entries for all hubs that I am not able to make sense of. Should I interpret these as the admin access feature being widely used by instructors/GSIs across hubs over the past month, or are some of these entries configuration-based rather than the result of a user action? Check the log results here.

yuvipanda commented 2 years ago

@balajialg what log lines do you get when you access the admin page yourself? Basically, we need to look at that and derive regexes and filters from it. Some post-processing may also be needed.

Everything with `wp-admin` or `webadmin` or similar is bots trying to exploit our hub in case it's a WordPress instance or a similar piece of software with known vulnerabilities :D

yuvipanda commented 2 years ago

I think a basic process should be to:

  1. Do the thing you're trying to measure
  2. See what logs show up, if any.
  3. Be very careful in vetting the hypothesis that what you see in (2) always shows up when you do (1) but at no other time. This can be a bit difficult but definitely doable.
  4. Document the process as we go along so we don't lose track.
balajialg commented 2 years ago

@yuvipanda Looked at the hub logs by searching with the `textPayload=~"\s/hub/admin\s"` query. I observe that the resulting entries match the network requests made when a user accesses the admin portal in Datahub. It obviously requires a bit of post-processing to draw meaningful insights, but I can see logs from multiple hubs like Datahub, Data 8, Data 100, etc.

Thanks for detailing the process! I am spending a lot of time fine-tuning the search queries (learning regex on the side) to ensure the results correspond only to the feature in question. It is time-intensive.
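For the post-processing step, a minimal sketch: group the matched /hub/admin lines by hub namespace and count. The `namespace=` prefix below is a hypothetical stand-in for however the exported entries carry `resource.labels.namespace_name`; the real export format would need its own parsing.

```python
import re
from collections import Counter

# Hypothetical exported-log shape: "namespace=<hub>-prod <request line>".
# Real Log Explorer exports carry the namespace in a structured field
# (resource.labels.namespace_name); this flat form is for illustration.
ENTRY_RE = re.compile(r"namespace=(?P<hub>\S+)\s+.*\s/hub/admin\b")

def admin_hits_per_hub(log_lines):
    """Count /hub/admin requests per hub namespace."""
    counts = Counter()
    for line in log_lines:
        m = ENTRY_RE.search(line)
        if m:
            counts[m.group("hub")] += 1
    return counts

sample = [
    "namespace=datahub-prod GET /hub/admin 200",
    "namespace=data8-prod GET /hub/admin 200",
    "namespace=datahub-prod GET /hub/admin 200",
    "namespace=data100-prod GET /hub/home 200",
]
print(admin_hits_per_hub(sample))
```

Per-hub counts like these would make it easy to compare admin-panel usage across Datahub, Data 8, Data 100, etc.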

Is the nginx log structure similar to the current logs, or would it require fine-tuning the search queries once more based on the resulting logs?