berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Feature audit based on logs history #3267

Open balajialg opened 2 years ago

balajialg commented 2 years ago

Summary

Thanks to @yuvipanda's nudge, I started working on the feature matrix document to segment our instructors based on the type of features they use. My initial understanding is that we can classify instructors into three archetypes, based on whether they use foundational, intermediate, or complex features. I also spent some time mapping features to user archetypes in the doc. Open to the team's input on whether the classification makes sense.

I would also like the team's input on whether it is possible to retrieve usage metrics for a particular feature (thanks @ryanlovett for nudging me to think in this direction). Just like the Python popularity dashboard, which tracks the usage of Python libraries, is it possible for us to track the features most used by our instructors? We could use this information during semester onboarding to tailor the feature demo based on their prior usage.

Feature List

User Stories

Tasks

ryanlovett commented 2 years ago

I think the logs within the Google Cloud console should show the frequency of URLs like /tree, /rstudio, /lab, etc. They would also show nbgitpuller, syncthing, and desktop URLs. Basically, any feature that appears in the URL can be tracked.

However, course info is not in the URL, except in cases where it happens to appear in a git repo name. The latter doesn't follow any strict format, though; instructors can name their repos however they like.
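As a rough sketch of the idea above: once log lines are exported (for example with `gcloud logging read`), feature usage can be tallied by matching the URL paths mentioned. The sample log lines and the exact path list below are illustrative assumptions, not the actual DataHub log format.

```python
import re
from collections import Counter

# Hypothetical feature paths to look for in request URLs; the real list
# would come from the hub's actual endpoints.
FEATURE_PATTERNS = {
    "jupyterlab": re.compile(r"/lab\b"),
    "classic-notebook": re.compile(r"/tree\b"),
    "rstudio": re.compile(r"/rstudio\b"),
    "desktop": re.compile(r"/desktop\b"),
    "nbgitpuller": re.compile(r"/hub/user-redirect/git-pull\b"),
}

def count_features(log_lines):
    """Tally how often each feature's URL appears in the given log lines."""
    counts = Counter()
    for line in log_lines:
        for feature, pattern in FEATURE_PATTERNS.items():
            if pattern.search(line):
                counts[feature] += 1
    return counts

# Illustrative request lines (not real Datahub logs):
sample = [
    "GET /user/alice/lab 200",
    "GET /user/bob/rstudio 302",
    "GET /hub/user-redirect/git-pull?repo=... 302",
    "GET /user/carol/lab 200",
]
print(count_features(sample))
```

The same counting could feed a dashboard like the Python popularity one, just keyed on feature paths instead of library names.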

balajialg commented 2 years ago

@ryanlovett Amazing to know that feature-related metrics can be tracked. I can understand the complexity of retrieving course-related information. I will follow up on ways to retrieve feature-related information in the Google Cloud console.

balajialg commented 2 years ago

Next Steps from March Sprint Planning Meeting:

balajialg commented 2 years ago

@felder - When you have some free time, can you please let me know whether we can get analytics data to answer the questions below? It would help build a narrative around Datahub's value proposition.

If instructor-level data is not available, how would you like these questions to be framed so that we can get data that is closely relevant to what is being asked?

balajialg commented 2 years ago

@felder Sharing the context from our Slack conversation: instructor- and course-specific data cannot be retrieved from the GCP logs that are currently stored. It would require us to figure out another mechanism to fetch the data (possibly nbgitpuller links).

I will set up some time with you the week after to figure out the near-term scope of the data to be retrieved and options we can explore to answer the highlighted questions in the longer run.

balajialg commented 2 years ago

Qualitative Insights based on preliminary log analysis from the last 30 days:

ryanlovett commented 2 years ago

@balajialg Interesting, thanks! Do you know what admin is being used for? Is it just to view the list of people, or is it being used to stop/start servers too?

balajialg commented 2 years ago

@ryanlovett I searched

```
resource.type="k8s_container"
resource.labels.cluster_name="fall-2019"
resource.labels.namespace_name=~"-prod"
resource.labels.container_name="notebook"
textPayload=~"oauth2/authorize?"
```

in the Log Explorer to see how many users actively click the "access server" option to access other users' hub instances. It appears that almost all hub users noted in the above comment access other users' instances. Let me know whether querying for "oauth2/authorize?" in the text payload is the right way to identify users clicking the access-server option.

Here is the link to the GCP Log Explorer with the search query.

yuvipanda commented 2 years ago

@balajialg I'm not sure, but oauth2/authorize may also be used each time a user logs in, regardless of whether it's done with admin access or not.

Another way is to look at just the hub logs and look for uses of the admin panel there by URL, with `resource.labels.container_name="hub"`.
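A minimal sketch of that filtering idea, assuming plain-text request lines (the actual hub log format may differ): keep lines hitting /hub/admin while dropping the oauth2/authorize lines that, per the caveat above, may appear on every login.

```python
import re

# /hub/admin requests suggest admin-panel use; oauth2/authorize shows up
# on ordinary logins too, so it is excluded (assumption from the thread).
ADMIN_RE = re.compile(r"\s/hub/admin\b")
LOGIN_RE = re.compile(r"oauth2/authorize")

def admin_panel_lines(log_lines):
    """Return only the lines that look like admin-panel requests."""
    return [
        line for line in log_lines
        if ADMIN_RE.search(line) and not LOGIN_RE.search(line)
    ]

sample = [
    "GET /hub/admin 200",
    "GET /hub/api/oauth2/authorize?client_id=... 302",
    "GET /hub/home 200",
]
print(admin_panel_lines(sample))  # only the /hub/admin line survives
```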

yuvipanda commented 2 years ago

The other point is that these are logs from only the last 30 days, so we can't make inferences about longer-term usage patterns. We could start saving the nginx logs too, though, and make that happen.

balajialg commented 2 years ago

@yuvipanda Completely agree with you! I am treating the above points as hypotheses to explore against the long-term log data. My other hypothesis is that, apart from a few variations, this data should correlate highly with the long-term data (considering this is a snapshot from mid-semester). But I could be completely wrong about this.

Searching the hub logs, I see entries for all hubs that I am not able to make sense of. Should I interpret these as the admin access feature being widely used by instructors/GSIs across hubs over the past month, or are some of these entries configuration-based rather than the result of a user action? Check the log results here.

yuvipanda commented 2 years ago

@balajialg what log lines do you get when you access the admin page yourself? Basically, we need to look at that and derive regexes and filters from it. Some post-processing may also be needed.

Everything with `wp-admin` or `webadmin` or similar is bots trying to exploit our hub in case it's a WordPress instance or a similar piece of software with known vulnerabilities :D

yuvipanda commented 2 years ago

I think a basic process should be to:

  1. Do the thing you're trying to measure
  2. See what logs show up, if any.
  3. Be very careful in vetting the hypothesis that what you see in (2) always shows up when you do (1) but at no other time. This can be a bit difficult but definitely doable.
  4. Document the process as we go along so we don't lose track.
balajialg commented 2 years ago

@yuvipanda Looked at the hub logs by searching with the `textPayload=~"\s/hub/admin\s"` query. I observe that the resulting entries match the network requests made when a user accesses the admin portal in Datahub. It obviously requires a bit of post-processing to draw meaningful insights, but I can see logs from multiple hubs like Datahub, Data 8, Data 100, etc.

Thanks for detailing the process! I am spending a lot of time fine-tuning the search queries (learning regex on the side) to ensure the results correspond only to the feature in question. It is time-intensive.
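For the post-processing step, a minimal sketch: group the matched /hub/admin lines by hub namespace and count. The `namespace=` prefix below is a hypothetical stand-in for however the exported entries carry `resource.labels.namespace_name`; the real export format would need its own parsing.

```python
import re
from collections import Counter

# Hypothetical exported-log shape: "namespace=<hub>-prod <request line>".
# Real Log Explorer exports carry the namespace in a structured field
# (resource.labels.namespace_name); this flat form is for illustration.
ENTRY_RE = re.compile(r"namespace=(?P<hub>\S+)\s+.*\s/hub/admin\b")

def admin_hits_per_hub(log_lines):
    """Count /hub/admin requests per hub namespace."""
    counts = Counter()
    for line in log_lines:
        m = ENTRY_RE.search(line)
        if m:
            counts[m.group("hub")] += 1
    return counts

sample = [
    "namespace=datahub-prod GET /hub/admin 200",
    "namespace=data8-prod GET /hub/admin 200",
    "namespace=datahub-prod GET /hub/admin 200",
    "namespace=data100-prod GET /hub/home 200",
]
print(admin_hits_per_hub(sample))
```

Per-hub counts like these would make it easy to compare admin-panel usage across Datahub, Data 8, Data 100, etc.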

Is the nginx log structure similar to the current logs, or would it require fine-tuning the search queries once more based on the resulting logs?