OpenCourseAPI / OwlAPI

An open source REST API written in Python to scrape and serve Foothill / De Anza course data :ledger:
https://opencourse.dev:3000
MIT License

Feature request - density plot of class times #57

Open fractalbach opened 6 years ago

fractalbach commented 6 years ago

Would be useful to visualize/quantify when most classes take place, and when most people are free.

A way to filter the most common classes as well.

Would be useful for planning the best times for events/workshops/etc. Would be able to see when people are most likely available without having to do surveys.

Graphs

Going to update this with more ideas as I think of them. It can be used later when making graphs.

Data Sets

X-axis

Y-axis

    for each value t in x_axis:
        classes_in_session[t] = {class ∈ (set of all classes) such that class[start] < t < class[end]}
        y[t] ← 0
        for each class in classes_in_session[t]:
            y[t] ← y[t] + class[students_enrolled]

Note that this does not account for students who didn't show up to class ;) 
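The pseudocode above could be sketched in Python roughly like this. The `Section` record and its field names (`start`, `end`, `students_enrolled`) are assumptions made for illustration, not the API's actual schema, and the sample data is made up. Times are minutes after midnight.

```python
from dataclasses import dataclass


@dataclass
class Section:
    # Hypothetical record; field names are assumptions, not the API's schema.
    start: int              # minutes after midnight
    end: int                # minutes after midnight
    students_enrolled: int


def students_in_class(sections, x_axis):
    """For each time t, sum enrollment over sections with start < t < end."""
    y = {}
    for t in x_axis:
        y[t] = sum(s.students_enrolled for s in sections if s.start < t < s.end)
    return y


# Made-up sample data: two overlapping sections.
sections = [
    Section(start=9 * 60, end=10 * 60, students_enrolled=30),
    Section(start=9 * 60 + 30, end=11 * 60, students_enrolled=25),
]
x_axis = range(8 * 60, 12 * 60, 15)  # every 15 minutes, 08:00 to 11:45
y = students_in_class(sections, x_axis)
```

The strict inequalities mirror the pseudocode, so a section does not count at the exact minute it starts or ends.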

# programs

There could be an **Updater**, which fetches the data from the API and saves it into a JSON file with a filename based on the date/time, in a folder called **cache**.

There could be another program called **data preprocessor**, which does any intermediate calculations. It outputs a new file with the data arranged so that it can be used directly by the grapher.

File_renamer could be a program that generates a list of all the cache and data filenames, which could be used by the HTML file to determine *where* all the data is.  This might be helpful because filenames would be based on date/times, and it will be hard to predict what the next one is called. (Alternatively, make the filenames easily predictable, like data1, data2, data3 ... , and save the timestamp in the content) 

The graph.html can then simply load the prepared data, and display it. 
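A minimal sketch of the caching and file-listing pieces, assuming the course data has already been fetched and is passed in as plain Python data. The function names, the `index.json` filename, and the timestamp format are all hypothetical choices, not anything defined by the API.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(data, cache_dir, now=None):
    """Write `data` as JSON into cache_dir under a timestamp-based filename."""
    now = now or datetime.now(timezone.utc)
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    path = cache / f"{now:%Y%m%dT%H%M%SZ}.json"
    # The timestamp is also stored in the content, as suggested above.
    path.write_text(json.dumps({"fetched_at": now.isoformat(), "courses": data}))
    return path


def write_index(cache_dir, index_name="index.json"):
    """List every snapshot filename so graph.html can discover them."""
    cache = Path(cache_dir)
    names = sorted(p.name for p in cache.glob("*.json") if p.name != index_name)
    (cache / index_name).write_text(json.dumps(names))
    return names
```

With an index file like this, the page only needs to know one fixed filename and can read the list of snapshots from it.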

### how to use

One possibility is to just have the updater call the data preprocessor directly.

./program



Externally, you would just call the program and give the directory names where you want to store the output files (the filenames will be automatically generated).
phi-line commented 6 years ago

This is a really good idea! I would love to see a plot of this as well.

You are welcome to make an app that uses the API to generate such a plot. Something like this extends the scope of the API itself, but you can use the batch endpoint to get all courses and generate a plot based on the time range.

fractalbach commented 6 years ago

ooo could even use the student count field as well. Which would mean you could also add another plot for the y-axis: number of students in class, or a % of students in class

fractalbach commented 6 years ago

https://plot.ly/javascript/bar-charts/

Could use stacked bar graph to show different classes https://plot.ly/javascript/bar-charts/#stacked-bar-chart

fractalbach commented 6 years ago

Would also be interesting to see the change in seats available over time while sign-ups are open

fractalbach commented 6 years ago

These values, Section Capacity and Section Actual, would definitely be needed. Is there currently an easy way to get these? [image: capture]

You can find it by following:

Alternatively, see what you can find at https://banssb.fhda.edu/ .... 😄 like this 😉

phi-line commented 6 years ago

I'm working hard to get those three values; right now the API only lists rem. The advanced data holds all 6 of these fields. I'm currently working on an issue to get them.

To answer your question, it is hard to get. It's an authenticated request that needs to be sent to MyPortal. To do that, I need to spoof a login and then scrape cookies to process the request for the data.

fractalbach commented 6 years ago

Follow link above with emojis

I was able to reach it on my phone (never have logged in) while in incognito mode.

phi-line commented 6 years ago

Hmm maybe I should be going through https://banssb.fhda.edu/ instead. I'll dig around - thanks for the tip :)

phi-line commented 6 years ago

This kind of density plot is known as a Kernel Density Estimator. They are a powerful alternative to a histogram since they do not rely on bins. The scikit-learn docs give this example image: [image]

A major problem with histograms, however, is that the choice of binning can have a disproportionate effect on the resulting visualization. Consider the upper-right panel of the above figure. It shows a histogram over the same data, with the bins shifted right. The results of the two visualizations look entirely different, and might lead to different interpretations of the data.

Intuitively, one can also think of a histogram as a stack of blocks, one block per point. By stacking the blocks in the appropriate grid space, we recover the histogram. But what if, instead of stacking the blocks on a regular grid, we center each block on the point it represents, and sum the total height at each location? This idea leads to the lower-left visualization. It is perhaps not as clean as a histogram, but the fact that the data drive the block locations mean that it is a much better representation of the underlying data.

This visualization is an example of a kernel density estimation, in this case with a top-hat kernel (i.e. a square block at each point). We can recover a smoother distribution by using a smoother kernel. The bottom-right plot shows a Gaussian kernel density estimate, in which each point contributes a Gaussian curve to the total. The result is a smooth density estimate which is derived from the data, and functions as a powerful non-parametric model of the distribution of points.
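To make the quoted description concrete, here is a small pure-Python sketch of the top-hat and Gaussian kernels. It is not the scikit-learn implementation; the function names, the bandwidth, and the sample of class start times are made up for illustration.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a function estimating density as an average of Gaussian bumps."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


def tophat_kde(points, bandwidth):
    """Same idea with a square block of half-width `bandwidth` on each point."""
    norm = 1.0 / (len(points) * 2 * bandwidth)

    def density(x):
        return norm * sum(1 for p in points if abs(x - p) <= bandwidth)

    return density


# Made-up sample: class start times in hours, clustered around 9:00 and 13:00.
starts = [8.5, 9.0, 9.0, 9.5, 10.0, 13.0, 13.5]
kde = gaussian_kde(starts, bandwidth=0.5)
grid = [6 + 0.1 * i for i in range(101)]  # 6.0 to 16.0 in steps of 0.1
curve = [kde(x) for x in grid]
```

Each point contributes one kernel centered on itself, so no bin edges are involved; the bandwidth plays the role that bin width plays in a histogram.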

In my experiment in music analysis - Oolong, I used a KDE to create a density plot of a combined dataset of 'feature' scatterplots. The resulting model looks like this, with the highest density of features shown in yellow and the lowest density of features in blue.

[image: mss-house_nearest]

Honestly, this method is way overkill for the use-case you described but I thought you would like to know more about density estimations :)

fractalbach commented 6 years ago

Yes, very pretty graphs.

In the case of students in classes we actually aren't taking a random sample because we have exact information.

However, if we are looking at the points over time, and we only have partial information (which we probably will), then some smoothing might be useful when we want to find "estimated % of students not in class at 3pm next Tuesday".

I think a bar graph would probably be the right thing to use, since a density graph implies we are taking a function of a random variable.

(An example of a random variable would be the number of students who actually showed up to class, and a random sample would be if we go around and record how many students are in the class, and compare to the number enrolled)

....

Although... if we sum all of the students together, and just plot the number of students who could be in class over time, divided into intervals... then arguably it would be a histogram, and doing the smoothing would make sense.

We can try both :D
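As a rough sketch of the smoothing option, assuming we already have a list of student counts per time interval (the numbers below are made up), a centered moving average is about the simplest thing to try. The function name and window size are hypothetical choices.

```python
def moving_average(counts, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo, hi = max(0, i - half), min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


counts = [0, 30, 55, 25, 25, 0]  # made-up students in class per interval
smooth = moving_average(counts, window=3)
```

Plotting `counts` as bars and `smooth` as a line over them would show both views on one graph.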

fractalbach commented 6 years ago

https://en.m.wikipedia.org/wiki/Probability_distribution?wprov=sfla1