Representations and queries

soroushysfi commented 5 years ago

I've come up with these five representations so far. I added sample data to some of them to show what the final JSON would look like. It would be great if you could add your ideas and see what else we could derive from this data.

Show movie genres with a node-link diagram (genres would be big nodes, and movies would be small nodes attached to them) => list of movies with their genres. Data sample:
```
[
{title: Toy Story, Genres:[Adventure, Animation, Children, Comedy, Fantasy]},
{title: Jumanji, Genres:[Adventure, Children, Fantasy]},
…
]
```
I would probably also want the list of all genres: [Adventure, Children, Fantasy, Comedy, ...]
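A minimal sketch of how both shapes (movies with their genres, plus the list of all genres) could be produced. It assumes a MovieLens-style `movies.csv` with `movieId,title,genres` columns and pipe-separated genres; the function names `movies_with_genres` and `all_genres` are hypothetical, and the two CSV rows are sample data only.

```python
import csv
import io

# Assumed input: MovieLens-style rows with pipe-separated genres.
# These two rows are illustrative sample data, not real query output.
SAMPLE_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
"""


def movies_with_genres(csv_text):
    """Return [{'title': ..., 'genres': [...]}, ...] for the node-link diagram."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"title": row["title"], "genres": row["genres"].split("|")}
        for row in reader
    ]


def all_genres(movies):
    """Return the sorted list of distinct genres across all movies."""
    return sorted({genre for movie in movies for genre in movie["genres"]})


movies = movies_with_genres(SAMPLE_CSV)
genres = all_genres(movies)
```

Returning the genre list separately lets the front end create the big nodes first and then attach each movie to them.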
Show a bar chart of how many movies we have at each rating. Data would look like:
```
[{title: 1, value: 243},
{title: 2, value: 187},
...]
```
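One possible way to build this shape, as a sketch only: it assumes we count individual rating values (one entry per row of the ratings table) rather than movies bucketed by average rating, which the description leaves open. `rating_histogram` is a hypothetical helper name and the input list is made up.

```python
from collections import Counter


def rating_histogram(ratings):
    """Count how often each rating value occurs, sorted by rating value."""
    counts = Counter(ratings)
    return [{"title": star, "value": counts[star]} for star in sorted(counts)]


# Made-up rating values, for illustration only.
sample_ratings = [5, 3, 4, 5, 1, 3, 5]
histogram = rating_histogram(sample_ratings)
```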
Compare movies from different genres in 2010-2015 with those in 2015-2019.
Compare movie ratings across years (for example, compare 5-star ratings per year to see which year had the most 5-star movies and which had the least). Data sample:
```
[
  { year: 2010,
    ratings: [ {title: 1, value: 243}, {title: 2, value: 187} ] },
  { year: 2011,
    ratings: […] },
  …
]
```
Genre distribution across years (for example, find how many horror movies we had in 2015 and compare that with 2018).

[
  {
    values: [
      {year: '2015', count: 8, title: "Comedy"},
      {year: '2016', count: 4, title: "Comedy"},
      {year: '2017', count: 3, title: "Comedy"},
      {year: '2018', count: 3, title: "Comedy"}
    ]
  },
  {
    values: [
      {year: '2015', count: 4, title: "Horror"},
      {year: '2016', count: 6, title: "Horror"},
      {year: '2017', count: 3, title: "Horror"},
      {year: '2018', count: 7, title: "Horror"}
    ]
  },
  ...
]
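A rough sketch of how the per-genre series above could be assembled. It assumes each movie already carries a `year` field (for MovieLens this would have to be extracted first, for example from the title) and a `genres` list; `genre_counts_by_year` is a hypothetical name and the three input movies are invented.

```python
from collections import Counter


def genre_counts_by_year(movies):
    """Return one {'values': [...]} series per genre, with years ascending."""
    counts = Counter(
        (genre, movie["year"]) for movie in movies for genre in movie["genres"]
    )
    series = {}
    for (genre, year), count in sorted(counts.items()):
        series.setdefault(genre, []).append(
            {"year": str(year), "count": count, "title": genre}
        )
    return [{"values": values} for values in series.values()]


# Invented movies, for illustration only.
sample_movies = [
    {"year": 2015, "genres": ["Comedy", "Horror"]},
    {"year": 2015, "genres": ["Comedy"]},
    {"year": 2018, "genres": ["Horror"]},
]
result = genre_counts_by_year(sample_movies)
```

A movie with several genres is counted once per genre here, so the per-year totals can exceed the number of movies.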

superliuxz commented 5 years ago

Lovely. Will take a look later. Thanks.

superliuxz commented 5 years ago

So I think these cover the basic stuff (we are querying a single table only).

One more thing I can think of is a word cloud on the tag column of the tags table, for example.
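A word cloud mostly needs tag frequencies, so a sketch of that step might look like the following. The lowercasing and whitespace trimming are assumptions about how tags should be merged, and the `text`/`value` keys and the `tag_frequencies` name are hypothetical; the front-end library may expect something different.

```python
from collections import Counter


def tag_frequencies(tags, top_n=50):
    """Return the top_n most frequent tags after trimming and lowercasing."""
    counts = Counter(tag.strip().lower() for tag in tags if tag.strip())
    return [{"text": tag, "value": n} for tag, n in counts.most_common(top_n)]


# Made-up tags, for illustration only.
sample_tags = ["Pixar", "pixar", "funny", " time travel ", "funny", "pixar"]
cloud = tag_frequencies(sample_tags, top_n=2)
```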

More query ideas (that take two or more tables through a join):

Highest (average) rated movies (top 20) of all time (plotting them with respect to time is hard on the front end; the query, however, is simple).
Average rating per genre over the years (the year the movie was produced).
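As a sketch of the first join query, the snippet below uses an in-memory SQLite database purely to stay self-contained; the thread does not say which database the project uses. The table and column names follow the MovieLens CSV headers, the rows are invented, and the minimum-ratings threshold is an added assumption so that a movie with a single 5-star rating does not top the list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL);
    -- Invented rows, for illustration only.
    INSERT INTO movies VALUES (1, 'Toy Story (1995)'), (2, 'Jumanji (1995)');
    INSERT INTO ratings VALUES (1, 1, 5.0), (2, 1, 4.0), (1, 2, 3.0), (2, 2, 3.5);
""")

# Join ratings to movies, average per movie, keep the best-rated ones.
TOP_RATED_SQL = """
    SELECT m.title, AVG(r.rating) AS avg_rating, COUNT(*) AS num_ratings
    FROM ratings AS r
    JOIN movies AS m ON m.movieId = r.movieId
    GROUP BY m.movieId, m.title
    HAVING COUNT(*) >= ?
    ORDER BY avg_rating DESC
    LIMIT ?
"""

# Parameters: minimum number of ratings (assumed), then the top-N limit.
top_rated = conn.execute(TOP_RATED_SQL, (2, 20)).fetchall()
```

The per-genre, per-year average would follow the same pattern with the genre and release year added to the `GROUP BY`, once those columns exist in a queryable form.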

I also want to have more user information (gender, occupation, etc.), but it is not provided.

I can also scrape more info using the links table (IMDB) but not sure if that's needed right now.

SiRumCz commented 5 years ago

sorry guys, I will be working on these Saturday.

soroushysfi commented 5 years ago

I didn't see any data like gender and occupation in the data we're working on. Is it provided?

SiRumCz commented 5 years ago

Recommender: Train a simple recommendation system using simple machine learning techniques such as naive Bayes, so that we can predict the rating of any upcoming movie given its set of genres.
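To make the idea concrete, here is a small sketch of a Bernoulli naive Bayes classifier over genre presence, written from scratch with Laplace smoothing. It is one possible reading of the proposal, not the team's implementation: the class name `GenreNaiveBayes`, the use of whole-number rating labels, and the four training rows are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class GenreNaiveBayes:
    """Bernoulli naive Bayes over genre presence/absence, Laplace-smoothed."""

    def fit(self, samples):
        # samples: iterable of (set_of_genres, rating_label) pairs
        samples = list(samples)
        self.vocab = sorted({g for genres, _ in samples for g in genres})
        self.class_counts = Counter(label for _, label in samples)
        self.genre_counts = defaultdict(Counter)
        for genres, label in samples:
            self.genre_counts[label].update(set(genres))
        self.total = len(samples)
        return self

    def predict(self, genres):
        genres = set(genres)
        best_label, best_score = None, -math.inf
        for label, n in sorted(self.class_counts.items()):
            score = math.log(n / self.total)  # log prior of the rating class
            for g in self.vocab:
                # Smoothed P(genre present | label), never exactly 0 or 1.
                p = (self.genre_counts[label][g] + 1) / (n + 2)
                score += math.log(p if g in genres else 1 - p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data: (genres of a movie, its rounded rating).
train = [
    ({"Animation", "Children"}, 4),
    ({"Animation", "Comedy"}, 4),
    ({"Horror"}, 2),
    ({"Horror", "Thriller"}, 2),
]
model = GenreNaiveBayes().fit(train)
prediction = model.predict({"Animation"})
```

Genres never seen in training are ignored at prediction time, since they are not in the vocabulary.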

SiRumCz commented 5 years ago

@soroushysfi If I write the API for the list of movies with their genres, will you be able to come up with the node-link diagram by Monday?

soroushysfi commented 5 years ago

@SiRumCz Yeah! I could do it in a couple of hours.

SiRumCz commented 5 years ago

@soroushysfi I have pushed the function to the assn1-setup-webbackend-and-3-endpoints branch, since that is where the Python Flask server is set up.

superliuxz commented 5 years ago

> I didn't see any data like gender and occupation in the data we're working on. Is it provided?

It was provided: if you go to the MovieLens website, there is an outdated 1M dataset that has gender and occupation.

I honestly am not a big fan of this assignment. The requirements are pretty unclear (in a bad way: they do not even mention which dataset we should use; he also keeps mentioning gender and occupation, which do not even exist in the latest dataset).

superliuxz commented 5 years ago

> Showing movies genres with node link diagram. (Genres would be big nodes and movies would be little nodes attached to these big nodes) => list of movies containing their Genres. Data sample:
> [
> {title: Toy Story, Genres:[Adventure, Animation, Children, Comedy, Fantasy]},
> {title: Jumanji, Genres:[Adventure, Children, Fantasy]},
> …
> ]

@soroushysfi @SiRumCz I know we haven't done this one yet, but I think it's a very good example of a more advanced query (as it requires joining two tables). I am willing to work on it now.

@soroushysfi @SiRumCz I am not familiar with node-link diagrams, but I am thinking about something like this:

An API provides the following data:

[
  {
    "genres": "Children",
    "numMovies": 123456,
    "data":
      [
        {"movie": "Toy Story", "numRatings": 5678},
        {"movie": "Another Children Movie", "numRatings": 1234},
        {...},
        ...
      ]
  },
  {
    "genres": "Adventure",
    "numMovies": 654321,
    "data":
      [
        {"movie": "Some Adventure Movie", "numRatings": 7891},
        {"movie": "Another Adventure Movie", "numRatings": 1234},
        {...},
        ...
      ]
  },
  ...
]

So in plain English: an API that lists the top 5 most-rated movies under each genre. Then we can have the node-link diagram with genres being the master nodes and the movies being the slave nodes (like what @soroushysfi proposed), and the size of the nodes can be rescaled according to the "numMovies" and "numRatings" keys (I can even provide a rescaling factor in the API if you want).

Is it doable?

Like to hear your thoughts.
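A sketch of how the proposed response could be assembled, assuming the per-movie rating counts are already available in memory. The `scale` field is a hypothetical take on the rescaling factor mentioned above (each movie's `numRatings` relative to the most-rated movie in its genre), `top_rated_per_genre` is an invented name, and the input movies are made up.

```python
from collections import defaultdict


def top_rated_per_genre(movies, top_n=5):
    """Group movies by genre and keep the top_n with the most ratings."""
    by_genre = defaultdict(list)
    for movie in movies:
        for genre in movie["genres"]:
            by_genre[genre].append(movie)

    result = []
    for genre in sorted(by_genre):
        ranked = sorted(by_genre[genre], key=lambda m: m["numRatings"], reverse=True)
        top = ranked[:top_n]
        max_ratings = top[0]["numRatings"] or 1  # guard against division by zero
        result.append({
            "genres": genre,
            "numMovies": len(ranked),
            "data": [
                {
                    "movie": m["title"],
                    "numRatings": m["numRatings"],
                    "scale": m["numRatings"] / max_ratings,  # hypothetical field
                }
                for m in top
            ],
        })
    return result


# Made-up movies, for illustration only.
sample_movies = [
    {"title": "Toy Story", "genres": ["Children", "Adventure"], "numRatings": 5678},
    {"title": "Another Children Movie", "genres": ["Children"], "numRatings": 1234},
    {"title": "Some Adventure Movie", "genres": ["Adventure"], "numRatings": 7891},
    {"title": "Another Adventure Movie", "genres": ["Adventure"], "numRatings": 1234},
]
response = top_rated_per_genre(sample_movies, top_n=2)
```

Changing `top_n` is enough to switch between the top 5 and a longer list per genre.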

superliuxz commented 5 years ago

> More query ideas (that take two or more tables through a join):
>
> Highest (average) rated movies (top 20) of all time (plotting them with respect to time is hard on the front end; the query, however, is simple).
>
> Average rating per genre over the years (the year the movie was produced).

Personally, I would like to have these plots because I think we should have more advanced queries (not necessary for the demo, but for the assignment overall, as stated in the rubric).

soroushysfi commented 5 years ago

> > I didn't see any data like gender and occupation in the data we're working on. Is it provided?
>
> It was provided: if you go to the MovieLens website, there is an outdated 1M dataset that has gender and occupation.
>
> I honestly am not a big fan of this assignment. The requirements are pretty unclear (in a bad way: they do not even mention which dataset we should use; he also keeps mentioning gender and occupation, which do not even exist in the latest dataset).

Yeah, me neither! We have to give him feedback regarding the assignment. This is more like a project than an assignment, and it is pretty vague on some points.

superliuxz commented 5 years ago

> > > I didn't see any data like gender and occupation in the data we're working on. Is it provided?
> >
> > It was provided: if you go to the MovieLens website, there is an outdated 1M dataset that has gender and occupation. I honestly am not a big fan of this assignment. The requirements are pretty unclear (in a bad way: they do not even mention which dataset we should use; he also keeps mentioning gender and occupation, which do not even exist in the latest dataset).
>
> Yeah, me neither! We have to give him feedback regarding the assignment. This is more like a project than an assignment, and it is pretty vague on some points.

Yea, I am drafting an email right now asking Sean about how to retrieve the user info.

soroushysfi commented 5 years ago

> > Showing movies genres with node link diagram. (Genres would be big nodes and movies would be little nodes attached to these big nodes) => list of movies containing their Genres. Data sample:
> > [
> > {title: Toy Story, Genres:[Adventure, Animation, Children, Comedy, Fantasy]},
> > {title: Jumanji, Genres:[Adventure, Children, Fantasy]},
> > …
> > ]
>
> @soroushysfi @SiRumCz I know we haven't done this one yet, but I think it's a very good example of a more advanced query (as it requires joining two tables). I am willing to work on it now.
>
> @soroushysfi @SiRumCz I am not familiar with node-link diagrams, but I am thinking about something like this:
>
> An API provides the following data:
>
> [
>   {
>     "genres": "Children",
>     "numMovies": 123456,
>     "data":
>       [
>         {"movie": "Toy Story", "numRatings": 5678},
>         {"movie": "Another Children Movie", "numRatings": 1234},
>         {...},
>         ...
>       ]
>   },
>   {
>     "genres": "Adventure",
>     "numMovies": 654321,
>     "data":
>       [
>         {"movie": "Some Adventure Movie", "numRatings": 7891},
>         {"movie": "Another Adventure Movie", "numRatings": 1234},
>         {...},
>         ...
>       ]
>   },
>   ...
> ]
>
> So in plain English: an API that lists the top 5 most-rated movies under each genre. Then we can have the node-link diagram with genres being the master nodes and the movies being the slave nodes (like what @soroushysfi proposed), and the size of the nodes can be rescaled according to the "numMovies" and "numRatings" keys (I can even provide a rescaling factor in the API if you want).
>
> Is it doable?
>
> Like to hear your thoughts.

Yeah, this is possible. The response looks OK; we could even show the 10 most-rated movies in each genre. The scaling factor might be useful! I can also scale them in my code.

superliuxz commented 5 years ago

Here I want to write a summarizing comment with what we have so far. The listing order is the same as in the React App.

1. A bar chart showing the number of movies at different ratings.
2. A line plot showing the trend of each genre with respect to movie release year.
3. A bar chart showing the distribution of each genre with respect to the rating timestamp.
4. A word cloud summarizing the popular user-assigned tags.

What we are still working on:

A node-link diagram showing the most important links between any two genres. @SiRumCz is working on the API (GH-27), and @soroushysfi has drafted a prototype front end.
A bar chart showing the top 5 most-rated movies in each genre. @superliuxz has finished the API and PR GH-29 is open. No front-end plot yet, but it should be similar to plot # 3.
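For the genre-to-genre links, one plausible definition of "important" is how many movies share both genres; that is an assumption here, and the actual API in GH-27 may define it differently. The sketch below counts co-occurring genre pairs; `genre_links` and the `source`/`target`/`weight` keys are hypothetical names, and the two movies are sample data.

```python
from collections import Counter
from itertools import combinations


def genre_links(movies, top_n=10):
    """Return the top_n genre pairs ranked by how many movies have both genres."""
    pair_counts = Counter()
    for movie in movies:
        # Sort so each pair is always stored in the same (a, b) order.
        for a, b in combinations(sorted(set(movie["genres"])), 2):
            pair_counts[(a, b)] += 1
    return [
        {"source": a, "target": b, "weight": n}
        for (a, b), n in pair_counts.most_common(top_n)
    ]


# Sample movies, for illustration only.
sample_movies = [
    {"title": "Toy Story", "genres": ["Adventure", "Animation", "Children"]},
    {"title": "Jumanji", "genres": ["Adventure", "Children"]},
]
links = genre_links(sample_movies, top_n=1)
```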

Bugs:

GH-31, vertical axis labels
GH-32, word cloud is not showing up for 27M dataset.
GH-33, tooltips not showing for plot # 3.

@soroushysfi Unfortunately, all three bugs are associated with the front end, but please feel free to ask for help. Please let us know whether we can fix them or not; if not, we need to look for an alternative solution ASAP.

What I think can be added (simple stuff):

A simple table with some stats about the movies and users:

| # of movies | movie release years | # of users |
| --- | --- | --- |
| 1234 | 1900 - 2020 | 98765 |
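The three numbers in that table could come from something as small as the sketch below. It assumes movies carry an already-extracted `year` (possibly missing) and ratings carry a `userId`; `dataset_summary` and the result keys are hypothetical, and the inputs are made up.

```python
def dataset_summary(movies, ratings):
    """Return movie count, release-year range, and distinct user count."""
    years = [m["year"] for m in movies if m.get("year") is not None]
    return {
        "numMovies": len(movies),
        "releaseYears": (min(years), max(years)) if years else None,
        "numUsers": len({r["userId"] for r in ratings}),
    }


# Made-up rows, for illustration only.
sample_movies = [{"year": 1995}, {"year": 2010}, {"year": None}]
sample_ratings = [{"userId": 1}, {"userId": 1}, {"userId": 2}]
summary = dataset_summary(sample_movies, sample_ratings)
```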

Lastly, I think we should get everything done by Wednesday night and stop adding new code after that (because new code -> possible new bugs). Let's leave Thursday for recapping what we have and finishing the tech report.

SiRumCz / CSC501

Representations and queries #14