SiRumCz / CSC501

CSC501 assignments

Edge bundled diagram #80

Open jonhealy1 opened 4 years ago

jonhealy1 commented 4 years ago

This is a really cool visualization. Like with the first chord diagram, I am a little confused. What is this saying?

{"name":"flare.query.IsA","size":2039,"imports":["flare.query.Expression","flare.query.If"]},

'IsA' connects to 'Expression' and 'If', but what does the number 2039 mean in this context?

soroushysfi commented 4 years ago

The name field has two parts: the cluster and the node name. E.g. for flare.analytics.cluster.AgglomerativeCluster, flare.analytics.cluster is the cluster and AgglomerativeCluster is the name of the node. The cluster is used to place related nodes near each other: if you look closely, you can see that all the nodes whose names start with flare.analytics.cluster are located in one place. The imports are the outgoing edges from that specific node, so the imports of flare.analytics.cluster.AgglomerativeCluster are its outgoing edges.
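
As a small illustration (a sketch, not project code), the cluster is just everything before the last dot of the name field:

# Splitting a flare-style name into cluster and node name (illustration only).
full_name = "flare.analytics.cluster.AgglomerativeCluster"
cluster, node = full_name.rsplit(".", 1)
print(cluster)  # flare.analytics.cluster -> groups related nodes together
print(node)     # AgglomerativeCluster    -> the node's own label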

SiRumCz commented 4 years ago

What does the size mean? If it is the weight of a node, how is it perceived from the graph, though?

jonhealy1 commented 4 years ago

How are we going to assemble this data?

SiRumCz commented 4 years ago

> How are we going to assemble this data?

That is what I am trying to figure out right now.

First, we have the label propagation algorithm for community detection, which returns a table of community label, size, and subreddit list, something like this:

   label   size                                         subreddits
0      2  38672  [rddtgaming, xboxone, ps4, fitnesscirclejerk, ...
1  43708    856  [bar, jokesofthedadvariety, redpower, dumb, gl...
2  43709    218  [pocket_universe, shootmyshort, thestoryboard,...
3  43746     98  [thuglifeprotips, battlefieldloadouts, myrovia...
4  43704     82  [funnyfartstories, ishatmyself, friendzone, br...

What I am trying to do: I want to assign every Subreddit node the name of the community it belongs to, then keep only the nodes whose communities are among the top 5 or 10. From there, we could assemble the list of outgoing links as the imports. Perhaps the number of outgoing links could be the size, and the name could be s.community + '.' + s.id (a sketch of this assembly follows the table below).

Ultimately the result could be this:

name            size   imports
2.rddtgaming   37222   [2.xbox, 1232.sdasdada, 3123.dasdada, ...
47582.dsdsd      123   [...
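
A minimal sketch of that assembly step in py2neo, assuming a community property has already been written onto each Subreddit node (e.g. by LPA) and using a generic --> pattern since the relationship type is project-specific; the property and variable names here are illustrative:

from py2neo import Graph
import json

graph = Graph()  # assumes a local Neo4j instance with default credentials
TOP_N = 5        # keep only the largest TOP_N communities

# Find the TOP_N largest communities by member count.
top = [r["community"] for r in graph.run(
    "MATCH (s:Subreddit) "
    "RETURN s.community AS community, count(*) AS members "
    "ORDER BY members DESC LIMIT $n", n=TOP_N)]

# One record per node in those communities: name = community + '.' + id,
# size = out-degree, imports = outgoing neighbours within the same subset.
records = []
for r in graph.run(
        "MATCH (s:Subreddit) WHERE s.community IN $top "
        "OPTIONAL MATCH (s)-->(t:Subreddit) WHERE t.community IN $top "
        "RETURN toString(s.community) + '.' + s.id AS name, "
        "count(t) AS size, "
        "collect(toString(t.community) + '.' + t.id) AS imports", top=top):
    records.append({"name": r["name"], "size": r["size"], "imports": r["imports"]})

print(json.dumps(records[:2], indent=1))
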
jonhealy1 commented 4 years ago

The LPA algorithm Kevin did is super useful. What if we ran it on, say, the top 100 nodes overall and assembled communities from that result?

SiRumCz commented 4 years ago

@noonespecial009 is it possible to create a subset of the current Subreddit nodes?

jonhealy1 commented 4 years ago

There must be, yeah.

SiRumCz commented 4 years ago

Also, I think there are some duplicate nodes. When I run

MATCH (s)-->()
WHERE s.id = 'rddtgaming'
RETURN count(*)

I got 3 as the result, which suggests that there are three nodes with id rddtgaming.

jonhealy1 commented 4 years ago

This creates a subset of random nodes.

MATCH (:Document) WITH count(*) AS docCount
MATCH (doc:Document) WHERE rand() < 10.0/docCount
RETURN doc.title
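
The same pattern adapted to our Subreddit nodes could look like this (a sketch; assumes the py2neo setup from load_data and an id property on each node):

from py2neo import Graph

graph = Graph()  # the py2neo handle, as in load_data

# Sample roughly 10 random Subreddit nodes, mirroring the Document example above.
subset = graph.run(
    "MATCH (:Subreddit) WITH count(*) AS total "
    "MATCH (s:Subreddit) WHERE rand() < 10.0 / total "
    "RETURN s.id AS id"
).data()
print(subset)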

jonhealy1 commented 4 years ago

> Also, I think there are some duplicate nodes. When I run
>
> MATCH (s)-->()
> WHERE s.id = 'rddtgaming'
> RETURN count(*)
>
> I got 3 as the result, which suggests that there are three nodes with id rddtgaming.

I have this in load_data; I thought it would prevent that:

print(graph.run("CREATE CONSTRAINT ON (s:Subreddit) ASSERT s.id IS UNIQUE;"))
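
One quick diagnostic (a sketch, using the same py2neo handle) is to count the nodes directly rather than the pattern matches:

from py2neo import Graph

graph = Graph()  # the py2neo handle, as in load_data

# Count the nodes themselves; (s)-->() instead produces one row per outgoing
# relationship, so count(*) there counts edges rather than nodes.
n = graph.run(
    "MATCH (s:Subreddit {id: 'rddtgaming'}) RETURN count(s) AS n"
).evaluate()
print(n)  # 1 if the uniqueness constraint is doing its job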

SiRumCz commented 4 years ago

Do you get the same result in your database? If not, maybe I need to wipe my database and reload the data.

SiRumCz commented 4 years ago

> This creates a subset of random nodes.
>
> MATCH (:Document) WITH count(*) AS docCount
> MATCH (doc:Document) WHERE rand() < 10.0/docCount
> RETURN doc.title

I would also hope it could write the results back to the database. We could run either eigenvector centrality or PageRank to get the top 500 nodes, then run LPA on that 500-node subset to get their clusters.
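
A sketch of the first half of that pipeline, using the Graph Algorithms plugin linked later in this thread (the algo.* procedure names and config keys are assumptions based on the pre-GDS library and may differ by version):

from py2neo import Graph

graph = Graph()  # the py2neo handle, as in load_data

# Run PageRank over a Cypher-projected graph of Subreddit nodes and write each
# score back onto the node as a 'pagerank' property. The relationship pattern
# is left generic (-->) since the type name is project-specific.
graph.run(
    "CALL algo.pageRank("
    "'MATCH (s:Subreddit) RETURN id(s) AS id', "
    "'MATCH (a:Subreddit)-->(b:Subreddit) "
    "RETURN id(a) AS source, id(b) AS target', "
    "{graph: 'cypher', write: true, writeProperty: 'pagerank'})"
)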

jonhealy1 commented 4 years ago

> Do you get the same result in your database? If not, maybe I need to wipe my database and reload the data.

Yep. I got 3 too.

jonhealy1 commented 4 years ago

Using PageRank to find the top nodes is a great idea. We need to save the results back into the database. It shouldn't be hard, but I'm not exactly sure how off the top of my head.
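
With the scores written back (as in the PageRank sketch above), pulling out the top 500 should just be plain Cypher:

from py2neo import Graph

graph = Graph()  # the py2neo handle, as in load_data

# Read the 500 highest-scoring Subreddit nodes back out of the database.
top500 = graph.run(
    "MATCH (s:Subreddit) WHERE exists(s.pagerank) "
    "RETURN s.id AS id, s.pagerank AS score "
    "ORDER BY score DESC LIMIT 500"
).data()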

SiRumCz commented 4 years ago

> Do you get the same result in your database? If not, maybe I need to wipe my database and reload the data.
>
> Yep. I got 3 too.

I just figured it out: it is because I used (s)-->(), so the pattern matched three times, once per outgoing edge. If I run MATCH (s), the result is 1. You are correct, there shouldn't be any duplicates.

jonhealy1 commented 4 years ago

https://neo4j.com/docs/graph-algorithms/current/projected-graph-model/

I think this is what we need: projected graph models
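
A hedged sketch of how that could fit here: project only the top-500-by-pagerank subset with Cypher statements, then run label propagation over the projection (again assuming the pre-GDS algo.* procedures; config keys such as writeProperty vary between library versions):

from py2neo import Graph

graph = Graph()  # the py2neo handle, as in load_data

# Label propagation over a Cypher projection restricted to the 500 Subreddit
# nodes with the highest previously written 'pagerank' scores.
graph.run(
    "CALL algo.labelPropagation("
    "'MATCH (s:Subreddit) WHERE exists(s.pagerank) "
    "WITH s ORDER BY s.pagerank DESC LIMIT 500 RETURN id(s) AS id', "
    "'MATCH (a:Subreddit)-->(b:Subreddit) "
    "RETURN id(a) AS source, id(b) AS target', "
    "{graph: 'cypher', write: true, writeProperty: 'community'})"
)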

soroushysfi commented 4 years ago

> What does the size mean? If it is the weight of a node, how is it perceived from the graph, though?

From what I gathered, it is the mean of all the outgoing edges; I couldn't find any other explanation.