Open jonhealy1 opened 4 years ago
The name has two parts: the cluster and the node name. For example, in flare.analytics.cluster.AgglomerativeCluster, flare.analytics.cluster is the cluster and AgglomerativeCluster is the name of the node. The cluster is used to place related nodes near each other: if you look closely, all the nodes whose names start with flare.analytics.cluster are located in one place. Imports are the targets of a node's outgoing edges, so the imports listed for flare.analytics.cluster.AgglomerativeCluster are that node's outgoing edges.
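The naming convention described above can be sketched in a few lines of Python. This is only an illustration, assuming the cluster is everything before the last dot; the helper name `split_node_name` is hypothetical and not part of the visualization code.

```python
def split_node_name(full_name):
    """Split a dotted node name into (cluster, name) at the last dot."""
    cluster, _, name = full_name.rpartition(".")
    return cluster, name


cluster, name = split_node_name("flare.analytics.cluster.AgglomerativeCluster")
print(cluster)  # flare.analytics.cluster
print(name)     # AgglomerativeCluster
```

The same split would apply to names built as community plus id, such as 2.rddtgaming.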
what does the size mean? If it is the weight of a node, how is it perceived from the graph tho?
How are we going to assemble this data?
That is what I am trying to figure out right now.
First we have the Label propagation algorithm for community detection
which returns something like this
community_name size subreddit_list:
   label   size  subreddits
0      2  38672  [rddtgaming, xboxone, ps4, fitnesscirclejerk, ...
1  43708    856  [bar, jokesofthedadvariety, redpower, dumb, gl...
2  43709    218  [pocket_universe, shootmyshort, thestoryboard,...
3  43746     98  [thuglifeprotips, battlefieldloadouts, myrovia...
4  43704     82  [funnyfartstories, ishatmyself, friendzone, br...
What I am trying to do is set, on every Subreddit node, the name of the community it belongs to, and then keep only the nodes whose communities are among the top 5 or 10 communities. From there, we could assemble the list of outgoing links as the imports, perhaps use the number of outgoing links as the size, and build the name as s.community+'.'+s.id.
Ultimately the result could be this:
name          size   imports
2.reddtgaing  37222  [2.xbox, 1232.sdasdada, 3123.dasdada, ...
47582.dsdsd   123    [...
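A rough sketch of how that table could be assembled, in plain Python rather than Cypher. Everything here is an assumption for illustration: the function name `build_records`, the input shapes (a dict of community labels and a list of edge pairs), and the choice to count only links whose target also survives the top-community filter.

```python
from collections import Counter


def build_records(community_of, edges, top_n=5):
    """Build {name, size, imports} records for nodes in the top_n largest communities.

    community_of: dict mapping subreddit id -> community label (e.g. from LPA)
    edges: iterable of (source_id, target_id) outgoing links
    """
    # Rank communities by member count and keep the largest top_n.
    counts = Counter(community_of.values())
    keep = {label for label, _ in counts.most_common(top_n)}

    def qualified(node_id):
        # Mirrors s.community + '.' + s.id from the discussion.
        return f"{community_of[node_id]}.{node_id}"

    imports = {n: [] for n, c in community_of.items() if c in keep}
    for src, dst in edges:
        # Keep a link only when both endpoints survive the filter.
        if src in imports and dst in imports:
            imports[src].append(qualified(dst))

    return [
        {"name": qualified(n), "size": len(targets), "imports": targets}
        for n, targets in imports.items()
    ]


# Made-up sample data, only to show the shape of the output.
community_of = {"xboxone": 2, "ps4": 2, "rddtgaming": 2,
                "bar": 43708, "dumb": 43708, "lonely": 9}
edges = [("rddtgaming", "xboxone"), ("rddtgaming", "bar"),
         ("bar", "lonely"), ("lonely", "ps4")]
records = build_records(community_of, edges, top_n=2)
```

Dropping links that point outside the kept communities avoids imports that reference names missing from the output; whether size should count all outgoing links instead is still an open choice.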
The lpa algorithm Kevin did is super useful. What if we ran it on say the top 100 nodes overall and assembled communities from that result?
@noonespecial009 is it possible to create a subset of the current Subreddit nodes?
There must be yea.
Also, I think there are some duplicates among the nodes. When I run
MATCH (s)-->()
WHERE s.id = 'rddtgaming'
RETURN count(*)
I get 3 as the result, which suggests that there are three nodes with the id rddtgaming.
This creates a subset of random nodes.
MATCH (:Document)
WITH count(*) AS docCount
MATCH (doc:Document)
WHERE rand() < 10.0 / docCount
RETURN doc.title
I have this in load_data; I thought it would prevent duplicate nodes?
print(graph.run("CREATE CONSTRAINT ON (s:Subreddit) ASSERT s.id IS UNIQUE;"))
do you get the same result in your database? If not, maybe I need to wipe my database and reload the data again.
I would also hope the subset query could write its results back to the database. We can run either eigenvector or pagerank to get the top 500 nodes, then run lpa on this 500-node subset to get their clusters.
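To make the proposed pipeline concrete, here is a small plain-Python stand-in for the ranking and subsetting steps. It is a sketch only: in the project these steps would presumably run inside Neo4j, and the function names `pagerank` and `top_k_subgraph`, the damping factor, and the iteration count are all assumptions. Label propagation would then run on the returned edges.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def top_k_subgraph(edges, k):
    """Return the k highest-ranked nodes and the edges among them."""
    rank = pagerank(edges)
    top = set(sorted(rank, key=rank.get, reverse=True)[:k])
    kept = [(s, d) for s, d in edges if s in top and d in top]
    return top, kept


# Made-up toy graph: three nodes link to "hub", which links back to "a".
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
top_nodes, kept_edges = top_k_subgraph(toy_edges, k=2)
```

With k set to 500 and the real Subreddit links as input, `kept_edges` would be the subset to cluster.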
do you get the same result in your database? If not, maybe I need to wipe my database and reload the data again.
Yep. I got 3 too.
Using pagerank to find the top nodes is a great idea. We need to save the results back into the database. It shouldn't be hard but I'm not exactly sure off the top of my head.
I just figured it out: because I used (s)-->(), the query returned one row per outgoing relationship, so the same node was counted three times. If I run MATCH (s) instead, the result is 1. You are correct, there shouldn't be any duplicates.
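The difference between the two counts can be mimicked with a toy example in plain Python. The node and relationship lists below are made up for illustration; only the id rddtgaming comes from the thread.

```python
# Toy graph: one 'rddtgaming' node with three outgoing relationships.
nodes = [{"id": "rddtgaming"}, {"id": "xboxone"}, {"id": "ps4"}, {"id": "gaming"}]
relationships = [
    ("rddtgaming", "xboxone"),
    ("rddtgaming", "ps4"),
    ("rddtgaming", "gaming"),
]

# MATCH (s)-->() WHERE s.id = 'rddtgaming' RETURN count(*)
# yields one row per matching relationship.
rows_per_relationship = sum(1 for src, _ in relationships if src == "rddtgaming")

# MATCH (s) WHERE s.id = 'rddtgaming' RETURN count(*)
# yields one row per matching node.
rows_per_node = sum(1 for node in nodes if node["id"] == "rddtgaming")

print(rows_per_relationship, rows_per_node)  # 3 1
```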
https://neo4j.com/docs/graph-algorithms/current/projected-graph-model/
I think this is what we need: projected graph models
what does the size mean? If it is the weight of a node, how is it perceived from the graph tho?
From what I gather, it is a mean of all the outgoing edges. I couldn't find any other explanation.
This is a really cool visualization. As with the first chord diagram, I am a little confused. What is this saying?
{"name":"flare.query.IsA","size":2039,"imports":["flare.query.Expression","flare.query.If"]},
'IsA' connects to 'Expression' and 'If', but what does the number 2039 mean in this context?