SiRumCz / CSC501

CSC501 assignments
Clustering or downsampling #69

Open soroushysfi opened 4 years ago

soroushysfi commented 4 years ago

Clustering

To visualize the data, we should do some clustering or downsampling. We could use an algorithm such as K-means for clustering, or t-SNE (a machine learning algorithm for dimensionality reduction). We don't need to implement these ourselves; I found some libraries that we can import into our project and use.

K-means: http://benalexkeen.com/k-means-clustering-in-python/
t-SNE: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Downsampling

For downsampling, we can use one of these methods or a combination of them:

  1. Keep only the top N nodes (20, 200, ...) with the most connections
  2. Filter out nodes based on properties (average word length, number of characters, ...)
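The two downsampling options above could be sketched roughly as follows. This is only an illustration: it assumes the graph is available as a list of (source, target) pairs and that node properties live in a plain dict; the function names and the sample data are hypothetical, not taken from our dataset.

```python
from collections import Counter


def top_n_by_degree(edges, n):
    """Keep the n best-connected nodes and the edges between them."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Sort by degree (descending), then by name so ties are deterministic.
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    keep = set(ranked[:n])
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges


def filter_by_property(properties, key, minimum):
    """Return the nodes whose property `key` is at least `minimum`."""
    return {
        node
        for node, props in properties.items()
        if props.get(key, 0) >= minimum
    }


# Hypothetical sample data, only to show the call shape.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
nodes, kept = top_n_by_degree(edges, 3)
print(sorted(nodes), kept)
```

The two functions can be combined by filtering on a property first and then taking the top N of what remains.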

I will update this issue to indicate the data format I need for each visualization.

jonhealy1 commented 4 years ago

Sounds good.

soroushysfi commented 4 years ago

Edge Bundled

For the edge-bundled diagram, we need this sort of structure (the names are from a sample I found, not from the data we're using):

[
{"name":"flare.analytics.cluster.AgglomerativeCluster","size":3938,"imports":["flare.animate.Transitioner","flare.vis.data.DataList","flare.util.math.IMatrix","flare.analytics.cluster.MergeEdge","flare.analytics.cluster.HierarchicalCluster","flare.vis.data.Data"]},
{"name":"flare.analytics.cluster.CommunityStructure","size":3812,"imports":["flare.analytics.cluster.HierarchicalCluster","flare.animate.Transitioner","flare.vis.data.DataList","flare.analytics.cluster.MergeEdge","flare.util.math.IMatrix"]}
...
]

What they did here is prefix each node's name with its cluster, e.g., flare.analytics.cluster is the cluster and AgglomerativeCluster is the name of the node. The imports property lists the outgoing edges (we don't need the incoming edges for each node, because once all outgoing edges are included the incoming ones are already covered). Alternatively, a separate list giving the cluster of each node would be fine too; I would aggregate it myself.
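As a rough sketch of how our data might be put into that shape, assuming we have a list of directed (source, target) edges and a dict mapping each node to a cluster label. The function name, the cluster labels, and the default size of 1 are all hypothetical placeholders.

```python
import json


def to_edge_bundling(edges, cluster_of, size_of=None):
    """Build a list of {"name", "size", "imports"} records.

    Assumes every node that appears in `edges` has an entry in `cluster_of`.
    """
    size_of = size_of or {}

    def full_name(node):
        # Prefix the node with its cluster, as in the flare sample.
        return f"{cluster_of[node]}.{node}"

    imports = {node: [] for node in cluster_of}
    for source, target in edges:
        # Only outgoing edges are recorded on each node.
        imports[source].append(full_name(target))

    return [
        {"name": full_name(node), "size": size_of.get(node, 1), "imports": targets}
        for node, targets in imports.items()
    ]


# Hypothetical cluster labels for three subreddits.
cluster_of = {"pics": "media", "gifs": "media", "news": "current"}
edges = [("pics", "gifs"), ("news", "pics")]
print(json.dumps(to_edge_bundling(edges, cluster_of), indent=2))
```

Since the cluster and the node name are joined with a dot, cluster labels and node names that themselves contain dots would probably need to be escaped or renamed first.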

Chord Diagram

The chord diagram uses a simple matrix input, like a 2D array:

[
  [1, 2, 0, 4, 6],
  [4, 5, 3, 0, 1],
  [4, 5, 3, 0, 1],
  [4, 5, 3, 0, 1],
  [4, 5, 3, 0, 1],
]

This should be a square matrix. I also need a separate list with the name of each node, in the same order as the rows: ["node1", "node2", "node3", ...]
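One possible way to produce the matrix and the name list together, assuming the edges come as (source, target, weight) triples. The function name and the sample values are made up for illustration.

```python
def to_chord_matrix(weighted_edges):
    """Return (names, matrix) where matrix[i][j] is the total weight i -> j."""
    names = sorted({node for s, t, _ in weighted_edges for node in (s, t)})
    index = {name: i for i, name in enumerate(names)}
    # Square matrix with one row and one column per node.
    matrix = [[0] * len(names) for _ in names]
    for source, target, weight in weighted_edges:
        matrix[index[source]][index[target]] += weight
    return names, matrix


# Hypothetical sample edges.
names, matrix = to_chord_matrix(
    [("node1", "node2", 2), ("node2", "node1", 4), ("node1", "node3", 1)]
)
print(names)
for row in matrix:
    print(row)
```

Building both outputs from the same index keeps row i of the matrix aligned with names[i].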

Matrix and Node-Link Diagram

For both of these diagrams, a JSON object containing the nodes and links would do the job:

{
  "nodes": [
    {"name": "node1", "group": 0},
    {"name": "node2", "group": 0},
    {"name": "node3", "group": 0},
    ...
  ],
  "links": [
    {"source": "1", "target": "0", "value": 166},
    {"source": "2", "target": "0", "value": 181},
    {"source": "3", "target": "0", "value": 79},
    {"source": "4", "target": "0", "value": 3},
    ...
  ]
}

The group and value properties are optional. We can assign a property from the data, such as average word count, to the link value. If we have clustering results, we can use them as the groups.
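A minimal sketch of generating that nodes/links structure, again assuming (source, target, weight) triples as input. The function name is hypothetical, and the indices are written as strings only to mirror the sample above.

```python
import json


def to_node_link(weighted_edges, group_of=None):
    """Build {"nodes": [...], "links": [...]} with links pointing at node indices."""
    group_of = group_of or {}
    names = sorted({node for s, t, _ in weighted_edges for node in (s, t)})
    index = {name: i for i, name in enumerate(names)}
    nodes = [{"name": name, "group": group_of.get(name, 0)} for name in names]
    links = [
        {"source": str(index[s]), "target": str(index[t]), "value": weight}
        for s, t, weight in weighted_edges
    ]
    return {"nodes": nodes, "links": links}


graph = to_node_link([("node2", "node1", 166), ("node3", "node1", 181)])
print(json.dumps(graph, indent=2))
```

Passing a dict of cluster assignments as group_of would fill in the group field; nodes without an entry fall back to group 0.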

jonhealy1 commented 4 years ago

Is there a link to this information?

soroushysfi commented 4 years ago

Yes. This is generally how d3.js works: it needs the data in a specific format to visualize it. The formats above come from the d3.js examples for each diagram type: Edge Bundled, Chord Diagram, Matrix, and Node-Link.

SiRumCz commented 4 years ago

I will set up the Flask server tomorrow.

jonhealy1 commented 4 years ago

For the edge-bundled graph, I have no idea how we can put our data in that format. Could you try running a mock-up with some sample data that looks like ours?

I'm running a PageRank-type algorithm, and it has taken about half an hour so far.

jonhealy1 commented 4 years ago

--- Eigenvector Centrality ---
       subreddit       score
0           iama  321.911086
1      askreddit  311.323265
2           pics  263.799529
3          funny  258.084657
4         videos  253.007776
5  todayilearned  231.063511
6      worldnews  184.428278
7         gaming  182.708345
8           news  180.482415
9           gifs  165.655387

SiRumCz commented 4 years ago

Wow, that is impressive. Could you explain a little more about what those scores represent?

jonhealy1 commented 4 years ago

Eigenvector Centrality is an algorithm that measures the transitive influence or connectivity of nodes.

Relationships to high-scoring nodes contribute more to the score of a node than connections to low-scoring nodes. A high score means that a node is connected to other nodes that have high scores.

https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/eigenvector-centrality/
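To make the idea concrete, here is a small pure-Python sketch of eigenvector centrality by power iteration on an undirected graph. It is not the Neo4j implementation and will not reproduce the scores above (the scaling differs); the function name, the toy graph, and the use of a shifted iteration (x + Ax, which has the same leading eigenvector but avoids oscillation on bipartite graphs) are my own assumptions.

```python
import math


def eigenvector_centrality(edges, iterations=100, tolerance=1e-9):
    """Approximate eigenvector centrality for an undirected edge list."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    scores = {node: 1.0 for node in neighbors}
    for _ in range(iterations):
        # Each node's new score is its own score plus its neighbors' scores,
        # so links to high-scoring nodes contribute more.
        new = {
            node: scores[node] + sum(scores[n] for n in neighbors[node])
            for node in neighbors
        }
        # Rescale to unit length so the values do not grow without bound.
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {node: v / norm for node, v in new.items()}
        change = max(abs(new[node] - scores[node]) for node in new)
        scores = new
        if change < tolerance:
            break
    return scores


# Toy graph: "hub" touches every node, "c" only touches the hub.
toy = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
for node, score in sorted(eigenvector_centrality(toy).items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

In the toy graph the hub ranks first and the node attached only to the hub ranks last, which matches the description: a node scores highly when its neighbors score highly.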

jonhealy1 commented 4 years ago

I'm just learning about it now.

soroushysfi commented 4 years ago

Cool! Are the strings the node names? What do the numbers represent?

jonhealy1 commented 4 years ago

Eigenvector Centrality is an algorithm that measures the transitive influence or connectivity of nodes. Relationships to high-scoring nodes contribute more to the score of a node than connections to low-scoring nodes. A high score means that a node is connected to other nodes that have high scores.

https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/eigenvector-centrality/