Closed: choldgraf closed this issue 8 years ago
Just had a potential idea for this project. On top of using co-occurrence to build graphs, we could define similarities between departments based on their seasonal purchasing. E.g., if I plot the average number of POs per department for each week of the year, you could get a plot like this.
X-axis is time, Y-axis is dept [image: https://cloud.githubusercontent.com/assets/1839645/6670878/b061af00-cbbe-11e4-8569-d5d2b53a9f8d.png]
Then, you can build this into a correlation matrix like this: [image: https://cloud.githubusercontent.com/assets/1839645/6670895/d185902a-cbbe-11e4-89a8-e15e9ffb99be.png]
Then you could use the above plot to create edges between departments and see what falls out (there already looks to be some clear clustering here).
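For anyone who wants to try this, here's a rough pandas sketch of the dept-by-week matrix and the correlation step. The column names (`department`, `creation_time`) and the random rows are stand-ins for the real PO table, not its actual schema:

```python
import numpy as np
import pandas as pd

# Stand-in PO table: one row per purchase order (names/dates are made up).
rng = np.random.default_rng(0)
pos = pd.DataFrame({
    "department": rng.choice(["Chemistry", "Physics", "Library"], size=1000),
    "creation_time": pd.to_datetime("2012-01-01")
                     + pd.to_timedelta(rng.integers(0, 365, size=1000), unit="D"),
})

# Weekly PO counts per department -> a dept x week matrix.
pos["week"] = pos["creation_time"].dt.isocalendar().week
weekly = pos.groupby(["department", "week"]).size().unstack(fill_value=0)

# Correlate departments on their seasonal profiles (dept x dept matrix).
corr = weekly.T.corr()
print(corr.round(2))
```

The `corr` matrix is exactly what would feed the edge weights between departments.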
Just a thought
I think this is a great idea!
Anthony
Can you show the legend of the color code for the first graph? (For the second one, the color code is the correlation.) And what does the bar chart with the lines on top represent?
It is interesting to see that there is high correlation in the middle of the matrix. Maybe we could coordinate the seasonal buying among those departments.
Ah, good point @kaiweitan: the color code is actually relatively arbitrary. Matplotlib chooses the colors to accentuate the differences in the data, so in this case "white" isn't necessarily 0. When we make a final output of this, we'll make sure to get the colors right.
For the second big correlation matrix, it's currently sorted according to the clustering trees that you see on the margins. We could define "cuts" of those trees as clusters, though doing this is a bit of a dark art. Definitely worth looking into.
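A sketch of what cutting the tree could look like with scipy's hierarchical clustering. The toy profiles and the cut height of 0.7 are made up; picking the real threshold is exactly the dark art mentioned above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy seasonal profiles for 6 hypothetical departments (52 weekly counts each).
rng = np.random.default_rng(1)
profiles = rng.random((6, 52))
corr = np.corrcoef(profiles)

# Turn correlation into a distance, build the tree, then "cut" it.
dist = 1 - corr
condensed = dist[np.triu_indices(6, k=1)]      # condensed distance vector
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=0.7, criterion="distance")  # cut at height 0.7
print(labels)                                  # cluster id per department
```

Varying `t` (or using `criterion="maxclust"` with a target cluster count) gives different cuts of the same dendrogram, which is a cheap way to explore how many clusters the matrix really supports.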
Quick update on the network analysis. I managed to subset the data by months, and the graph now looks much cleaner. However, it still doesn't mean much without labels. My next step is to color-code suppliers vs. departments and to add labels. [image: 3months https://cloud.githubusercontent.com/assets/7124729/6738328/8998aa46-ce2e-11e4-81d0-3c7f93f86793.png]
Very cool - we could choose different uses for colors. E.g., color-code by manufacturer or supplier ID rather than by category. That way we could see which organizations are persistently connected to others across time.
hi nick, nice. can you explain a bit how you subsetted by months? the dataset you have seems small; do you mean it is for only one month? if so, which one?
one way to clean up the network graph is to remove nodes with centrality = 0, or with only one tie; these ties are not of interest anyway and can be dropped
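a quick sketch of that pruning step in networkx (node names are made up; degree <= 1 covers both the isolates and the one-tie nodes in a single pass):

```python
import networkx as nx

# Toy department-supplier graph (names are hypothetical).
G = nx.Graph()
G.add_edges_from([
    ("DeptA", "Supplier1"), ("DeptB", "Supplier1"),
    ("DeptA", "Supplier2"), ("DeptC", "Supplier3"),  # DeptC has one tie
])
G.add_node("DeptD")  # isolate: degree centrality = 0

# Drop isolates and one-tie nodes in one pass.
low = [n for n, d in G.degree() if d <= 1]
G.remove_nodes_from(low)
print(sorted(G.nodes()))  # -> ['DeptA', 'Supplier1']
```

note this is a single pass: removing one-tie nodes can create new one-tie nodes, so you could repeat it (or loop to a fixed point) if the graph is still too dense.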
darius
Hi Darius,
Basically I converted the creation_time variable into a datetime variable, which makes it easy to specify a range of dates. The graph covers the first three months of the current data (1/1/2012-3/1/2012).
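Roughly, something like this (the column names and toy rows are illustrative, not the real schema):

```python
import pandas as pd

# Stand-in PO data with creation_time stored as strings.
df = pd.DataFrame({
    "creation_time": ["2012-01-15", "2012-02-20", "2012-06-01"],
    "department": ["Chemistry", "Physics", "Library"],
})

# Convert to datetime, then slice a date range with a boolean mask.
df["creation_time"] = pd.to_datetime(df["creation_time"])
mask = (df["creation_time"] >= "2012-01-01") & (df["creation_time"] < "2012-03-01")
subset = df[mask]
print(len(subset))  # -> 2 (the Jan and Feb rows)
```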
hi nick, i see. i did the exact same thing for 2013. there is some issue with the dataset you are working with: the number of transactions is too low (it looks like there are only a few hundred transactions when there should be about a hundred thousand or more). darius
if you used drop_duplicates, there is a chance you may be throwing out too much data
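To illustrate the risk: if the subset passed to drop_duplicates doesn't include a true unique key, distinct transactions that happen to agree on those columns get collapsed into one. The column names here are hypothetical:

```python
import pandas as pd

# Two POs from the same department on the same day are NOT duplicates.
df = pd.DataFrame({
    "po_id": [101, 102, 103],
    "department": ["Chemistry", "Chemistry", "Physics"],
    "creation_time": ["2012-01-15", "2012-01-15", "2012-01-16"],
})

# Deduping on non-unique columns silently drops a real transaction.
naive = df.drop_duplicates(subset=["department", "creation_time"])
# Deduping on a true unique key keeps everything.
safe = df.drop_duplicates(subset=["po_id"])
print(len(naive), len(safe))  # -> 2 3
```

That kind of over-dropping could explain a few hundred rows surviving where there should be ~100k.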
hey @nlin3330, it looks like you worked on the color-coding stuff in a recent commit... do you have any interesting output / plots from that analysis, or is it still a work-in-progress?
hey guys, i am back online to work on the project, sorry again for the absence. my objective is to get the group some nice network graphs and some hard data by the end of the week. nick is away all week (he is out of the country), but i will be in touch with him on and off
here are some of the plans:
This issue is where we'll discuss the network analysis component. We can post graphs, code snippets, and brainstorms.
The network analysis project aims to find clusters of co-occurrence between departments, manufacturers, suppliers, product types, etc.
Project lead is @dariusmehri along with @nlin3330