Closed: choldgraf closed this issue 8 years ago
Just had a potential idea for this project. On top of using co-occurrence to build graphs, we could define similarities between departments based on their seasonal purchasing. E.g., if I plot the average number of POs per department for each week of the year, you could get a plot like this.
X-axis is time, Y-axis is dept [image: https://cloud.githubusercontent.com/assets/1839645/6670878/b061af00-cbbe-11e4-8569-d5d2b53a9f8d.png]
Then, you can build this into a correlation matrix like this: [image: https://cloud.githubusercontent.com/assets/1839645/6670895/d185902a-cbbe-11e4-89a8-e15e9ffb99be.png]
Then you could use the above plot to create edges between departments and see what falls out (there already looks to be some clear clustering here).
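For anyone who wants to try this, here's a rough pandas sketch of the dept-by-week matrix and the correlation step. The column names (`department`, `creation_time`) and the random rows are stand-ins for the real PO table, not its actual schema:

```python
import numpy as np
import pandas as pd

# Stand-in PO table: one row per purchase order (names/dates are made up).
rng = np.random.default_rng(0)
pos = pd.DataFrame({
    "department": rng.choice(["Chemistry", "Physics", "Library"], size=1000),
    "creation_time": pd.to_datetime("2012-01-01")
                     + pd.to_timedelta(rng.integers(0, 365, size=1000), unit="D"),
})

# Weekly PO counts per department -> a dept x week matrix.
pos["week"] = pos["creation_time"].dt.isocalendar().week
weekly = pos.groupby(["department", "week"]).size().unstack(fill_value=0)

# Correlate departments on their seasonal profiles (dept x dept matrix).
corr = weekly.T.corr()
print(corr.round(2))
```

The `corr` matrix is exactly what would feed the edge weights between departments.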
Just a thought
I think this is a great idea!
Anthony
Can you show the legend of the color code for the first graph? (For the second one, the color code is the correlation.) And what does the bar chart with the lines on top represent?
It is interesting to see that there is high correlation in the middle of the matrix. Maybe we could coordinate the seasonal buying among those departments.
Ah, good point @kaiweitan: the color code is actually relatively arbitrary. Matplotlib chooses the colors to accentuate the differences in the data, so in this case "white" isn't necessarily 0. When we make a final output of this, we'll make sure to get the colors right.
For the second big correlation matrix, it's currently sorted according to the clustering trees that you see on the margins. We could define "cuts" of those trees as clusters, though doing this is a bit of a dark art. Definitely worth looking into.
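A sketch of what cutting the tree could look like with scipy's hierarchical clustering. The toy profiles and the cut height of 0.7 are made up; picking the real threshold is exactly the dark art mentioned above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy seasonal profiles for 6 hypothetical departments (52 weekly counts each).
rng = np.random.default_rng(1)
profiles = rng.random((6, 52))
corr = np.corrcoef(profiles)

# Turn correlation into a distance, build the tree, then "cut" it.
dist = 1 - corr
condensed = dist[np.triu_indices(6, k=1)]      # condensed distance vector
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=0.7, criterion="distance")  # cut at height 0.7
print(labels)                                  # cluster id per department
```

Varying `t` (or using `criterion="maxclust"` with a target cluster count) gives different cuts of the same dendrogram, which is a cheap way to explore how many clusters the matrix really supports.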
Quick update on the network analysis. I managed to subset the data by months, and the graph now looks much cleaner. However, it still doesn't mean much without labels. My next step is to color-code suppliers vs. departments and to add labels. [image: 3months https://cloud.githubusercontent.com/assets/7124729/6738328/8998aa46-ce2e-11e4-81d0-3c7f93f86793.png]
Very cool - we could choose different uses for colors. E.g., color-code by manufacturer or supplier ID rather than by category. That way we could see which organizations are persistently connected to others across time.
hi nick, nice. can you explain a bit how you subsetted by months? the dataset you have seems small; do you mean it is for only one month? if so, which one?
one way to clean up the network graph is to remove nodes with centrality = 0, or with only one tie; these ties are not of interest anyway and can be dropped
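a quick sketch of that pruning step in networkx (node names are made up; degree <= 1 covers both the isolates and the one-tie nodes in a single pass):

```python
import networkx as nx

# Toy department-supplier graph (names are hypothetical).
G = nx.Graph()
G.add_edges_from([
    ("DeptA", "Supplier1"), ("DeptB", "Supplier1"),
    ("DeptA", "Supplier2"), ("DeptC", "Supplier3"),  # DeptC has one tie
])
G.add_node("DeptD")  # isolate: degree centrality = 0

# Drop isolates and one-tie nodes in one pass.
low = [n for n, d in G.degree() if d <= 1]
G.remove_nodes_from(low)
print(sorted(G.nodes()))  # -> ['DeptA', 'Supplier1']
```

note this is a single pass: removing one-tie nodes can create new one-tie nodes, so you could repeat it (or loop to a fixed point) if the graph is still too dense.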
darius
Hi Darius,
Basically I converted the creation_time variable into a datetime variable, which makes it easy to specify a range of dates. The graph covers the first three months of the current data (1/1/2012-3/1/2012).
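Roughly, something like this (the column names and toy rows are illustrative, not the real schema):

```python
import pandas as pd

# Stand-in PO data with creation_time stored as strings.
df = pd.DataFrame({
    "creation_time": ["2012-01-15", "2012-02-20", "2012-06-01"],
    "department": ["Chemistry", "Physics", "Library"],
})

# Convert to datetime, then slice a date range with a boolean mask.
df["creation_time"] = pd.to_datetime(df["creation_time"])
mask = (df["creation_time"] >= "2012-01-01") & (df["creation_time"] < "2012-03-01")
subset = df[mask]
print(len(subset))  # -> 2 (the Jan and Feb rows)
```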
hi nick, i see. i did the exact same thing for 2013. there is some issue with the dataset you are working with: the number of transactions is too low (it looks like there are only a few hundred transactions when there should be about a hundred thousand or more). darius
if you used drop_duplicates, there is a chance you may be throwing out too much data
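To illustrate the risk: if the subset passed to drop_duplicates doesn't include a true unique key, distinct transactions that happen to agree on those columns get collapsed into one. The column names here are hypothetical:

```python
import pandas as pd

# Two POs from the same department on the same day are NOT duplicates.
df = pd.DataFrame({
    "po_id": [101, 102, 103],
    "department": ["Chemistry", "Chemistry", "Physics"],
    "creation_time": ["2012-01-15", "2012-01-15", "2012-01-16"],
})

# Deduping on non-unique columns silently drops a real transaction.
naive = df.drop_duplicates(subset=["department", "creation_time"])
# Deduping on a true unique key keeps everything.
safe = df.drop_duplicates(subset=["po_id"])
print(len(naive), len(safe))  # -> 2 3
```

That kind of over-dropping could explain a few hundred rows surviving where there should be ~100k.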
hey @nlin3330, it looks like you worked on the color-coding stuff in a recent commit... do you have any interesting output / plots from that analysis, or is it still a work-in-progress?
hey guys, i am back online to work on the project, sorry again for the absence. my objective is to get the group some nice network graphs and some hard data by the end of the week. nick is away all week (he is out of the country), but i will be in touch with him on and off
here are some of the plans:
This issue is where we'll discuss the network analysis component. We can post graphs, code snippets, and brainstorms.
The network analysis project aims to find clusters of co-occurrence between departments, manufacturers, suppliers, product types, etc.
Project lead is @dariusmehri along with @nlin3330