Closed by jeremymanning 7 years ago
I know someone in the political science department; I wonder if he could provide some polling data from the election (though I'm not sure what insights we might gain)
@KirstensGitHub for election data, we can also use fivethirtyeight's data repository: https://github.com/fivethirtyeight/data
There are also some interesting datasets here: https://www.kaggle.com/datasets
And/or we could use any of our lab's neuroscience datasets on discovery
can you guys provide status updates for your datasets?
i'm working with the locations/landmarks dataset, which i've wrangled into a reasonable format. i also have code for downloading wikipedia articles about each landmark. that takes a long time, but i can address that issue by limiting to only NY and CA landmarks (seems like a reasonable place to start). now i'd like to get topic vectors for each wikipedia article, ideally using an already-fitted topic model. i was hoping i could use some existing code for that...
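One possible way to get topic vectors, sketched with scikit-learn's `CountVectorizer` and `LatentDirichletAllocation`. This is only an illustration: the article texts are made-up placeholders, the number of topics is arbitrary, and the model is fit here on a toy corpus rather than loaded from an existing fitted model. The point is that once a model is fit, `transform` gives a topic vector for any new article without refitting.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder texts standing in for the corpus the topic model is fit on.
training_articles = [
    "the bridge spans the river and carries traffic into the city",
    "the museum houses paintings sculpture and modern art collections",
    "the park has hiking trails a lake and mountain views",
    "the bridge was built with steel cables and stone towers",
    "the art museum opened a new gallery for sculpture",
    "visitors hike the mountain trails around the lake in the park",
]

# Fit the vocabulary and the topic model once.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(training_articles)
topic_model = LatentDirichletAllocation(n_components=3, random_state=0)
topic_model.fit(counts)

# Reuse the fitted vectorizer and model on new articles (no refitting).
new_articles = [
    "a stone bridge over the river near the city",
    "a gallery of modern art and sculpture",
]
topic_vectors = topic_model.transform(vectorizer.transform(new_articles))

# One row per article, one column per topic; each row sums to 1.
print(topic_vectors.shape)
```

The fitted `vectorizer` and `topic_model` have to be kept together, since the model's columns only make sense with the vocabulary it was trained on.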
Still working on mine! Out of a couple that I looked at, I think the mushroom classification might be the most interesting. I was thinking about plotting the different clusters with different symbols, and then changing the reduction technique to see how the clusters are moving around. We could use it as a way to highlight support for multiple reduction techniques. All of the scikit-learn dimensionality reduction techniques have the same API, so we could easily support them, and document the required API for custom reduction functions.
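A rough sketch of what "same API" could look like in practice. The helper names (`apply_reduction`, `top_variance_columns`) are hypothetical and not part of the toolbox; the data are random numbers, not the mushroom dataset. The assumed contract is just this: a reducer is either an object with a scikit-learn style `fit_transform` method or a plain function that maps an (observations x features) array to an (observations x dimensions) array.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import MDS, TSNE


def apply_reduction(data, reducer):
    """Reduce `data` with a scikit-learn style object or a plain callable."""
    if hasattr(reducer, "fit_transform"):
        return np.asarray(reducer.fit_transform(data))
    return np.asarray(reducer(data))


def top_variance_columns(data, ndims=2):
    """Toy custom reduction: keep the `ndims` highest-variance columns."""
    keep = np.argsort(data.var(axis=0))[::-1][:ndims]
    return data[:, keep]


# Random stand-in data: 60 observations, 8 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(60, 8))

reducers = {
    "PCA": PCA(n_components=2),
    "ICA": FastICA(n_components=2, random_state=0),
    "MDS": MDS(n_components=2, random_state=0),
    "t-SNE": TSNE(n_components=2, perplexity=10, random_state=0),
    "custom": top_variance_columns,
}

# Every reducer goes through the same call, so swapping techniques is one line.
embeddings = {name: apply_reduction(data, r) for name, r in reducers.items()}
for name, embedding in embeddings.items():
    print(name, embedding.shape)
```

Checking for `fit_transform` rather than `transform` matters here, because `MDS` and `TSNE` only provide `fit_transform`.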
also open to exploring other datasets if this seems boring :)
I have monthly temperature averages for 19 cities over 150 years, but can add more easily. The static plot looks pretty good, but I still need to work on the animation, and changing the opacity of the trail to show the change over time.
Compiled our google doc into an Overleaf file: https://www.overleaf.com/7651544crgxsftzghby
is everyone clear what you're doing with your datasets to get them ready to integrate into the paper? i was planning to check in yesterday but forgot to... @andrewheusser @KirstensGitHub @lucywowen
Discussed during today's lab meeting:
RESULTS/FIGURES: mushrooms: tweak presentation order; projections (PCA, ICA, MDS, tSNE, etc.) --> poison vs. not poison cluster clearly, but additional clusters are also clear --> k-means clustering. figure 1: projections. figure 2: clustering.
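For the projections --> k-means step, a minimal sketch might look like the following. It uses synthetic, well-separated blobs as a stand-in (the real mushroom features are categorical and would need to be encoded numerically first), and the choice of 3 clusters and the marker symbols are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in: three well-separated groups of 50 points in 10 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=8.0, size=(3, 10))
features = np.vstack([c + rng.normal(size=(50, 10)) for c in centers])

# Projection used for plotting (figure 1 style).
projected = PCA(n_components=2).fit_transform(features)

# Clustering on the original features (figure 2 style).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(features)

# Give each cluster its own plot symbol, so the same points can be re-plotted
# under a different projection and the clusters tracked by marker.
symbols = ["o", "s", "^"]
point_symbols = [symbols[label] for label in cluster_labels]

print(projected.shape, np.bincount(cluster_labels))
```

Because the cluster labels are computed once on the full feature space, they stay fixed while the projection changes, which is what makes it possible to see how the clusters move around.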
temperatures: find a good projection for the trajectories. also use a colorbar to tag the years instead of using text tagging. figure 3: temperature paths with colorbar, possibly in two sub-panels to show the figure 8 structure and the smooth progression from blue to red. figure 4: panel A shows each PC by year; panel B shows the "increasing" PC by temperature.
education: figure 5: panel A shows the cloud of dots colored by performance label; panel B shows the swarm plot; panel C shows the cloud of dots re-colored as a gradient
tweets: figure 6: panel A shows cloud of dots (2d) with 6 days labeled (an HRC outlier, a Trump outlier, an HRC tweet that looks like a Trump tweet, a Trump tweet that looks like an HRC tweet, and two tweets (one Trump, one HRC) from the middle of the "V"); panel B shows the topics (top 5 words from the top 3 topics) for each of those 6 days; panel C shows a "representative example" tweet from each of those days.
Indiana Jones: Figure 7: panel A shows two brain trajectories (average of two randomly chosen groups of subjects); panel B shows the average brain trajectory and the average movie trajectory in "image" space. Several points of panel B could have callouts to the images they reflect-- e.g. (1) a point along the movie trajectory, (2) the corresponding point along the brain trajectory, and (3) a randomly chosen point far away from both trajectories.
ideally you should create these figures in Adobe Illustrator and upload them to the "figs" folder on overleaf. Then we can incorporate them into the paper as follows:
\begin{figure}[p]
\centering
\includegraphics[width=0.5\textwidth]{figs/FILENAME.pdf}
\caption{\textbf{Caption title.} Description of figure.}
\label{fig:FIGURELABEL}
\end{figure}
INTRODUCTION: @KirstensGitHub is going to take a stab at writing a draft of the introduction (target length: 1 page(ish))
METHODS: @andrewheusser is going to take a stab at writing a draft of the methods. This may also involve making a figure illustrating the toolbox organization.
ideally we're going to have these things done by next week (lab meeting on January 18).
our target is still to submit this to bioRxiv by 1/27 so that we can also release the toolbox and submit an abstract to CEMS.
temperature: add panels showing PC1 correlated with year and PC1 correlated with temperature. then maybe combine everything into a single figure?
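A sketch of how those two correlations could be computed, using only NumPy. The temperatures below are simulated (19 cities, 150 yearly values, an assumed linear warming trend plus noise), not the real dataset, and yearly rather than monthly averages are used to keep the example short.

```python
import numpy as np

# Simulated stand-in data: rows are years, columns are cities.
rng = np.random.default_rng(0)
n_years, n_cities = 150, 19
year_index = np.arange(n_years)
city_baselines = rng.uniform(5.0, 25.0, size=n_cities)  # assumed mean temps
trend = 0.01 * year_index                               # assumed warming per year
noise = rng.normal(scale=0.3, size=(n_years, n_cities))
temps = city_baselines + trend[:, None] + noise

# PCA via SVD on the column-centered (years x cities) matrix.
centered = temps - temps.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ components[0]  # one PC1 score per year

# Correlate PC1 with year and with the across-city mean temperature.
mean_temp = temps.mean(axis=1)
r_year = np.corrcoef(pc1, year_index)[0, 1]
r_temp = np.corrcoef(pc1, mean_temp)[0, 1]

# The sign of a principal component is arbitrary, so compare magnitudes.
print(abs(r_year), abs(r_temp))
```

If PC1 comes out negatively correlated with year, flipping its sign before plotting keeps the two panels visually consistent.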
things to do:
1.) touch up figures
2.) finish everyone's results (finish text, incorporate figure, make sure there are no "to do" pointers or citations) @KirstensGitHub finish hypercubes example
3.) introduction (@KirstensGitHub) -- finish anscombe figure, fill in remaining citations
4.) abstract @KirstensGitHub
5.) discussion @andrewheusser
timeline:
left to do on text:
jeremy go through global warming example
jeremy go through decoding example
jeremy go through discussion
jeremy go through abstract
final pass through text (andy + jeremy)
fix some minor formatting things in bibtex (jeremy)
super anal things:
@lucywowen or @andrewheusser -- for the global warming figure, can you change "Celsius" on the y-axis to "°C" ?
closing...'cause we're awesome
After our OpenBCI Hackathon and whatever other polishing we want to do, we should write this up as a brief report in an appropriate forum (e.g. Nature Methods, Journal of Neuroscience Methods, arXiv, PLoS One), and then we should release the code. We should show how we can visualize a few interesting public datasets and use those visualizations to gain insights into the structure of the data. (They could be neuroscience datasets or not; the precise application will also help us narrow down a forum for reporting.)
Proposed title: The Geometry of Big Data