ContextLab / hypertools

A Python toolbox for gaining geometric insights into high-dimensional data
http://hypertools.readthedocs.io/en/latest/
MIT License

writeup #9

Closed jeremymanning closed 7 years ago

jeremymanning commented 7 years ago

After our OpenBCI Hackathon and whatever other polishing we want to do, we should write this up as a brief report in an appropriate forum (e.g. Nature Methods, Journal of Neuroscience Methods, arXiv, PLoS One), and then we should release the code. We should show how we can visualize a few interesting public datasets and use those visualizations to gain insights into the structure of the data. (They could be neuroscience datasets or not; the precise application will also help us narrow down a forum for reporting.)

Proposed title: The Geometry of Big Data

KirstensGitHub commented 7 years ago

I know someone in the political science department; I wonder if he could provide some polling data from the election (though I'm not sure what insights we might gain)

jeremymanning commented 7 years ago

@KirstensGitHub for election data, we can also use fivethirtyeight's data repository: https://github.com/fivethirtyeight/data

There are also some interesting datasets here: https://www.kaggle.com/datasets

And/or we could use any of our lab's neuroscience datasets on discovery

jeremymanning commented 7 years ago

can you guys provide status updates for your datasets?

i'm working with the locations/landmarks dataset, which i've wrangled into a reasonable format. i also have code for downloading the wikipedia article about each landmark. that takes a long time, but i can address it by limiting to only NY and CA landmarks (seems like a reasonable place to start). now i'd like to get topic vectors for each wikipedia article, ideally using an already-fitted topic model. i was hoping i could use some existing code for that...

andrewheusser commented 7 years ago

Still working on mine! Out of a couple that I looked at, I think the mushroom classification might be the most interesting. I was thinking about plotting the different clusters with different symbols, and then changing the reduction technique to see how the clusters are moving around. We could use it as a way to highlight support for multiple reduction techniques. All of the scikit-learn dimensionality reduction techniques have the same API, so we could easily support them, and document the required API for custom reduction functions.
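A minimal sketch of the shared-API idea, assuming scikit-learn and NumPy are installed. The `reduce_with` helper, the `FirstColumns` class, and the synthetic data are hypothetical and not part of hypertools. The only point is that each scikit-learn estimator exposes `fit_transform`, so a custom reduction function would need nothing beyond that one method.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import MDS


def reduce_with(model, data):
    """Project data with any object exposing fit_transform (hypothetical helper)."""
    return model.fit_transform(data)


class FirstColumns:
    """Toy custom reducer: keeps the first n_components columns of the data."""

    def __init__(self, n_components=3):
        self.n_components = n_components

    def fit_transform(self, data):
        return np.asarray(data)[:, : self.n_components]


rng = np.random.default_rng(0)
data = rng.normal(size=(60, 10))  # synthetic stand-in for a real dataset

reducers = {
    "PCA": PCA(n_components=3),
    "ICA": FastICA(n_components=3, random_state=0),
    "MDS": MDS(n_components=3, random_state=0),
    "custom": FirstColumns(n_components=3),  # duck-typed, not a scikit-learn class
}

# Every reducer is called the same way, whatever algorithm sits behind it.
projections = {name: reduce_with(model, data) for name, model in reducers.items()}
```

Swapping the reduction technique then amounts to picking a different key, which is what would make it easy to watch the clusters move between projections.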

andrewheusser commented 7 years ago

also open to exploring other datasets if this seems boring :)

lucywowen commented 7 years ago

I have monthly temperature averages for 19 cities over 150 years, but can add more easily. The static plot looks pretty good, but I still need to work on the animation, and changing the opacity of the trail to show the change over time.

jeremymanning commented 7 years ago

Compiled our google doc into an Overleaf file: https://www.overleaf.com/7651544crgxsftzghby

jeremymanning commented 7 years ago

is everyone clear what you're doing with your datasets to get them ready to integrate into the paper? i was planning to check in yesterday but forgot to... @andrewheusser @KirstensGitHub @lucywowen

jeremymanning commented 7 years ago

Discussed during today's lab meeting:

RESULTS/FIGURES: mushrooms: tweak presentation order; projections (PCA, ICA, MDS, tSNE, etc.) --> poison vs. not poison cluster clearly, but additional clusters are also clear --> k-means clustering. figure 1: projections. figure 2: clustering.
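A rough sketch of the projection-then-clustering step for figures 1 and 2, assuming scikit-learn is available. The data here are synthetic numeric features, not the mushroom dataset (whose categorical columns would first need encoding), and the choice of PCA with two clusters is only an illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two well-separated synthetic groups standing in for the real feature matrix.
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 8))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 8))
features = np.vstack([group_a, group_b])

# Step 1 (figure 1): project to three dimensions for plotting.
projected = PCA(n_components=3).fit_transform(features)

# Step 2 (figure 2): cluster the projected points with k-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(projected)
```

The cluster labels could then drive the plot symbols or colors, so the same figure can be compared across projections.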

temperatures: find a good projection for the trajectories. also use a colorbar to tag the years instead of using text tagging. figure 3: temperature paths with colorbar, possibly in two sub-panels to show the figure 8 structure and the smooth progression from blue to red. figure 4: panel A shows each PC by year; panel B shows the "increasing" PC by temperature.

education: figure 5: panel A shows the cloud of dots colored by performance label; panel B shows the swarm plot; panel C shows the cloud of dots re-colored as a gradient

tweets: figure 6: panel A shows cloud of dots (2d) with 6 days labeled (an HRC outlier, a Trump outlier, an HRC tweet that looks like a Trump tweet, a Trump tweet that looks like an HRC tweet, and two tweets (one Trump, one HRC) from the middle of the "V"); panel B shows the topics (top 5 words from the top 3 topics) for each of those 6 days; panel C shows a "representative example" tweet from each of those days.

Indiana Jones: Figure 7: panel A shows two brain trajectories (average of two randomly chosen groups of subjects); panel B shows the average brain trajectory and the average movie trajectory in "image" space. Several points of panel B could have callouts to the images they reflect-- e.g. (1) a point along the movie trajectory, (2) the corresponding point along the brain trajectory, and (3) a randomly chosen point far away from both trajectories.

ideally you should create these figures in Adobe Illustrator and upload them to the "figs" folder on overleaf. Then we can incorporate them into the paper as follows:

\begin{figure}[p]
\centering
\includegraphics[width=0.5\textwidth]{figs/FILENAME.pdf}
\caption{\textbf{Caption title.} Description of figure.}
\label{fig:FIGURELABEL}
\end{figure}

INTRODUCTION: @KirstensGitHub is going to take a stab at writing a draft of the introduction (target length: 1 page(ish))

METHODS: @andrewheusser is going to take a stab at writing a draft of the methods. This may also involve making a figure illustrating the toolbox organization.

ideally we're going to have these things done by next week (lab meeting on January 18).

our target is still to submit this to bioRxiv by 1/27 so that we can also release the toolbox and submit an abstract to CEMS.

jeremymanning commented 7 years ago

temperature: add panels showing PC1 correlated with year and PC1 correlated with temperature. then maybe combine everything into a single figure?
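One way the two PC1 panels could be computed, sketched with NumPy only. Everything about the data is an assumption: the values are synthetic annual averages (a per-city baseline plus a shared upward trend plus noise), the 1870 start year is made up, and the real dataset is monthly rather than annual.

```python
import numpy as np

rng = np.random.default_rng(0)

years = np.arange(1870, 2020)  # hypothetical 150-year span
n_cities = 19

# Synthetic temperatures: per-city baseline + shared warming trend + noise.
baseline = rng.normal(15.0, 5.0, size=n_cities)
trend = 0.01 * (years - years[0])
noise = rng.normal(0.0, 0.2, size=(len(years), n_cities))
temps = baseline + trend[:, None] + noise  # shape: (years, cities)

# PCA via SVD on the column-centered matrix; rows of vt are the principal axes.
centered = temps - temps.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]

# Correlate PC1 with year and with the across-city mean temperature.
# The sign of a principal component is arbitrary, so only the magnitude matters.
r_year = np.corrcoef(pc1, years)[0, 1]
r_temp = np.corrcoef(pc1, temps.mean(axis=1))[0, 1]
```

Scatter plots of `pc1` against `years` and against the mean temperature would give the two panels.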

jeremymanning commented 7 years ago

things to do:

1.) touch up figures
2.) finish everyone's results (finish text, incorporate figure, make sure there are no "to do" pointers or citations) @KirstensGitHub finish hypercubes example
3.) introduction (@KirstensGitHub) -- finish anscombe figure, fill in remaining citations
4.) abstract @KirstensGitHub
5.) discussion @andrewheusser

timeline:

jeremymanning commented 7 years ago

left to do on text:

super anal things:

jeremymanning commented 7 years ago

@lucywowen or @andrewheusser -- for the global warming figure, can you change "Celsius" on the y-axis to "°C" ?

jeremymanning commented 7 years ago

closing...'cause we're awesome