Brown University, Fall 2015
BIOL2430-S04 (CRN:14763)
Topics in Ecology and Evolutionary Biology
Fridays 1-2:50p
Khoo Multimedia Lab (Room N320), Granoff Center
Instructor: Casey Dunn
Office hours: Monday 1:00-2:30PM, Room 301, Walter Hall (80 Waterman St.)
Prefix the subject line of all course-related emails with "data: "
Science is becoming more data intensive, and at the same time new tools are allowing scientists to interact with data in new ways. This seminar will explore the potential impacts of these new ways of interacting with data on the practice of science, how these approaches can be used most effectively (with an emphasis on design and human perception), and introduce students to some tools that embody these changes, including version control (https://git-scm.com/), executable manuscripts (http://yihui.name/knitr/), and interactive visualizations (http://d3js.org/).
Please complete the survey if you register for or intend to sit in on the seminar.
This course is organized with github education tools.
Classes will consist of discussions, labs that examine particular tools, and student presentations. The schedule includes topics for each class, and conversation points to seed class discussions. There will be a strong focus on human perception as it relates to insight and on principles of design.
After the first meeting, and prior to the project presentations, the discussion for each class will be led by one or more students. These students will meet with me during office hours on the Monday preceding the class to map out the plan for the discussion.
All materials for the course, including the syllabus, are available at the course site. The syllabus will be updated as the course progresses, so please check it weekly. Please submit suggestions and corrections for the class via the issue tracker.
Each student will create an interactive analysis/visualization based on their own work, publicly available data, or a published scientific paper. This project will be presented in class at the end of the course.
Final projects will be developed and submitted in a git repository. Please fork the boilerplate repository for the assignment, and follow section 3 of these instructions. After you fork the repository, please enable the issue tracker in the repository settings so that others in the class (including the professor) can provide feedback.
The preferred approach is to work on your final project in a public repository to make it easy for everyone to see it. If you have unpublished data that you don't want to put in a public repository, please talk with me and we'll come up with a solution.
Reading includes book chapters, online resources, and videos to be watched ahead of class. The dates the readings will be discussed in class are listed in the schedule, but some will be useful to you much earlier as you work on your projects. In addition, the reading load is very uneven. On light weeks, it is good to get a jump on reading for future weeks.
Tufte, ER (2001). The Visual Display of Quantitative Information, 2nd edition. amazon
Murray, S (2013). Interactive Data Visualization for the Web. online
Haddock, SHD and CW Dunn (2011). Practical Computing for Biologists. amazon
Reading: Murray chapters 1, 2
Assignment: In the next couple of days, use the issue tracker to submit a visualization or two that you particularly like.
Intro to class, description of final projects
Investigators interact with data in several ways:
Designing analyses, implementing analyses, and running analyses are often treated as different tasks. In many studies, the investigator moves data through each stage of analysis by hand, which takes a long time and is error prone. Automated and interactive analyses separate the design/implementation of analyses from running analyses, and make running analyses very easy and reliable. This means that you can repeatedly run analyses before you collect your data (using simulations or other datasets), as you collect your data (to check data quality and assess how many more data are needed), and after you collect your data (to refine and extend analyses). This doesn't just speed up analyses, it fundamentally changes the way they are approached.
There used to be a small number of models of scientific publication; now there are many.
These models vary in a few key dimensions:
Other recent developments:
What are missing publication models?
Reading: Haddock and Dunn, chapter 4 and new chapter
Walk through git example
Intro to markdown
Discuss participant-submitted visualizations
Reading: Tidy Data (http://vita.had.co.nz/papers/tidy-data.html); Haddock and Dunn chapters 1-3 and pages 255-260; Murray chapter 3
Before class, install:
See Haddock & Dunn Figure 15.1 for examples of messy and tidy data.
Key points from Wickham's Tidy data paper:
Tidy data have the following properties: each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
Melting is the process of stacking data so that values from different columns end up in different rows. It results in a dataset with fewer columns and more rows. A completely molten dataset is a triple store (three columns that store object, attribute, and value).
Casting is the process of unstacking data so that information for a single observation that is spread across multiple rows is placed in multiple columns.
Tidy tools input tidy data and output tidy data. If you have tidy data and tidy tools, no munging is needed to format the output of one analysis step so that it is ready for the next analysis step. That is, all your code is focused on analysis.
Visualization is the process of mapping variables to aesthetic attributes of a graph (e.g., defining which variables specify position).
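To make melting and casting concrete, here is a minimal sketch in plain Python. The helper names `melt` and `cast`, the list-of-dicts table layout, and the specimen measurements are all assumptions made up for illustration; they are not taken from Wickham's paper or from any particular library.

```python
# Illustrative sketch of melting and casting on a list-of-dicts "table".
# The sample measurements below are made up for demonstration.


def melt(rows, id_column):
    """Stack every non-id column into (object, attribute, value) triples."""
    triples = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                triples.append((row[id_column], column, value))
    return triples


def cast(triples, id_column):
    """Unstack (object, attribute, value) triples into one row per object."""
    rows = {}
    for obj, attribute, value in triples:
        rows.setdefault(obj, {id_column: obj})[attribute] = value
    return list(rows.values())


wide = [
    {"specimen": "A", "length_mm": 12.0, "width_mm": 3.5},
    {"specimen": "B", "length_mm": 15.5, "width_mm": 4.0},
]

# Melting: two rows with two measured columns become four triples.
molten = melt(wide, "specimen")
for triple in molten:
    print(triple)

# Casting the molten data recovers the original wide table.
print(cast(molten, "specimen") == wide)
```

The molten form has more rows and fewer columns than the wide form, and casting is its inverse here because each specimen has exactly one value per attribute.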
See the regex folder.
To view web sites locally, it is best to serve them through a web server rather than just double-clicking the html file. This makes sure that javascript etc. renders correctly. The simplest is python's simple server:
cd website_dir/
python -m SimpleHTTPServer
Here website_dir is the directory with your site files. With Python 3, the equivalent command is python -m http.server. Once it is running, enter the url (e.g. http://localhost:8000/) into your browser to see the rendered page.
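The same kind of local server can also be started from a Python 3 script using the standard library. The sketch below assumes Python 3.7 or later. The helper name `start_server` and the throwaway one-page site are invented for the demonstration. It uses 127.0.0.1 and an operating-system-chosen port so that the example cleans up after itself.

```python
# Sketch: serve a directory of site files from a Python 3 script.
import functools
import http.client
import http.server
import pathlib
import tempfile
import threading


def start_server(directory, port=0):
    """Serve `directory` on 127.0.0.1 in a background thread.

    port=0 asks the operating system for any free port.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Demonstration with a throwaway site; pass your own site directory instead.
with tempfile.TemporaryDirectory() as site_dir:
    pathlib.Path(site_dir, "index.html").write_text("<h1>hello</h1>")
    server = start_server(site_dir)
    port = server.server_address[1]

    # Fetch the page the same way a browser would, over http.
    connection = http.client.HTTPConnection("127.0.0.1", port)
    connection.request("GET", "/index.html")
    page = connection.getresponse().read().decode()
    connection.close()

    server.shutdown()
    server.server_close()

print(page)
```

For everyday use the one-line command is simpler. A script like this is mainly useful when the server needs to start and stop as part of a larger workflow.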
Download the example code for the Murray book, expand it, then cd to the folder and launch the simple server. Explore the examples in your browser.
Reading: Murray chapters 2,4,5,6
Before class, install:
The two topics today have a lot in common. We embed R code in markdown documents that changes the document according to data when executed by knitr. We embed d3 javascript in html documents that changes the document according to data when executed by the browser.
Markdown
pandoc is a great tool for rendering markdown to other formats. It can accommodate bibliographic data as well.
pandoc flavored markdown. Includes functionality relevant to academics, e.g. bibliographies.
Executable manuscripts with knitr
.rmd extension.
knitr package, which renders code and code output to latex or markdown. This produces a document that is just marked up text.
eval specifies if the code is evaluated (i.e., run). echo controls if the code itself is included in the final document. include controls if the results of running the code (e.g. plots) are shown in the document.
knitr documentation.
There are a variety of great online courses for learning javascript. If you don't have experience with javascript, check them out. See, for example, the courses at Codecademy and Code School.
The basics of drawing with data.
Reading: Tufte (the whole book); Haddock and Dunn chapters 17-19
Visualization is the act of mapping data to aesthetic properties. The principles of design clarify which aesthetic properties we have to work with.
Some examples:
Different participants will discuss different chapters:
Counterpoint - Jer Thorp: I have millions of pixels
Thoughts:
Best practices for modern media:
Reading: Tufte (all), Murray chapters 7-9.
Go through remaining Tufte chapters
Overview of scales, axes, and transitions in d3. Exercises to manipulate code, starting with iris.
Reading: Murray (the whole book); Shneiderman 1996 - "Visual Information-Seeking Mantra: overview first, zoom and filter, then details on demand."
Watch in advance of class:
In class:
Go over people's exercises.
Dynamic interactions:
A spectrum of approaches:
Exploration needs to be very unconstrained, while exposition requires that the author direct the audience's perspective through constraint. The extreme of constrained dynamic perspective is a video.
Imagine a VR movie without any constraint, where the audience could roam anywhere they like. They would be far from all the key action and have no idea what the movie was "about". Maybe the best exposition is fully constrained. These trade-offs are well illustrated by http://www.fallen.io/ww2/ .
Tropes - zoom and enhance.
Watch in advance of class:
Some other videos:
https://vimeo.com/36579366 "Creators need an immediate connection to what they create... if you make a decision you need to see the effect of that right away."
https://vimeo.com/66085662 Excel is fixed structure, fill in our data dynamically. Illustrator is direct manipulation of flexible structure, not dynamically mapped to data. D3 is indirect manipulation of flexible structure with dynamic data interaction. Demonstrates the missing tool - direct manipulation of flexible structure with dynamic data interaction.
https://vimeo.com/67076984 Science - thinking that goes from system to theory. Engineering - thinking that goes from theory to system. We build instruments that adapt phenomena we can't perceive to our senses. We need tools to adapt unthinkable thoughts to the way our minds work.
Guest lecture - Mark Howison.
An introduction to Google Cardboard, including existing development tools.
VR does a few things:
Augmented reality:
You can craft special urls to load html files in git repos as web pages, eg:
https://github.com/antropoteuthis/finalproject/blob/master/ISCPhyloecospace.html # github url
https://rawgit.com/antropoteuthis/finalproject/master/ISCPhylomorphospace.html # as web page
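The pattern in the two urls above (swap the host and drop the blob path segment) can be sketched as a small string transformation. The helper name `github_to_rawgit` is hypothetical, and the function only rewrites the text of the url; it does not check that either address actually resolves.

```python
# Sketch of the url pattern shown above; github_to_rawgit is a hypothetical helper.
from urllib.parse import urlsplit, urlunsplit


def github_to_rawgit(github_url):
    """Turn a github.com .../blob/<branch>/<path> url into its rawgit.com form."""
    parts = urlsplit(github_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com url")
    # The path looks like /<user>/<repo>/blob/<branch>/<file path>
    segments = parts.path.strip("/").split("/")
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError("expected a .../blob/<branch>/<path> url")
    # Keep user and repo, drop "blob", keep branch and file path.
    new_path = "/" + "/".join(segments[:2] + segments[3:])
    return urlunsplit((parts.scheme, "rawgit.com", new_path, "", ""))


print(github_to_rawgit(
    "https://github.com/antropoteuthis/finalproject/blob/master/ISCPhyloecospace.html"
))
```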
A couple of ways to propose changes, fixes, or suggestions:
Guest lecture by Max Leiserson about his project MAGI.
Please complete the first draft of your readme, and come to class prepared to talk for five minutes about the goals, status, and current challenges of your projects.
Project presentations
Visit by Sohini Ramachandran.
Project presentations
http://99percentinvisible.org/episode/future-screens-are-mostly-blue/
http://worrydream.com/ABriefRantOnTheFutureOfInteractionDesign/
One of the most common sets of bugs in d3 comes from the asynchronous nature of javascript. Don't assume that your data are available when you go to use them just because you have already called the code that loads them. This is why it is best to call your visualization code from within the anonymous function called by your data loading, e.g. d3.csv("data.csv", function(d){ visualize data here }). If you are unsure whether your data are available, inspect them with console.log() right before you use them to see if they are populated.
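The same pitfall can be imitated outside the browser. The sketch below is a Python analogue, not d3 code. `load_csv_async`, `run_event_loop`, and the two fake rows are hypothetical stand-ins for d3.csv, the browser's event queue, and a real data file.

```python
# Python analogue of the asynchronous loading pitfall.
# load_csv_async is a hypothetical stand-in for d3.csv; it reads no real file.

pending = []  # callbacks waiting to run, like the browser's event queue


def load_csv_async(path, callback):
    """Schedule callback(rows) to run later instead of running it now."""
    fake_rows = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
    pending.append(lambda: callback(fake_rows))


def run_event_loop():
    """Run the queued callbacks, as the browser does once loading finishes."""
    while pending:
        pending.pop(0)()


data = []              # filled in by the callback
seen_in_callback = []  # what the callback itself could see


def on_loaded(rows):
    data.extend(rows)
    # Safe: inside the callback the rows are guaranteed to exist.
    seen_in_callback.append(len(rows))


load_csv_async("data.csv", on_loaded)

# Unsafe: the callback has not run yet, so data is still empty here.
rows_seen_too_early = len(data)
print("rows visible right after the call:", rows_seen_too_early)

run_event_loop()
print("rows visible inside the callback:", seen_in_callback[0])
```

Code placed right after the loading call sees an empty dataset, while code inside the callback sees every row. That is why the visualization code belongs inside the callback.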
select() only selects the first matching element. If you want to do more, use selectAll().
pca() function in http://bl.ocks.org/ktaneishi/9499896 .