Udacity - Data Analyst Nanodegree - Project 6 - Data Visualization
I am examining the average arrival delays by select major airlines in select major US destination cities between the years 2000-2008 by day of week. I want to examine which days of the week experienced the most and least arrival delays.
What I found was that 2000 seemed to be the worst year for arrival delays, with several airlines averaging delays of over 10 minutes and United the worst at over 20 minutes.
From 2000 - 2003, you see a steady improvement in arrival delays, with even the worst offenders - Continental and Delta Airlines - coming in under 6 minutes on average. Unfortunately, 2004 - 2008 saw a continual increase for all airlines except Southwest Airlines, with several airlines averaging between 10 and 18 minutes.
By day of week, all airlines seemed to have their lowest delay times on Tuesdays and Saturdays and their highest on Thursdays and Fridays, which makes sense because it matches the typical busy travel days.
I took the flight data from http://stat-computing.org/dataexpo/2009/the-data.html/ and its supplemental data (carriers.csv and airports.csv) and combined the three sources. I then filtered down to a small subset of carriers and destination cities. I filtered because the dataset is so large (over 2GB even after filtering); performance posts I found while researching dimple.js and d3.js describe even 11MB as excessive. So instead of loading the raw records, I pre-processed the data with the append_csv.py script and calculated the average daily arrival delay by carrier, airport, day of week and year. I was originally working with a CSV file, but while researching performance issues I found the suggestion to use a JSON input file, which avoids the parsing phase needed to convert CSV to a JSON object. The carriers and destination cities I kept are listed below:
Code | Description |
---|---|
AA | American Airlines Inc. |
CO | Continental Air Lines Inc. |
DL | Delta Air Lines Inc. |
UA | United Air Lines Inc. |
WN | Southwest Airlines Co. |
State | Cities |
---|---|
"NY" | ["New York"] |
"IL" | ["Chicago"] |
"TX" | ["Dallas","Dallas-FtWorth","Houston","Austin"] |
"CA" | ["Los Angeles", "San Francisco"] |
"GA" | ["Atlanta"] |
"FL" | ["Miami", "Orlando"] |
"MA" | ["Boston"] |
"VA" | ["Arlington","Chantilly"] |
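The carrier and city filter described above could be sketched in Python roughly as follows. This is only an illustration, not the actual append_csv.py: the names `KEEP_CARRIERS`, `KEEP_CITIES` and `keep_flight` are hypothetical, and the sketch assumes each flight record has already been joined to the state and city of its destination airport.

```python
# Illustrative sketch only; not the project's append_csv.py.
# KEEP_CARRIERS, KEEP_CITIES and keep_flight are hypothetical names.

# Carrier codes from the carrier table above.
KEEP_CARRIERS = {"AA", "CO", "DL", "UA", "WN"}

# Destination cities grouped by state, from the state/cities table above.
KEEP_CITIES = {
    "NY": {"New York"},
    "IL": {"Chicago"},
    "TX": {"Dallas", "Dallas-FtWorth", "Houston", "Austin"},
    "CA": {"Los Angeles", "San Francisco"},
    "GA": {"Atlanta"},
    "FL": {"Miami", "Orlando"},
    "MA": {"Boston"},
    "VA": {"Arlington", "Chantilly"},
}


def keep_flight(carrier, dest_state, dest_city):
    """Return True if the flight uses a selected carrier and lands in a selected city."""
    return carrier in KEEP_CARRIERS and dest_city in KEEP_CITIES.get(dest_state, set())
```

Keying the cities by state keeps a same-named city in another state from slipping through the filter.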
I added long/lat data as well as merged the airport and carrier name information into a single flightdelays.json file with the following fields:
Field Name | Field Description |
---|---|
Dest long | longitude of destination airport |
Dest lat | latitude of destination airport |
Dest airport | airport name of destination |
ArrDelay | Arrival Delay in minutes |
UniqueCarrierName | Air Carrier Name |
DayOfWeek | Mon, Tues, Wed, Thu, Fri, Sat, Sun |
Year | 2000 - 2008 |
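As a rough sketch of the aggregation step, the averaging by carrier, airport, day of week and year could look like the Python below. It is not the actual append_csv.py. The function names `average_delays` and `write_json` are hypothetical, and the sketch assumes the rows are already merged dictionaries using the field names in the table above, that the raw `DayOfWeek` is numeric with 1 meaning Monday, and that a missing `ArrDelay` appears as `NA` or an empty string.

```python
import json
from collections import defaultdict

# Assumption: raw DayOfWeek is 1 (Monday) through 7 (Sunday).
DAY_NAMES = ["Mon", "Tues", "Wed", "Thu", "Fri", "Sat", "Sun"]


def average_delays(rows):
    """Average ArrDelay per (carrier, airport, day of week, year).

    rows is an iterable of dicts keyed by the field names in the table above.
    Rows without a usable ArrDelay are skipped (assumed "NA" or empty).
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of delays, flight count]
    for row in rows:
        delay = row["ArrDelay"]
        if delay in ("", "NA"):
            continue
        key = (
            row["UniqueCarrierName"],
            row["Dest airport"],
            row["Dest long"],
            row["Dest lat"],
            int(row["DayOfWeek"]),
            int(row["Year"]),
        )
        totals[key][0] += float(delay)
        totals[key][1] += 1

    records = []
    for key in sorted(totals):
        carrier, airport, lon, lat, day, year = key
        total, count = totals[key]
        records.append({
            "Dest long": lon,
            "Dest lat": lat,
            "Dest airport": airport,
            "ArrDelay": round(total / count, 2),
            "UniqueCarrierName": carrier,
            "DayOfWeek": DAY_NAMES[day - 1],
            "Year": year,
        })
    return records


def write_json(records, path):
    """Write the aggregated records as a JSON array, e.g. to flightdelays.json."""
    with open(path, "w") as handle:
        json.dump(records, handle)
```

Aggregating before the data reaches the browser means the chart loads one record per carrier, airport, weekday and year rather than one per flight.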
I have developed this visualization using primarily dimple.js with some d3.js tweaks based on example visualizations from http://www.dimplejs.org, in particular the bubble chart and interactive legend examples.
I created a dimple storyboard that animates by year from 2000 - 2008. The animation can be paused by selecting the year you want to pause on, and restarted by selecting that year a second time.
I also used an interactive legend by airline carrier for filtering the data points displayed, and I also created a 'chart' to handle filtering of the aggregated data by major city airports.
Note: I did not change the d3 and dimple js library references to use the internet links, as they were not working properly and were timing out.
Please answer by creating an issue on my github repository following the example issue under my name:
https://github.com/cmiller112000/ud-datavis/issues/
https://review.udacity.com/#!/reviews/36166
Reviewer Comments: Awesome job! The JavaScript is well implemented, with good use of semicolons and indentation.
However, there are some issues with the HTML and with how the JavaScript libraries are called. Below I review the different issues:
Javascript libraries import: Instead of working with local d3 files, you can simply call the d3 library from their website (see line below). Please have a look at this link for further information.
https://www.dashingd3js.com/d3js-first-steps
DOCTYPE: this line must be included to allow the browser to render the file properly; more info here:
http://www.w3schools.com/tags/tag_doctype.asp
Encoding: for the browser to load the required character set, you need to include the line below; see more info here:
http://www.w3schools.com/html/html_charset.asp
html content: it must be included in the body; please have a look at this link for a reference HTML template.
http://www.w3schools.com/html/html5_intro.asp
Once you edit your file, you can test your html using this powerful tool
Reviewer Comments: This is a great visualization; you were able to include a lot of information and still make it look great. By selecting airlines, airports and years, viewers can really do a deep exploration of the data. Your d3 code is coded well, and you got some great feedback, well done! When I look at the chart, I understand how average delay times behave over the week, but that makes it an exploratory visualization rather than an explanatory one. What I can't tell from this plot is what drives the average delay times over the week. In your summary I can read: "From a day per week standpoint, all airlines seemed to have their lowest delay times on Tuesdays and Saturdays, and their highest delay times on Thursdays and Fridays. Which makes sense based on the typical busy travel days matching this finding.", so it seems delays are related to the average number of flights per day. This is the key piece I miss in your visualization. By adding this information, your visualization becomes explanatory: users can understand why delay times behave this way. STEPS TO PASS THIS SECTION: Incorporate the average number of flights per day in your chart.
https://github.com/cmiller112000/ud-datavis/issues
Hi @cheryl_592988902, thanks for posting your latest version. I've taken a look and so I'll post a few thoughts on here to encourage more discussion! I hope you don't mind that I've not posted on GitHub.
Some things I like
Some things I like less
Some more ideas
Hi @cheryl_592988902, I agree with the comments from @Charlie. In addition
Nice chart!!
Like
Like Less
Great!!!
Hi @cheryl_592988902,
Nice chart, the design is very good as well as the different transitions
What I like :
What I like less :
Questions
What do you notice in the visualization?
A smooth transition between years
What questions do you have about the data?
How it was collected and how outliers are being handled (cancelled flights etc..)
What relationships do you notice?
Saturday is the best performer in terms of arriving on time.
Tuesday seems to go on and off across years.
Southwest Airlines seems to have worked hard on their delays, going from a below average airline to a champion across the years
What do you think is the main takeaway from this visualization?
There's a lot of focus on Saturday instead of other weekdays for performance of airlines
Is there something you don’t understand in the graphic?
The negative values: I understand that they are flights arriving before the scheduled time, but it's still odd
I hope this helps !
Kind Regards,
Yohann
Regarding the disappearing data, I fixed that and will be providing a new release later today or tomorrow.
What do you notice in the visualization?
What questions do you have about the data?
What relationships do you notice?
What do you think is the main take-away from this visualization?
Is there something you don’t understand in the graphic?
Any Additional Comments:
The new version is hopefully much clearer: I changed the bubble chart to a line chart with line markers, which should make the day to day relationships clearer. Regarding the yellow dots remaining in the same spot: if you notice, the scale changes from year to year, which may be why they appear to stay the same. As for the purpose of the year over year view, it makes it possible to see improvement and/or degradation in arrival times over time.
As for the yellow/orange colors being too close, I changed the yellow to a light purple, so hopefully it's easier to distinguish the different lines.
What do you notice in the visualization?
When filtered by airline, the airline dots (info) disappear from the graph, and when the airline is re-clicked they do not reappear. If all airlines are clicked, the graph goes blank and stays blank; you have to refresh the screen to bring back the data.
Sometimes after picking a year or airport the visualization starts scrolling again when it should stay paused.
What questions do you have about the data?
What relationships do you notice?
What do you think is the main take-away from this visualization?
Is there something you don’t understand in the graphic?
Any Additional Comments:
Thanks Alan, good feedback! I have a new version I will be uploading later today or tomorrow (waiting for feedback from class peers). This new version fixes the disappearing data issue, and makes the airport filter clearer and easier to read. I also changed the bubble graph to a line graph with line markers so that the day to day differences are more obvious. I liked the bouncing balls, but it didn't make that relationship very clear.
I haven't figured out how to keep the animation paused when filtering by carrier or airport yet, still working to figure out how to do that.
What do you notice in the visualization?
What questions do you have about the data?
What relationships do you notice?
What do you think is the main take-away from this visualization?
Is there something you don’t understand in the graphic?
Any Additional Comments:
Thanks Joey! Good feedback.
I have a new release coming later today or tomorrow that fixes the airport selector and a few other issues (like disappearing data). Hopefully that will make it clearer. While I liked the look of the bubble chart, I changed it to a line chart with line markers. It makes the day to day relationships much clearer.
Re: "How to keep raw data integrity in check - i.e.: one time meaning a plane pulls away from the gate, or actually takes off?"
I'm not clear on what you are asking. The raw data I based this on had multiple 'delay' timings and some (but limited) cause indicators. However, the data set itself, even filtered down to just these carriers and airports, was still almost 2GB and would never load in the browser using the tools I've been given. So I decided to concentrate on the average arrival delay, thinking that from a consumer standpoint that is what most people would care about. I definitely see where the airlines or regulators would care much more about drilling down into specific causes. Is that what you were referring to?
Yes, that is what I was referring to, and it was more industry related, but given each airline had their own criteria for the definition of on time... Well, what can you do to control that?
I am really impressed with how you tamed THAT MUCH data in one file. Very nice!