datacarpentry / planning

This repo is for discussion and materials for planning lessons or other issues

Messy data #3

Closed ethanwhite closed 6 years ago

ethanwhite commented 9 years ago

As mentioned in https://github.com/datacarpentry/sql-ecology/pull/18 and discussed elsewhere, we often need data that is at least not perfectly tidy for the purposes of teaching what to do with messy data.

@skmorgane and I are happy to support the inclusion of messy versions of the data in the Portal Teaching Database for this purpose if that's desirable. We just need to decide on what exactly we want to include in this regard.

lwasser commented 9 years ago

I think this is a great idea. Having a few issues in the data creates "learning moments" where students can see how inconsistency can affect analysis.

https://github.com/datacarpentry/python-ecology

The Python lessons currently have a few things "wrong" with the data. For one, the species table has a species_id column and a species column for the full name, while the observations table uses species as its key -- this creates an interesting outcome when joining with pandas. I will know more about the R lessons after next week.
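For anyone who hasn't hit this, here is a minimal sketch of the join quirk described above. The column names match the lesson tables, but the rows are made up for illustration only:

```python
import pandas as pd

# Hypothetical minimal versions of the two tables described above.
species = pd.DataFrame({
    "species_id": ["DM", "DO"],
    "species": ["Dipodomys merriami", "Dipodomys ordii"],
})
surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species": ["DM", "DO", "DM"],  # actually holds the species *ID*
})

# A naive join on the shared column name matches IDs against full names,
# so no rows line up:
naive = surveys.merge(species, on="species")
print(len(naive))  # 0

# Joining the ID column against species_id gives the intended result:
fixed = surveys.merge(species, left_on="species", right_on="species_id")
print(fixed)
```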

I suppose the alternative is to have students "mess up" the data prior to the lesson to see "what happens if". Does anyone else have any thoughts on this?

naupaka commented 9 years ago

I have a set of 'messed up' files from the WSU workshop that are specifically tailored to teaching OpenRefine. I could imagine that different versions would be useful in different contexts, but also see the value in having a canonical set of broken files.

ethanwhite commented 9 years ago

> I could imagine that different versions would be useful in different contexts, but also see the value in having a canonical set of broken files.

I can see that a small number of different messy files would be useful. What we want to avoid is having every instructor/lesson make a separate set of messy versions. Let's list out here the different things that would be useful in different contexts, see what we end up with, and figure out how we might usefully combine them into a small number of messy files.

tracykteal commented 9 years ago

I have a messy dataset that I've used several times now, and it seems to work well:

http://datacarpentry.github.io/2015-05-29-great-plains/spreadsheet-ecology/survey_data_tabs.xls

and the associated lesson, which I need to clean up and submit as a PR: http://datacarpentry.github.io/2015-05-29-great-plains/spreadsheet-ecology/

tracykteal commented 9 years ago

Good point. I use the above in the spreadsheet lesson, but I think you want a different messy file for OpenRefine.

For spreadsheets, you want the data messed up so people can go through it and identify common spreadsheet problems, so you don't want 35,000 records. For OpenRefine, you want to demonstrate those same principles during cleanup, and there a large number of records is an important component.

I do really like @naupaka's idea of doing the cleanup in OpenRefine and then going on to use those files in SQL and/or R. For that we'd also have to provide the cleaned-up version, so people could use it if they didn't have a chance to finish cleaning. Or we could have people use that version by default, so we ensure it was cleaned up properly, but still refer back to it: 'remember how we cleaned it up so that sex only had M and F? We can see here in R how those are the only two factors'.
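As a rough pandas analogue of that R factor check (the column and values here are illustrative only, not the lesson's actual code):

```python
import pandas as pd

# Hypothetical survey rows before and after cleanup; values are made up.
messy = pd.DataFrame({"sex": ["M", "F", "m", "F ", "Male"]})
print(messy["sex"].unique())   # ['M' 'F' 'm' 'F ' 'Male'] -- cleanup needed

clean = pd.DataFrame({"sex": ["M", "F", "F", "M", "M"]})
print(clean["sex"].unique())   # ['M' 'F'] -- only the two expected categories
```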

However, everyone generating their own file could introduce errors into the next lesson.

naupaka commented 9 years ago

@tracykteal I used the same file as you did to teach cleaning in Excel; I think it worked really well. At WSU, we had people use their own cleaned files from OpenRefine for SQL, but we also had a cleaned version ready to go in case people were not able to get through all the cleaning steps when we went through them as a group (I'd say about 10% of students?).

lwasser commented 9 years ago

Hi guys -- to follow up on @ethanwhite's suggestion, here is the beginning of a list of things we might want in our messy dataset. Then we can create/use one dataset that has all of that messiness in it. :) I think it's best to identify the components we want to ensure are in the data rather than look for the best messy dataset, if that makes any sense. :) There are great opportunities to learn from this dataset as things go awry in R given duplicate column names, incorrect subsets, etc.

  1. Two CSVs that share a column heading but where the columns actually contain different data, e.g. a species column that holds the species ID in one table and the full species name in the other (alongside a species_id column).
  2. Data with species names that are not spelled consistently in each row, so you need to pattern match or clean with OpenRefine.
  3. A column of numeric data that contains text, causing the column to import as a string rather than a numeric type (see the sketch after this list).
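To make points 2 and 3 concrete, here is a small pandas sketch with made-up rows; the column names echo the Portal tables but the data and cleanup steps are illustrative only, not a proposed lesson:

```python
import io
import pandas as pd

# Hypothetical CSV with inconsistently spelled species names (point 2)
# and a numeric column polluted with free text (point 3).
raw = io.StringIO(
    "record_id,species,weight\n"
    "1,Dipodomys merriami,34\n"
    "2,dipodomys  merriami,not recorded\n"
    "3,Dipodomys Merriami ,41\n"
)
surveys = pd.read_csv(raw)

# Point 2: normalise case and whitespace so spelling variants collapse.
surveys["species"] = (
    surveys["species"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
)
print(surveys["species"].unique())   # ['dipodomys merriami']

# Point 3: the stray text made weight import as strings (dtype object);
# coercing turns unparseable values into NaN and restores a numeric dtype.
print(surveys["weight"].dtype)       # object
surveys["weight"] = pd.to_numeric(surveys["weight"], errors="coerce")
print(surveys["weight"].dtype)       # float64
```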

I think there is more as well, but these are the things I can remember being useful learning moments offhand. If there is a better medium for adding to the list, I'm happy to post there as well. Cheers, Leah

ethanwhite commented 9 years ago

Additional discussion of packaging data is going on in https://github.com/datacarpentry/workshop-ecology/issues/1

ethanwhite commented 6 years ago

Messy data was added to the teaching database a while back.