IMCR-Hackathon / Hackathon-Central-2018

Command center for IMCR Hackathon participants to share ideas, coordinate teams, develop projects and access all logistics information
3 stars 0 forks source link

Code to report problem cells in Excel spreadsheets #6

Open vanderbi opened 6 years ago

vanderbi commented 6 years ago

FCE LTER scientists often concatenate multiple years of data into a single dataset to submit to me. Each year of data may have been managed in a slightly different way, e.g. using different date formats, different codes for missing data, different site codes, etc. It would be nice if there was a tool that could parse an Excel file and report which date formats are found, what the list of unique site names or species names is, what missing value codes are used (based on a known set of likely candidates), if there are empty cells with spaces in them, if there are some extraneous characters in the file not under column headers, and so on. It would also be nice to have code that would make educated guesses about the dataset and produce an EML attribute list as a starting point for documenting the table.

dpkode commented 6 years ago

This kind of problem rears its head in almost every data-ish field in which I've participated and working on approaches to address the specific problems like the ones you mention would be very practical. OpenRefine (http://openrefine.org/) is a non-scripting tool that could be helpful with some of these problems.