dklinges9 closed this issue 5 years ago.
By the way, I flagged this with the best practices label, which we can use for "issues" that aren't really issues but should be left open. Periodically, we'll fold these into the contributing guidelines and then close them.
Added this to a contributing.md that I just initialized.
We do a lot of importing of datasets, primarily Excel files and .csv. For Excel files, I know that I am careful to check how column types are parsed and whether data are lost. But for .csv files, especially ones that we already curated, I admit that I am not. This can sometimes be a serious issue: if the first 1,000 values of a column are `NA`, the column may be parsed as logical, which can turn all of the column's values into `NA`.
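To make the pitfall concrete, here is a minimal sketch (the column name `total_pb210_activity` is one of ours; everything else here is made up). By default readr guesses column types from roughly the first 1,000 rows, so a column whose first 1,000 rows are empty gets guessed as logical and the later numbers are lost, with only a warning:

```r
library(readr)

# Write a demo .csv whose numeric column starts with 1,000 missing values
demo_file <- tempfile(fileext = ".csv")
write_csv(
  data.frame(
    core_id = 1:1500,
    total_pb210_activity = c(rep(NA, 1000), runif(500, 0, 50))
  ),
  demo_file
)

bad <- read_csv(demo_file)          # emits parsing warnings
summary(bad$total_pb210_activity)   # every value is NA: the real data are gone
problems(bad)                       # lists the rows that failed to parse
```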
This may go unnoticed later on. I want to improve my practices as well as all of ours. Three steps that I think we should be taking:

1. Parse individual columns more carefully. Each `read_csv()` call should very likely be accompanied by a few `col_numeric()` specifications for columns that would otherwise be parsed incorrectly.
2. Follow each `read_csv()` call with a `stop_for_problems()` call. This stops the script from running if there are parsing problems; rather than letting those warnings slide by, we should start taking them seriously.
3. Change the default parser when needed, typically to number.
These three practices together could look something like this:
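A minimal sketch, assuming a hypothetical file `data/cores.csv` and placeholder column names (`study_id`, `site_id`, and `core_date` are made up; `total_pb210_activity` is one of the real columns discussed below and is covered by the numeric default). I'm using readr's `col_number()`, `col_character()`, and `col_date()` as stand-ins for whichever parsers each column actually needs:

```r
library(readr)

cores <- read_csv(
  "data/cores.csv",                 # hypothetical path
  col_types = cols(
    .default  = col_number(),       # step 3: default parser changed to number
    study_id  = col_character(),    # step 1: explicitly type the columns that
    site_id   = col_character(),    #         should NOT be numeric
    core_date = col_date()
  )
)

# step 2: halt the script instead of letting parsing warnings slide by
stop_for_problems(cores)
```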
The issue with reading in this dataset using the defaults was that some of the columns (e.g. `total_pb210_activity`) were parsed as logical because of their many `NA`s, which dropped all of the actual values. Changing the default to number, and then specifying which columns should not be numeric, fixes this.

Hadley's Data Import chapter in R for Data Science helped inform this. Let's keep these practices in mind when hooking new data, but also, just as importantly, when formatting and joining already-curated data.
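Along those lines, a couple of quick readr checks could be worth running after any import, whether the data are new or already curated (shown with the hypothetical `cores` object from the sketch above):

```r
spec(cores)       # the column specification that was actually used
problems(cores)   # any rows/columns that failed to parse
```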