geysertimes / geysertimes-r-package

R package for accessing and analyzing the GeyserTimes database

Clean Up duration Variable #6

Open spkaluzny opened 5 years ago

spkaluzny commented 5 years ago

I think we should provide a clean version of the duration column in the main eruption data set. Users will want duration to be some form of numeric column for analysis. It is currently a character column with widely varying formats, e.g. 1.5m, d=05m00s, 10m 24s, 6m30s, 1h18m00s and ~32m.
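As a rough illustration of what such a cleanup could look like, here is a minimal sketch in base R that handles only the example formats listed above. The function name parse_duration_seconds is hypothetical and is not part of the geysertimes package, and this is not the parser GeyserTimes itself uses. It assumes that a leading "d=" and a "~" marker can simply be dropped, and it returns NA for anything it does not recognize.

```r
# Hypothetical helper, not part of the geysertimes package.
# Converts free-text durations such as "1h18m00s" or "~32m" to seconds.
parse_duration_seconds <- function(x) {
  # Assumption: a leading "d=", any "~" and any whitespace carry no value.
  cleaned <- gsub("^d=|[~[:space:]]", "", tolower(as.character(x)))

  # Accept only strings made entirely of <number><unit> pieces (h, m or s).
  valid <- !is.na(cleaned) & grepl("^([0-9]+\\.?[0-9]*[hms])+$", cleaned)

  # Number in front of the given unit letter, or 0 if that unit is absent.
  unit_value <- function(s, unit) {
    pattern <- paste0("([0-9]+\\.?[0-9]*)", unit)
    hits <- regmatches(s, regexec(pattern, s))
    vapply(hits, function(hit) {
      if (length(hit) < 2) 0 else as.numeric(hit[2])
    }, numeric(1))
  }

  seconds <- rep(NA_real_, length(cleaned))
  ok <- cleaned[valid]
  seconds[valid] <- 3600 * unit_value(ok, "h") +
    60 * unit_value(ok, "m") +
    unit_value(ok, "s")
  seconds
}

examples <- c("1.5m", "d=05m00s", "10m 24s", "6m30s", "1h18m00s", "~32m")
parse_duration_seconds(examples)
# 90 300 624 390 4680 1920
```

The sketch discards the information that "~32m" is approximate; a real cleanup would probably want to keep that in a separate column.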

A user on the subreddit r/rstats recently specifically asked how to clean up the duration variable after downloading the Old_Faithful_eruptions.tsv.gz data from geysertimes.org. See: https://old.reddit.com/r/rstats/comments/bolrbx/how_to_clean_duration_column_from_geysertimesorg/

To consider:

taltstidl commented 5 years ago

I wasn't aware that people were using our archives that much, so that's good news for me 🎉 Durations have historically been entered in plain text format, which is why they are greatly inconsistent in many ways. However, we have taken it upon ourselves to develop a parser that transforms the textual representation into a numerical value (in seconds). These parsed values are already widely used throughout the site (where they are called the standardized duration), and it's only a matter of adding them to the archives. I'll do my best to add these as soon as possible but cannot make any promises.

taltstidl commented 5 years ago

@spkaluzny I've added three new columns to the archive files (duration_seconds, duration_resolution and duration_modifier) that expose our parsed durations. These changes are undergoing an internal review now, but in the meantime you can download a sample here. If you have any feedback on the formatting or if there's anything else that might ease the usage of these new columns within this R package, please let us know. At this point in time we can still easily make changes since it's not live yet.

spkaluzny commented 5 years ago

@TR4Android I've updated my local version of the package to work with these three new duration-related fields. All three new columns are read as character. The duration_resolution and duration_modifier columns are expected to be character, but I think that users will expect duration_seconds to be numeric. It gets read as character because a missing (blank) value of duration gets encoded as NULL in duration_seconds.

The missing (blank) value for duration gets read as NA by R, and I think the duration_seconds value for a blank duration should also be missing (blank) so that it gets read as NA too. For duration_resolution and duration_modifier, I think your NULL should likewise be missing (blank) so that the values are read by R as NA. The character string "NULL" has no special meaning in R. I could convert the NULLs to NA in the package's data-reading code, but I think non-R users of the data would prefer the consistent use of missing (blank) instead of a mix of missing (blank) and NULL values for these columns.
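To make the reading problem concrete, here is a small base R sketch of the package-side workaround mentioned above. The inline sample is a made-up stand-in for the archive file: only the four column names come from this thread, and the non-missing values for duration_resolution and duration_modifier are placeholders, not real GeyserTimes codes. The package may well read the file with a different function; read.delim is used here only because it needs no extra dependencies.

```r
# Made-up two-row stand-in for the archive; values are placeholders only.
sample_tsv <- paste(
  "duration\tduration_seconds\tduration_resolution\tduration_modifier",
  "6m30s\t390\tplaceholder\tplaceholder",
  "\tNULL\tNULL\tNULL",
  sep = "\n"
)

# Default reading: the literal string "NULL" forces the whole
# duration_seconds column to character.
raw <- read.delim(text = sample_tsv, stringsAsFactors = FALSE)
class(raw$duration_seconds)

# Workaround: declare "NULL" (and blank) as missing-value markers, so
# duration_seconds is parsed as a number with NA for the missing row.
clean <- read.delim(
  text = sample_tsv,
  na.strings = c("", "NA", "NULL"),
  stringsAsFactors = FALSE
)
class(clean$duration_seconds)
is.na(clean$duration_seconds)
```

If the archive switches to blanks, the "NULL" entry in na.strings becomes unnecessary but stays harmless.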

I suspect that duration_seconds would be one of the most used variables in any analysis of the GeyserTimes data, so I wonder whether we want it to have such a long name. If starting from scratch, I would consider calling duration_seconds duration and the current duration something like duration_original (but the data has already been out for some time).

taltstidl commented 5 years ago

@spkaluzny Thanks for your feedback! I agree with your detailed assessment of the NULL situation and have changed all fields appropriately. Any NULL values should now be blanks for the three duration-related columns. Again, you can download a sample file here.

As to the naming: I fully agree with your sentiments. In fact, within GeyserTimes code I've mostly stuck to duration_text for the raw entered text and duration for the numeric seconds. However, changing the names in the public-facing archive files would cause existing users to suddenly read the numeric value instead of the expected text. This potential confusion is why I'm a bit reluctant, but any input is greatly appreciated. I know it's not the column name you'd be looking for and it is a bit long 😕

taltstidl commented 4 years ago

@spkaluzny Do you currently make use of the three duration-related fields, and if so, do you currently expect these to return (blank) for missing values? If so, I'd merge the internal change so it's available for the live backups.

Also, a short update on the naming: since we're going to be restructuring our database as part of our next release, we're now open to any naming changes. More specifically, we will likely rename duration to duration_text and duration_seconds to duration, among other changes. Obviously, this will require updates to this package, so we'll make sure to give everyone advance notice here so we can transition in a coordinated manner.
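One way the package could absorb such a rename is sketched below. It assumes the rename ends up exactly as described above (duration becomes duration_text and duration_seconds becomes duration), which is only stated as likely. The function name normalize_duration_columns is hypothetical and does not exist in the package.

```r
# Hypothetical helper: maps the current archive column names onto the
# proposed ones and leaves already-renamed data untouched.
normalize_duration_columns <- function(df) {
  cols <- names(df)
  if ("duration_seconds" %in% cols) {
    # Current layout: duration holds text, duration_seconds the number.
    # Rename the text column first so the two names never collide.
    cols[cols == "duration"] <- "duration_text"
    cols[cols == "duration_seconds"] <- "duration"
    names(df) <- cols
  }
  df
}

current_layout <- data.frame(
  duration = "6m30s",
  duration_seconds = 390,
  stringsAsFactors = FALSE
)
names(normalize_duration_columns(current_layout))
# "duration_text" "duration"
```

Keying the check on the presence of duration_seconds means the helper does nothing to files that already use the new names, so the same reading code could serve both layouts during a transition.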