Open spkaluzny opened 5 years ago
I wasn't aware that people were using our archives that much, so that's good news for me :tada: Durations have historically been entered in plain text format, which is why they are greatly inconsistent in many ways. However, we have taken it upon ourselves to develop a parser that transforms the textual representation into a numerical value (in seconds). These are already widely used throughout the site (called standardized duration) and it's only a matter of adding these to the archives. I'll do my best to add these as soon as possible but cannot make any promises.
@spkaluzny I've added three new columns to the archive files (`duration_seconds`, `duration_resolution` and `duration_modifier`) that expose our parsed durations. These changes are undergoing an internal review now, but in the meantime you can download a sample here. If you have any feedback on the formatting or if there's anything else that might ease the usage of these new columns within this R package, please let us know. At this point in time we can still easily make changes since it's not live yet.
@TR4Android I've updated my local version of the package to work with these 3 new `duration`-related fields. All three new columns are read as character. The `duration_resolution` and `duration_modifier` columns are expected to be character but I think that users will expect `duration_seconds` to be numeric. It gets read as character because a missing (blank) value of `duration` gets encoded as `NULL` in `duration_seconds`. The missing (blank) value for `duration` gets read as `NA` by R and I think the `duration_seconds` value for a blank `duration` should be missing (blank) so it also gets read as `NA` by R. For `duration_resolution` and `duration_modifier` I think your `NULL` should also be missing (blank) so that the values are read by R as `NA`. The character string `"NULL"` has no special meaning in R. I could convert the `NULL`s to `NA` in the package data reading code but I think non-R users of the data would prefer the consistent use of missing (blank) instead of a mix of missing (blank) and `NULL` values for these columns.
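For non-R users, the point about blanks versus `NULL` can be illustrated with a minimal Python sketch that uses only the standard library. The helper name `to_seconds` and the two-row sample are assumptions made for illustration; only the column names come from this thread.

```python
import csv
import io
from typing import Optional


def to_seconds(field: str) -> Optional[float]:
    """Treat a blank field (or a literal "NULL") as missing; otherwise parse a number."""
    value = field.strip()
    if value == "" or value == "NULL":
        return None
    return float(value)


# Hypothetical two-row sample; the second row has a blank duration.
sample = "duration\tduration_seconds\n6m30s\t390\n\t\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
seconds = [to_seconds(row["duration_seconds"]) for row in rows]
```

With blanks used consistently, the explicit `"NULL"` check becomes unnecessary and the column can be converted to numbers directly.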
I suspect that `duration_seconds` would be one of the most used variables in any analysis of the Geyser Times data. Just wondering if we want it to be such a long name. If starting from scratch, I would consider calling `duration_seconds` `duration` and the current `duration` something like `duration_original` (but the data has already been out for some time).
@spkaluzny Thanks for your feedback! I agree with your detailed assessment of the `NULL` situation and have changed all fields appropriately. Any `NULL` values should now be blanks for the three duration-related columns. Again, you can download a sample file here.
As to the naming: I fully agree with your sentiments, in fact I've mostly stuck to `duration_text` for the raw entered text and `duration` for the numeric seconds within GeyserTimes code. However, changing the name of the public-facing archive files would cause existing users to suddenly read the numeric value instead of the expected text. This potential confusion is why I'm a bit reluctant, but any input is greatly appreciated. I know it's not the column name you'd be looking for and it is a bit long 😕
@spkaluzny Do you currently make use of the three duration-related fields, and if so, do you currently expect these to return (blank) for missing values? If so, I'd merge the internal change so it's available for the live backups.
Also, a short update on the naming: Since we're going to be restructuring our database as part of our next release we're now open to any naming changes. More specifically, we will likely rename `duration` to `duration_text` and `duration_seconds` to `duration`, among other changes. Obviously, this will require updates to this package, so we'll make sure to give everyone advance notice here so we can transition in a coordinated manner.
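One way downstream code could tolerate such a rename is to normalize each row to the proposed names before use. The following is a minimal sketch under that assumption: the function name `normalize_duration_columns` is hypothetical, and it detects the current layout purely by the presence of a `duration_seconds` column.

```python
from typing import Dict


def normalize_duration_columns(row: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of row that always uses the proposed new column names."""
    out = dict(row)
    if "duration_seconds" in out:
        # Current layout: `duration` holds the raw text, `duration_seconds` the number.
        out["duration_text"] = out.pop("duration", "")
        out["duration"] = out.pop("duration_seconds")
    return out


old_row = {"duration": "6m30s", "duration_seconds": "390"}
new_row = {"duration_text": "6m30s", "duration": "390"}
```

Both `old_row` and `new_row` normalize to the same dictionary, so analysis code would only need to know the new names.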
I think we should provide a clean version of the `duration` column in the main eruption data set. I think that users will want `duration` to be some form of a numeric column for analysis. It is currently a character column with widely varying formats, e.g. `1.5m`, `d=05m00s`, `10m 24s`, `6m30s`, `1h18m00s` and `~32m`.
A user on the subreddit r/rstats recently specifically asked how to clean up the `duration` variable after downloading the `Old_Faithful_eruptions.tsv.gz` data from geysertimes.org. See: https://old.reddit.com/r/rstats/comments/bolrbx/how_to_clean_duration_column_from_geysertimesorg/
To consider:
Keep the original `duration` column, perhaps calling it `duration_orig`, and add a new numeric `duration` column.
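To make the cleaning task concrete, here is a minimal Python sketch that converts the example formats listed above into seconds. It is not the GeyserTimes parser: the function name `parse_duration` and the regular expression are assumptions for illustration, it only covers hour/minute/second patterns like those quoted in this issue, and it simply discards the `d=` prefix and the `~` marker.

```python
import re
from typing import Optional

# One optional number per unit, in h/m/s order; decimals allowed (e.g. "1.5m").
_DURATION_RE = re.compile(
    r"^(?:(?P<h>\d+(?:\.\d+)?)h)?\s*"
    r"(?:(?P<m>\d+(?:\.\d+)?)m)?\s*"
    r"(?:(?P<s>\d+(?:\.\d+)?)s)?$"
)


def parse_duration(text: str) -> Optional[float]:
    """Return the duration in seconds, or None if the text is not understood."""
    cleaned = text.strip().lower()
    # Drop a leading "d=" label and a leading "~" (approximate) marker.
    if cleaned.startswith("d="):
        cleaned = cleaned[2:]
    cleaned = cleaned.lstrip("~").strip()
    if not cleaned:
        return None
    match = _DURATION_RE.match(cleaned)
    if match is None:
        return None
    hours = float(match.group("h") or 0)
    minutes = float(match.group("m") or 0)
    seconds = float(match.group("s") or 0)
    return hours * 3600 + minutes * 60 + seconds
```

Returning `None` for anything unrecognized keeps unparsed entries visible as missing values instead of silently turning them into zero.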