IMCR-Hackathon / Hackathon-Central-2018

Command center for IMCR Hackathon participants to share ideas, coordinate teams, develop projects and access all logistics information
3 stars 0 forks source link

Ingest data w/multiple timestamp formats; detect; QA/QC; store as ISO #10

Open jamesrco opened 6 years ago

jamesrco commented 6 years ago

I often find I'm given multiple, timestamped datasets that belong to the same or complementary projects. Future users of these data (including me) often want to compare/evaluate these data side-by-side since they're related, but issues with the timestamps sometimes make this a huge pain. I end up spending a large amount of time manually reformatting timestamps or cajoling them into the "right" format. I've found a lot of the existing tools in R, Python, and MATLAB which purport to assist with these kinds of tasks to be lacking.

The data I receive often have one or more of the following features:

  1. Timestamps are in different formats completely
  2. Some times are in GMT; others are local to a particular time zone (and often, this time zone is not even explicitly specified)
  3. Missing timestamps, or timestamps simply missing one component of the timestamp
  4. Data messed up by Microsoft Excel in which points collected "on the hour" produce entries that omit minutes and seconds

I propose developing a set of tools in R and/or Python to assist in ingesting/formatting, and storing timestamps in a common format. Could make use of some existing functionality in existing R or Python packages; our product could even been a modification/contribution to the existing package code.

jpshanno commented 6 years ago

Some of the timestamp issues are taken care of lubridate in R. I think the function is parse_datetime(), it lets you specify multiple timestamp formats and set how many levels of detail can be truncated before throwing an error.

jamesrco commented 6 years ago

Thanks @jpshanno -- I have played around with lubridate quite a bit but will give it another look! I've found the primary issue is often that information such as the correct time zone/offset is easily lost when different packages or functions are used for interconversion of POSIXlt and POSIXct objects. Unfortunately, I've found the use of multiple functions/packages necessary since they don't all play nice with ISO 8601... something I've found surprising.

jpshanno commented 6 years ago

Yeah, multiple packages can definitely wreak some havoc, maybe we can find a solution this week.

If I remember right the functionality was a little buried, but I have some lubridate code that uses the multiple formats that I can share with you.