DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

query method with as = "data.frame" converts to datetimes then back to character #250

Closed amoeba closed 3 years ago

amoeba commented 4 years ago

@jeanetteclark noticed while doing some querying that query seems to cast Solr datetime fields to POSIXct but then re-cast to character when we specify as = "data.frame". This actually works mostly fine, despite introducing the need to re-parse those fields in a locale-dependent way.

But some objects have time portions of T00:00:00.000Z, and you end up with a mix of strings like 2016-11-10 and 2020-06-03 16:00:00 UTC, which makes downstream processing in R trickier for users.

See:

> class(
  query(cn, 
        list(q=paste0('id:"arctic-data.9686.1"'), fl = "id,dateUploaded"), 
        raw = TRUE)[[1]]$dateUploaded)
[1] "POSIXct" "POSIXt" 

raw = TRUE maintains the type.

and

> class(
  query(cn, 
        list(q=paste0('id:"arctic-data.9686.1"'), fl = "id,dateUploaded"), 
        as = "data.frame")[1,"dateUploaded"])
[1] "character"

as = "data.frame" converts the POSIXcts to characters.
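To illustrate why the character round trip is awkward, here is a minimal base-R sketch. It is not the rdataone code path; it only assumes that each value is formatted to text on its own, and it reuses the two dates quoted above as sample values.

```r
# Sample timestamps in Solr's ISO 8601 form (values borrowed from the issue text).
solr_fmt  <- "%Y-%m-%dT%H:%M:%SZ"
midnight  <- as.POSIXct("2016-11-10T00:00:00Z", format = solr_fmt, tz = "UTC")
afternoon <- as.POSIXct("2020-06-03T16:00:00Z", format = solr_fmt, tz = "UTC")

# With the default format, a value that falls exactly on midnight loses its
# time portion, so the resulting strings no longer share one shape.
as_text <- c(format(midnight), format(afternoon))
as_text[1]  # "2016-11-10"

# An explicit format string keeps every value in the same shape, which
# makes re-parsing independent of locale and of the time of day.
uniform  <- format(c(midnight, afternoon), solr_fmt, tz = "UTC")
reparsed <- as.POSIXct(uniform, format = solr_fmt, tz = "UTC")
```

Keeping the column as POSIXct avoids the need for any of this re-parsing in the first place.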

I think the culprit code is:

https://github.com/DataONEorg/rdataone/blob/master/R/D1Node.R#L863

I think we should:

What do you think @gothub? Others?

gothub commented 4 years ago

@amoeba - yes, that looks like the correct approach.

gothub commented 4 years ago

@amoeba @jeanetteclark fyi - I'm fairly close to having a new rdataone release, but will have to wait until the dev scheduling meeting later today before I can say when the release can be sent to CRAN. I'll update this issue after that meeting

amoeba commented 4 years ago

Sounds good. Do you wanna tackle this?

gothub commented 4 years ago

Sure - you already figured out the problem and the fix, so I'll apply that and include it in the next release.

amoeba commented 4 years ago

Excellent, thanks.

jeanetteclark commented 4 years ago

Thanks both of you! Very excited about the list columns, in particular

gothub commented 4 years ago

An interim fix was checked in as commit 673c5d652dbca659b3938069047a48be6012650f. This version properly retains the R data type for data.frame cells that have only a single value. For cells with multiple values, the values are concatenated into a comma-separated list.

The final fix will populate data.frame cells with multiple values as a list of the proper R type, as originally requested in this ticket. This will take a bit more refactoring.

gothub commented 4 years ago

As requested by @jeanetteclark, the delimiter used for multi-value cells in the data.frame created by `query(..., as="data.frame")` has been updated to "|". Again, this is just a temporary workaround and will be replaced by list-columns when I get that working.
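While the "|" delimiter is in place, a multi-value cell can be split back into its parts with base R. This is only a sketch: the data.frame below is built by hand rather than returned by `query()`, and the column name `origin` and its values are made up for illustration.

```r
# Hand-built stand-in for a query result with a "|"-delimited cell
# (hypothetical column name and values).
res <- data.frame(
  id     = c("doc-1", "doc-2"),
  origin = c("A. Author|B. Author", "C. Author"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "|" as a literal character, not as regex alternation.
parts <- strsplit(res$origin, "|", fixed = TRUE)

# parts is a list with one character vector per row.
lengths(parts)
```

Without `fixed = TRUE`, `strsplit()` interprets "|" as a regular expression and splits the string into single characters.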

gothub commented 3 years ago

Fixed in commit 4323543224a8999e78aa9cc4f7cfa4c243fe4032

This fix modifies the format of the data.frame that is created from the query result, so that multi-valued fields from the Solr result are inserted into a single data.frame element as a list, as @amoeba suggested. Please have a look at the code and, if you see a different way to achieve this result, please pass it along. Any other R object/structure would cause the data.frame constructor to spread the values vertically (i.e. vectors are columns...).

https://github.com/DataONEorg/rdataone/blob/develop/R/D1Node.R#L864
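For readers unfamiliar with list-columns, the following base-R sketch shows the general idea. It is not the code from the commit above; the field names and values are hypothetical stand-ins for a Solr result.

```r
# Passing a length-2 vector to data.frame() spreads it over two rows,
# recycling the single id value.
spread <- data.frame(id = "doc-1", origin = c("A. Author", "B. Author"),
                     stringsAsFactors = FALSE)
nrow(spread)  # 2

# A list-column instead holds one list element per row, so a multi-valued
# field stays inside a single cell.
df <- data.frame(id = c("doc-1", "doc-2"), stringsAsFactors = FALSE)
df$origin <- list(c("A. Author", "B. Author"), "C. Author")

# Single-valued fields can stay ordinary typed columns, e.g. POSIXct.
df$dateUploaded <- as.POSIXct(c("2016-11-10 00:00:00", "2020-06-03 16:00:00"),
                              tz = "UTC")

df$origin[[1]]          # both values for the first row
class(df$dateUploaded)  # "POSIXct" "POSIXt"
```

Individual cells are then read with `[[`, for example `df$origin[[1]]`, and each cell keeps its R type.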

amoeba commented 3 years ago

That looks like how I've done it in the past. Thanks for making the change, this should be really helpful.