query method with as = "data.frame" converts to datetimes then back to character

amoeba commented 4 years ago

@jeanetteclark noticed while doing some querying that query seems to cast Solr datetime fields to POSIXct but then re-cast to character when we specify as = "data.frame". This actually works mostly fine, despite introducing the need to re-parse those fields in a locale-dependent way.

But some objects have time portions of T00:00:00.000Z, so you end up with a mix of strings like 2016-11-10 (no time portion) and 2020-06-03 16:00:00 UTC, which makes downstream processing in R trickier for users.

See:

> class(
  query(cn, 
        list(q=paste0('id:"arctic-data.9686.1"'), fl = "id,dateUploaded"), 
        raw = TRUE)[[1]]$dateUploaded)
[1] "POSIXct" "POSIXt"

raw = TRUE maintains the type.

and

> class(
  query(cn, 
        list(q=paste0('id:"arctic-data.9686.1"'), fl = "id,dateUploaded"), 
        as = "data.frame")[1,"dateUploaded"])
[1] "character"

as = "data.frame" converts the POSIXcts to characters.
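To illustrate why the round trip through character is awkward, here is a minimal sketch in base R. It does not call `query`; the two timestamps are taken from the example strings above, and the per-value formatting is only an assumption about how the mixed strings could arise, not a description of what D1Node.R actually does.

```r
# Two UTC timestamps, one exactly at midnight (as in T00:00:00.000Z).
x <- as.POSIXct(c("2016-11-10 00:00:00", "2020-06-03 16:00:00"), tz = "UTC")

# Formatting each value on its own, with no explicit format, drops the
# time portion for the midnight value, so the strings differ in shape.
per_value <- vapply(seq_along(x),
                    function(i) format(x[i], usetz = TRUE),
                    character(1))
print(per_value)

# An explicit format keeps every value in the same shape, so parsing the
# strings back does not depend on guessing which shape each one has.
iso <- format(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
parsed <- as.POSIXct(iso, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
print(iso)
print(all(parsed == x))
```

Keeping the column as POSIXct avoids the re-parsing step entirely, which is the point of this issue.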

I think the culprit code is:

https://github.com/DataONEorg/rdataone/blob/master/R/D1Node.R#L863

I think we should:

- Definitely do the cast to POSIXct, as that's more useful to users
- Not convert all fields to character, if that's what's truly going on
- Probably choose another way to concatenate multi-valued fields. We use a space (`" "`) right now, which makes the `origin` field, for example, hard to work with. E.g., you get cell values like "Bryce Mecum Jeanette Clark" and you have to figure out how to parse that. I think list columns might be good to use here:
```
> x <- data.frame(id = 1,origin=I(list(c("Bryce Mecum", "Jeanette Clark"))))
> x$origin
[[1]]
[1] "Bryce Mecum"    "Jeanette Clark"
```
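As a rough sketch of what a list column would buy users downstream, the snippet below extends the toy data frame above with a second row ("Example Author" is a made-up value) and uses only base R. It is an illustration of working with list columns in general, not of the eventual `query` output.

```r
# Toy data: the row from above plus a made-up single-creator row.
x <- data.frame(id = 1:2,
                origin = I(list(c("Bryce Mecum", "Jeanette Clark"),
                                "Example Author")))

# Number of values per row, with no string parsing needed.
n_origins <- lengths(x$origin)

# Rows whose origin includes a given name.
has_jc <- vapply(x$origin, function(o) "Jeanette Clark" %in% o, logical(1))

# Expand to one row per (id, origin) pair.
long <- data.frame(id = rep(x$id, n_origins),
                   origin = unlist(x$origin),
                   stringsAsFactors = FALSE)
print(long)
```

With a space-delimited string, none of these operations is reliable, because names themselves contain spaces.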

What do you think @gothub? Others?

gothub commented 4 years ago

@amoeba - yes, that looks like the correct approach.

gothub commented 4 years ago

@amoeba @jeanetteclark fyi - I'm fairly close to having a new rdataone release, but will have to wait until the dev scheduling meeting later today before I can say when the release can be sent to CRAN. I'll update this issue after that meeting

amoeba commented 4 years ago

Sounds good. Do you wanna tackle this?

gothub commented 4 years ago

Sure - you already figured out the problem and the fix, so I'll apply that and include it in the next release.

amoeba commented 4 years ago

Excellent, thanks.

jeanetteclark commented 4 years ago

Thanks both of you! Very excited about the list columns, in particular

gothub commented 4 years ago

An interim fix was checked in as 673c5d652dbca659b3938069047a48be6012650f. This version properly retains the R data type for data.frame cells that have only a single value. For cells with multiple values, the values are returned as a single concatenated string, separated by commas.

The final fix will populate data.frame cells with multiple values as a list of the proper R type, as originally requested in this ticket. This will take a bit more refactoring.

gothub commented 4 years ago

As requested by @jeanetteclark, the delimiter used for multi-value cells in the data.frame created by `query(..., as="data.frame")` has been updated to "|". Again, this is just a temporary workaround and will be replaced by list-columns when I get that working.
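Until list-columns land, one way to recover the individual values from a "|"-delimited cell is sketched below. The data frame is a hand-built stand-in for a query result (the ids and "Example Author" are made up), so the column names are assumptions rather than guaranteed output of `query`.

```r
# Hand-built stand-in for a "|"-delimited result; values are illustrative.
df <- data.frame(id = c("doc-1", "doc-2"),
                 origin = c("Bryce Mecum|Jeanette Clark", "Example Author"),
                 stringsAsFactors = FALSE)

# "|" is a regex metacharacter, so split on it literally with fixed = TRUE.
df$origin_list <- strsplit(df$origin, "|", fixed = TRUE)

print(lengths(df$origin_list))
print(df$origin_list[[1]])
```

Without `fixed = TRUE`, `strsplit` treats "|" as a regular expression and does not split on the literal character.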

gothub commented 3 years ago

Fixed in commit 4323543224a8999e78aa9cc4f7cfa4c243fe4032

This fix modifies the format of the data.frame that is created from the query result, such that multi-valued fields from the Solr result are inserted into a single data.frame element as a list, as @amoeba suggested. Please have a look at the code, and if you see a different way to achieve this result, please pass it along. Any other R object/structure would cause the data.frame constructor to spread the values vertically (i.e. vectors are columns...).

https://github.com/DataONEorg/rdataone/blob/develop/R/D1Node.R#L864
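For readers who want to see the general technique without opening D1Node.R, here is a minimal sketch of building a data.frame with list columns from parsed records. `records` and `records_to_df` are hypothetical names invented for this example, and the code is not the rdataone implementation; it only assumes that each record is a named list in which some fields may hold several values.

```r
# Hypothetical parsed records; values are illustrative.
records <- list(
  list(id = "doc-1",
       dateUploaded = as.POSIXct("2016-11-10 00:00:00", tz = "UTC"),
       origin = c("Bryce Mecum", "Jeanette Clark")),
  list(id = "doc-2",
       dateUploaded = as.POSIXct("2020-06-03 16:00:00", tz = "UTC"),
       origin = "Example Author")
)

# Hypothetical helper: one column per field, list columns only where needed.
records_to_df <- function(records) {
  fields <- unique(unlist(lapply(records, names)))
  cols <- lapply(fields, function(f) {
    values <- lapply(records, function(r) r[[f]])
    if (all(lengths(values) == 1L)) {
      # Single-valued in every record: combine into a typed vector,
      # so POSIXct stays POSIXct and character stays character.
      do.call(c, values)
    } else {
      # At least one multi-valued cell: keep the values as a list column.
      I(values)
    }
  })
  names(cols) <- fields
  do.call(data.frame, c(cols, list(stringsAsFactors = FALSE)))
}

df <- records_to_df(records)
str(df)

# Without I(list(...)), a vector is recycled into extra rows instead.
spread <- data.frame(id = 1, origin = c("Bryce Mecum", "Jeanette Clark"))
print(nrow(spread))
```

Wrapping the list in `I()` is what stops the data.frame constructor from spreading the values across rows, which matches the behaviour described above.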

amoeba commented 3 years ago

That looks like how I've done it in the past. Thanks for making the change, this should be really helpful.

DataONEorg / rdataone

query method with as = "data.frame" converts to datetimes then back to character #250