@amoeba - yes, that looks like the correct approach.
@amoeba @jeanetteclark fyi - I'm fairly close to having a new rdataone release, but will have to wait until the dev scheduling meeting later today before I can say when the release can be sent to CRAN. I'll update this issue after that meeting
Sounds good. Do you wanna tackle this?
Sure - you already figured out the problem and the fix, so I'll apply that and include it in the next release.
Excellent, thanks.
Thanks both of you! Very excited about the list columns, in particular
An interim fix was checked into 673c5d652dbca659b3938069047a48be6012650f. This version properly retains the R datatype for data.frame cells that have only a single value. For cells with multiple values, the values are returned as a single string, concatenated with commas.
The final fix will populate data.frame cells with multiple values as a list of the proper R type, as originally requested in this ticket. This will take a bit more refactoring.
As requested by @jeanetteclark, the delimiter used for multi-value cells in the data.frame created by `query(..., as="data.frame")` has been updated to "|". Again, this is just a temporary workaround and will be replaced by list-columns when I get that working.
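For anyone relying on the interim format, a minimal sketch of splitting the "|"-delimited cells back into separate values might look like the following. The data.frame here is a made-up stand-in, not actual `query` output, and the column names are assumptions.

```r
# Hypothetical stand-in for a result of query(..., as = "data.frame");
# the column names and values are invented for illustration only.
result <- data.frame(
  identifier = c("doc.1", "doc.2"),
  origin = c("Bryce Mecum|Jeanette Clark", "Jeanette Clark"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "|" as a literal character rather than regex alternation.
origins <- strsplit(result$origin, "|", fixed = TRUE)

lengths(origins)  # number of values recovered per cell
origins[[1]]      # values for the first row
```

Without `fixed = TRUE`, `"|"` is interpreted as a regular expression and the strings get split into individual characters, which is an easy mistake to make with this delimiter.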
Fixed in commit 4323543224a8999e78aa9cc4f7cfa4c243fe4032
This fix modifies the format of the data.frame that is created from the query result, such that multi-valued fields from the Solr result are inserted into a single data.frame element as a list, as @amoeba suggested. Please have a look at the code and, if you see a different way to achieve this result, pass it along. Any other R object/structure would cause the data.frame constructor to spread the values vertically (i.e. vectors are columns...).
https://github.com/DataONEorg/rdataone/blob/develop/R/D1Node.R#L864
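To illustrate the point about the constructor, here is a small base R sketch. It is not the code in D1Node.R, and the field names and values are invented. It compares passing a multi-valued field directly with wrapping it in a list.

```r
# Hypothetical multi-valued field: two values belonging to one Solr document.
origin_values <- c("Bryce Mecum", "Jeanette Clark")

# Passing the vector directly makes the constructor spread it vertically,
# recycling the identifier so there is one row per value.
spread <- data.frame(identifier = "doc.1", origin = origin_values,
                     stringsAsFactors = FALSE)
nrow(spread)  # 2

# Wrapping the vector in a list, protected with I(), keeps it in a single cell.
nested <- data.frame(identifier = "doc.1", origin = I(list(origin_values)),
                     stringsAsFactors = FALSE)
nrow(nested)        # 1
nested$origin[[1]]  # both values, still a character vector
```

The `I()` wrapper is one way to stop `data.frame()` from expanding the list into separate columns; assigning the list to a column after construction would be another option.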
That looks like how I've done it in the past. Thanks for making the change, this should be really helpful.
@jeanetteclark noticed while doing some querying that `query` seems to cast Solr datetime fields to `POSIXct` but then re-cast to `character` when we specify `as = "data.frame"`. This actually works mostly fine, despite introducing the need to re-parse those fields in a locale-dependent way. But some objects have time portions of `T00:00:00.000Z` and you end up with strings like `2016-11-10` and `2020-06-03 16:00:00 UTC`, which makes downstream processing in R trickier for users.

See: `raw = TRUE` maintains the type, and `as = "data.frame"` converts the `POSIXct`s to `character`s.

I think the culprit code is:
https://github.com/DataONEorg/rdataone/blob/master/R/D1Node.R#L863
I think we should:

- Return `POSIXct`, as that's more useful to users
- Stop casting to `character`, if that's what's truly going on
- Probably choose another way to concatenate multi-valued fields. We use a single space right now, which makes the `origin` field, for example, hard to work with. E.g., you get cell values like "Bryce Mecum Jeanette Clark" and you have to figure out how to parse that. I think list columns might be good to use here.

What do you think @gothub? Others?
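To make the datetime concern concrete, here is a small sketch using base R only. It does not call `query` itself; the timestamps are invented and the `dateUploaded` column name is used purely for illustration. It parses Solr-style ISO 8601 strings to `POSIXct`, then shows how default character formatting drops a midnight time portion.

```r
# Hypothetical Solr-style timestamps; values invented for illustration.
solr_dates <- c("2016-11-10T00:00:00.000Z", "2020-06-03T16:00:00.000Z")

# Parse to POSIXct in UTC; on input, %OS accepts fractional seconds.
parsed <- as.POSIXct(solr_dates, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# A data.frame column can hold POSIXct directly, so no cast to character is needed.
df <- data.frame(dateUploaded = parsed)
class(df$dateUploaded)

# Formatting each value on its own drops the time when it is exactly midnight,
# so the resulting strings no longer share one shape.
midnight_only <- format(parsed[1])  # "2016-11-10"
with_time <- format(parsed[2])      # "2020-06-03 16:00:00"

# An explicit format keeps every value in the same shape.
iso <- format(parsed, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
```

Keeping the column as `POSIXct` sidesteps the inconsistent strings entirely; the explicit format is only needed if a character representation is required downstream.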