eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library
116 stars 13 forks source link

Merge deserialize branch #21

Closed eddelbuettel closed 4 years ago

eddelbuettel commented 4 years ago

Assuming there is nothing (major) left to do, shall we merge this?

It's an almost clean commit line apart from the one interim merge and then continued commits on the branch (you can sort-of see that in the middle )

image

Anyway, we likely have bigger fish to fry (and I could insist squash-merging/rebasing but I think I won't). But it might be a good time to carry this over to the main branch...

Thoughts, gentlemen?

knapply commented 4 years ago

@eddelbuettel I don't have an issue with however you want the commits to happen. I'm trying to minimize any hand-holding, but your continued patience is appreciated.

Short answer to your question: let's merge!

From where I stand, coming to a consensus on the user-facing R API should be the next priority.

eddelbuettel commented 4 years ago

Agreed! Thiis is a really nice (big) step forward from "oh look Dirk connected to simdjson, but we can't actually do anything with it" :)

We'll figure the rest out as we go along. By data.frame column do you mean list columns as in data.table and tibble (and base R, but without a pretty printer there) ?

knapply commented 4 years ago

For what it's worth, anyone in R land questioning your impact on the ecosystem hasn't actually looked into it ;)

Regarding data frames, I think you're more getting at something I've been playing with mentally, but haven't been able to fully articulate. I'll eventually have some stuff to show rather than try and tell.

What I'm getting at above is how columns are currently diagnosed (name, position, simdjson type, ultimate R type):

https://github.com/eddelbuettel/rcppsimdjson/blob/1c7ef149e3101e25ca79b365219d039f4f0ac215/inst/include/RcppSimdJson/deserialize/dataframe.hpp#L11-L19

The initial improvement is to use std::unordered_map instead of std::map, especially since Column already tracks the indices separately anyways 🤦‍♂️. I think this is the result of using std::map in the early prototype, then only being reminded that the data frame columns should follow the order in which object keys are encountered after running round-trip tests on all the data frames in {datasets}.

But.... there's another issue I only realized over the weekend: the JSON spec doesn't actually require unique object keys.

"The names within an object SHOULD be unique."

I had wrongly 🤦‍♂️🤦‍♂️ assumed that duplicate keys are techinically illegal, so we'd end up handling that after we feel good about valid JSON (if simdjson even supported it), but it is legal so simdjson does.

In my experience 1) duplicate keys are not common, 2) have always been a confirmable red flag that something else is wrong with the data, and 3) are not likely part of an object inside an array of only objects (meaning they're not data-frame-able and would never pass through this code anyway). I feel all of those go for R data frames will duplicate column names as well. But that's me, and unfortunately that's all totally anecdotal.

So a few questions need to be answered, which will require some experiments and checking if there's an informal standard among current R packages (CC @dcooley):

  1. How do edge cases get handled?
    • If one "row" has 2 of the same key, I guess the resulting data frame should have 2 identically named columns.
      • What happens when the next "row" only has one of that key?
        • Does it just go in the first of those identically named columns?
        • If so, why?
  2. How do you track maybe-duplicate keys (that may have completely different types) while also keeping track of the order in which those keys are found without unreasonable impacts on yuge file performance?
    • After doing more research (this is by far the most time I've spent w/ C++, so the learning curve is still near-vertical), I'm suspecting that the ideal solution will be rather custom.

All of that said, I don't think supporting duplicate column names should be prioritized nor should they delay this, but it should be addressed eventually.

Funny enough, this taught me that seemingly every Python JSON module is wrong if it uses a dictionary. As far as I know, that's all of them, including the standard library's.

>>> import json
>>> json.loads('{"a":1,"a":2}')
{'a': 2}
knapply commented 4 years ago

Since it turned out the duplicate key behavior actually follows the norm of other packages, I swapped out the std::maps for std::unordered_map (and removed a vestigial variable in the same function), so everything I said above is no longer relevant.

We're still waiting for a thumbs up from Dave, but if another commit can be stomached the final touch is available. Otherwise, it'll be on standby: https://github.com/knapply/rcppsimdjson/tree/feature/deserialize

dcooley commented 4 years ago

thumbs up from me!

eddelbuettel commented 4 years ago

Ok, will merge. Any objections to merging as a squash-and-merge because 33 is a little on the large side?

dcooley commented 4 years ago

as you see fit.