Closed eddelbuettel closed 4 years ago
We may have to condition the #pragma away. Building at the (official) win-builder.r-project.org I see a few of these:

../inst/include/RcppSimdJson/deserialize/../../simdjson.h:115:48: note: #pragma message: The simdjson library is designed for 64-bit processors and it seems that you are not compiling for a known 64-bit platform. All fast kernels will be disabled and performance may be poor. Please use a 64-bit target such as x64 or 64-bit ARM.

and while I wholeheartedly agree with the message, the CRAN Repo Policy may have something against it. Not sure, maybe it can stay.
We may have to condition the #pragma away.
Is that doable from our side?
Everything is always just an #ifdef away...
if you have anything 'pending' and possibly working by end of week
I need to sort out the user-facing functions. None of the additions are accessible without RcppSimdJson:::.
Here's the prototype that I had put together: https://github.com/eddelbuettel/rcppsimdjson/issues/10#issuecomment-647046190
fparse() was just a placeholder, but I'm now leaning toward that.
@dcooley, I was hoping you'd chime in. The original discussion was about from_json(), but I assumed you were referring to the functionality rather than the function's name. Did I understand that correctly?
Two other questions:

Do we want separate functions at the user-facing level for JSON that's inside an R character() versus JSON files (including files sitting at a URL)?

- fparse() (or whatever we decide to call it) parses character()s.
  - simdjson::dom::parser::parse()
  - fparse_many() would be the multi-JSON (ndjson, jsonl, etc.) equivalent that we'll tackle later: simdjson::dom::parser::parse_many()
  - .deserialize_json() under the hood
- fload() (or whatever we decide to call it) loads and parses files (including from URLs).
  - simdjson::dom::parser::load()
  - fload_many() would be the multi-JSON equivalent that we tackle later: simdjson::dom::parser::load_many()
  - .load_json() under the hood

I'm leaning toward separation as I'd like both functions to be "vectorized", and it's just simpler for the input to be either all files or all strings containing JSON (both to robustly code and for users to understand).
Can I drop GZstream directly into the package to handle .gz files?

I originally saw this in {ndjson} and have since used it everywhere that I might touch compressed files. It prevents the need to inflate gz files in a separate step by essentially swapping out std::ifstream for igzstream.
It makes handling .gz files both easier and faster; simdjson::dom::parser::load() will take care of us for non-compressed files as is.
Regarding separate fparse and fload: I like that plan.
What does GZstream add that plain gzopen() (from zlib.h) doesn't already offer? Do you need to seek backwards and forwards? I already used gzopen() back in the 1990s to open gzip-compressed files as if they were normal files. Still works the same way...
Oh I see. Stream use. Hm. We could include it; I tend to prefer to keep dependencies minimal. Can what you do be done with a FILE*?
Also, it's tiny. We could arguably just take it and create RcppGzStream out of it, making it header-only in the process. But it doesn't even seek. Maybe gzopen()?
What does GZstream add that plain gzopen() (from zlib.h) doesn't already offer?
It's easy? Frankly, it's because I know how to use it and I know it works without much fuss... but that may just be because I don't know any better.
The main reason I've found it nice is that it plays along with std::getline().
This is a bit contrived, but it reflects something I deal with all the time... (the package isn't important, it's just that I already had stuff crammed in this folder as a playground)
// [[Rcpp::plugins(cpp17)]]
// [[Rcpp::depends(tweetio)]]
#include <Rcpp.h>
#include <tweetio.hpp>
#include <tweetio/third-party/simdjson/simdjson.cpp>
#include <tweetio/third-party/gzstream/gzstream.cpp>

#include <filesystem>  // std::filesystem::path
#include <fstream>     // std::ifstream
#include <string>
#include <string_view>
#include <vector>

template <typename stream_T> // just pick the stream
auto keep_valid_json_template(const std::string_view file_path) -> std::vector<std::string> {
    stream_T in_file;
    in_file.open(std::string(file_path).c_str()); // only difference? gzstream wants `const char *`

    auto out = std::vector<std::string>();
    simdjson::dom::parser parser;

    std::string line;
    while (std::getline(in_file, line)) {
        if (auto [_, error] = parser.parse(std::string_view(line)); !error) {
            out.emplace_back(std::move(line));
        }
    }
    in_file.close();

    return out;
}

auto is_gz(const std::string& file_path) -> bool {
    return std::filesystem::path(file_path).extension() == ".gz";
}

// [[Rcpp::export(.keep_valid_json)]]
SEXP keep_valid_json(const std::string& file_path) {
    const auto out = is_gz(file_path) // easy abstraction over both gz/non-gz files
        ? keep_valid_json_template<igzstream>(file_path)
        : keep_valid_json_template<std::ifstream>(file_path);
    return Rcpp::CharacterVector(std::begin(out), std::end(out));
}
bigish_jsonl <- Sys.glob("~/ufc-tweet-stream.json")
bigish_jsonl.gz <- Sys.glob("~/ufc-tweet-stream.json.gz")
scales::number_bytes(file.size(bigish_jsonl))
#> [1] "1 GiB"
scales::number_bytes(file.size(bigish_jsonl.gz))
#> [1] "122 MiB"
res1 <- .keep_valid_json(bigish_jsonl)
res2 <- .keep_valid_json(bigish_jsonl.gz)
identical(res1, res2)
#> [1] TRUE
scales::comma(length(res1))
#> [1] "140,946"
pryr::object_size(res1)
#> 901 MB
everything_kept <- readLines(bigish_jsonl, warn = FALSE)
pryr::object_size(everything_kept)
#> 1.08 GB
scales::comma(length(everything_kept))
#> [1] "559,067"
The point is that it's just easy to handle regular and gzipped files with minimal code duplication.
I'm thinking specifically about coping with .ndjson/.jsonl files that contain junk, which happens all the time with truly large files; simdjson::dom::parser::load_many() will (rightfully) throw a fit.
By using it now for "vanilla" JSON, we're already set up to use it for NDJSON/JSONL/etc.
Buuuut..... maybe there's an easy way to do this with zlib.h? I have no idea.
I'm game to set up an RcppGzStream package, but it's only two files so dropping it in here directly is simple (and yes, no seek).
It is easy. See e.g. here for how I handled it in RcppCNPy when adding gz support to what was already there -- a little easier there as I can pass the file pointer around, but I'll try to cook up an example for you. Because your approach is line-oriented, we should be able to do it with just the C library semantics.
This is embarrassingly basic but worked. I took github_events.json, copied it to /tmp and called gzip on it. The following file, once Rcpp::sourceCpp("demo.cpp")-ed, slurps it and shows it line by line. No compiler or linker switches needed as we get zlib from R. It needs the header to compile though (package zlib1g-dev on Ubuntu); you likely have it already.
One shortcoming is the fixed-size stack buffer: it would croak on longer lines.
#include <Rcpp.h>
#include <zlib.h>

// [[Rcpp::export]]
void simpleDemo(std::string & fname) {
    gzFile fp = gzopen(fname.c_str(), "rb");
    if (!fp) {
        Rcpp::stop("simpleDemo: Error! Unable to open file %s!\n", fname.c_str());
    }
    char line[1024];
    while (gzgets(fp, line, sizeof(line)) != NULL) { // same as fgets()
        std::string s(line);
        Rcpp::Rcout << "Seeing: " << s;
    }
    gzclose(fp);
}
/*** R
simpleDemo("/tmp/github_events.json.gz") # smaller example file, gzip'ed
*/
@knapply
but I assumed you were referring to the functionality rather than the function's name. Did I understand that correctly?
Correct.
@dcooley if you have anything 'pending' and possibly working by end of week, have at it here; otherwise I'll just progress towards an upload.
Nothing more from me.
Great. End of week looks pretty viable then.
I think we're ready. I put up this ad-hoc script and tossed 'em all at RHub:
#!/usr/bin/env Rscript
library(rhub)
selected <- c("debian-clang-devel",
              "debian-gcc-devel",
              "fedora-clang-devel",
              "fedora-gcc-devel",
              #"ubuntu-gcc-devel",  ## uses 16.04 and compiler groans as not C++17 ready
              "macos-highsierra-release",
              "macos-highsierra-release-cran",
              "windows-x86_64-devel",
              "windows-x86_64-release")
check(".", platform = selected, email = getOption("email", "edd@debian.org"))
And apart from the now-commented-out one, they all passed with flying colours. Time to ship...
Awesome.
Do you have the error logs that ubuntu-gcc-devel/Ubuntu 16.04 spat out? It looks like gcc-5 is what comes standard, and simdjson itself needs gcc>=7, so I'm curious if it's wholly out of our hands.
Sure as I get the emails back. For as long as the time window holds:
RcppSimdJson 0.1.0: PREPERROR
Build ID: RcppSimdJson_0.1.0.tar.gz-43f9a5c528574b2a975b1c09a02de125 Platform: Ubuntu Linux 16.04 LTS, R-devel, GCC Submitted: 3 minutes 47.6 seconds ago Build time: 3 minutes 44.7 seconds
See the full build log: HTML: https://builder.r-hub.io/status/RcppSimdJson_0.1.0.tar.gz-43f9a5c528574b2a975b1c09a02de125 Text: https://builder.r-hub.io/status/original/RcppSimdJson_0.1.0.tar.gz-43f9a5c528574b2a975b1c09a02de125 Artifacts: https://artifacts.r-hub.io/RcppSimdJson_0.1.0.tar.gz-43f9a5c528574b2a975b1c09a02de125
Have questions, suggestions or want to report a bug? Please file an issue ticket at GitHub at https://github.com/r-hub/rhub/issues
Thank You for using the R-hub builder.
Also, hold back your awesome 'til you see the timings @knapply has. Now those are awesome :sunglasses:
But I agree with you on the g++ issue for Ubuntu 16.04. That one is probably on me, with a task of tweaking configure to not let it lie to me about C++17 when it really truly genuinely seems to fail at it.
Regarding C++11, I hope it's not simdjson itself that has driven you to C++17 :) If you're using it internally a lot, that's cool.
In 0.4 we reworked simdjson's error code API so that C++11 feels at least as good as the C++17 structured bindings approach. I actually like it better.
The C++17 error code API exposes some limitations of structured bindings: you have to declare 2 new variables, and the type has to go on the right (get<double>()).
// Error API with C++11 get()
double x, y;
auto error = doc["x"].get(x);
if (error) { cerr << error << endl; exit(1); }
error = doc["y"].get(y);
if (error) { cerr << error << endl; exit(1); }

// This is my preferred syntax for C++11, though it's not everyone's cup of tea:
simdjson::error_code error;
double x, y;
if ((error = doc["x"].get(x))) { cerr << error << endl; exit(1); }
if ((error = doc["y"].get(y))) { cerr << error << endl; exit(1); }

// Error API with C++17 structured bindings
auto [x, error] = doc["x"].get<double>();
if (error) { cerr << error << endl; exit(1); }
auto [y, error2] = doc["y"].get<double>();
if (error2) { cerr << error2 << endl; exit(1); }

// Obviously, it looks best with exceptions, which work everywhere :)
double x = doc["x"];
double y = doc["y"];
The only place that I feel still works better in C++17 is iterating over an object, but honestly that doesn't look all that bad in C++11 either:
// Object iteration in C++11
for (auto field : obj) {
    cout << field.key << ": " << field.value << endl;
}

// Object iteration with C++17 structured bindings
for (auto [key, value] : obj) {
    cout << key << ": " << value << endl;
}
I think the object iteration is such a niche thing that it's worth using the C++11 syntax to buy compatibility (especially since it's still pretty good).
Food for thought. RcppSimdJson started out with C++17 only because ... well, you guys did use it. Now that the reach is so much wider I might do things differently.
Turns out there aren't many auto [x, error]'s. Just for a flavor, this code would look like:
for (auto&& object : array) {
    simdjson::dom::element element;
    if (object[key].get(element) == simdjson::SUCCESS) {
        out[i_row] = get_scalar<scalar_T, R_Type, NO_NULLS>(element);
    }
    i_row++;
}
For that last one, actually, if you make get_scalar() capable of taking a simdjson_result, it becomes:
for (auto&& object : array) {
    out[i_row] = get_scalar<scalar_T, R_Type, NO_NULLS>(object[key]);
    i_row++;
}
And this code, with an object iteration, can be directly translated thus:
simdjson::dom::object object;
if (element.get(object) == simdjson::SUCCESS) {
for (auto&& field : object) {
if (cols.schema.find(field.key) == std::end(cols.schema)) {
cols.schema[field.key] = Column<type_policy>{col_index++, Type_Doctor<type_policy>()};
}
cols.schema[key].schema.add_element(value);
}
} else {
return std::nullopt;
}
}
(at_key() will work too, I just tend to use [] for objects :)
The C++17 pieces are less about simdjson and more about dealing with the necessary type dynamism. It currently leans on if constexpr quite a bit to a) limit what seemed like gratuitous code duplication and b) maximize the extreme flexibility it currently offers without compromising any performance. But that doesn't mean it can't be done.
I haven't measured it formally, but I wouldn't be surprised if simdjson's parsing takes < 10% of the total time. The bottleneck is cramming the results into a limited menu of R objects.
Agreed. And not a worry for today --- today is 0.1.0 day to get it in front of hungry windows users!
Anything else to take care of? Otherwise I'll wrap up and ship.
Good on my end!
For what it's worth, RcppSimdJson as is (using C++17) builds and passes on GCC 7+ and Clang 6+, which are what simdjson itself supports (at least last I looked). I think that means we hit every compiler that both R and simdjson can currently use (no idea about Intel and I'm guessing there isn't some secret effort to get R to play along with Visual Studio).
But now I am curious what it would take... 🤔
On CRAN now: https://cran.r-project.org/package=RcppSimdJson
Will take a day or two or three for Windows binaries to be built.
It doesn't have all the bells and whistles of the other JSON packages, but it's definitely a beast.
install.packages("RcppSimdJson") # 🔥🚀
file <- tempfile(fileext = ".json")
download.file("https://github.com/zemirco/sf-city-lots-json/raw/master/citylots.json", destfile = file)
string <- readChar(file, nchars = file.size(file))
bench::mark(
  RcppSimdJson::fparse(string),
  jsonify::from_json(string),
  jsonlite::fromJSON(string),
  check = FALSE,
  filter_gc = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 RcppSimdJson::fparse(string) 868.87ms 868.87ms 1.15 58.6MB 0
#> 2 jsonify::from_json(string) 7.66s 7.66s 0.131 104.5MB 0.653
#> 3 jsonlite::fromJSON(string) 36.15s 36.15s 0.0277 369MB 1.36
bench::mark(
  RcppSimdJson::fload(file),
  jsonify::from_json(file),
  jsonlite::fromJSON(file),
  check = FALSE,
  filter_gc = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 RcppSimdJson::fload(file) 816.47ms 816.47ms 1.22 57.7MB 0
#> 2 jsonify::from_json(file) 7.16s 7.16s 0.140 104.5MB 0.838
#> 3 jsonlite::fromJSON(file) 38.48s 38.48s 0.0260 550.1MB 1.53
Nice work @knapply !
Many many thanks to @knapply @lemire @jkeiser and whoever else exercised magic to make this work on 32-bit Windows, especially in the flavor required by CRAN, notably gcc via mingw and shipped via Rtools.
This sets the stage for a 0.1.0 release. Given that CRAN prefers a weekly cadence, I will aim for an upload by the end of next week (i.e. around Jul 3 or 4), which gives us a few days to smooth and polish other aspects. @knapply @dcooley if you have anything 'pending' and possibly working by end of week, have at it here; otherwise I'll just progress towards an upload.
Very cool to be able to hand the magic of simdjson to a million or two R Windows users. Thanks again for making it happen!