eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library

R crashes when parsing a heavily nested json. #63

Closed GentleGhostCoder closed 3 years ago

GentleGhostCoder commented 3 years ago

Hi there, I cannot parse a certain JSON document because R crashes while parsing it. The JSON is a query response from Prometheus and describes the CPU performance of some servers.

What I have tested is the fparse function with various max-simplify-level parameters. The fminify and is_valid_json functions work fine, but that does not help with the fparse problem. If I reduce the result array, parsing works again. Parsing with jsonlite also works fine.

Unfortunately, I cannot attach a sample file as it is company data.

The Prometheus request looks like this: https://:/api/v1/query?query=node_cpu_seconds_total%7B%7D

The response content size is 50552672 bytes.

The schema looks like this (the "result" array holds up to 205808 objects):

{
  "type": "object",
  "required": [],
  "properties": {
    "status": { "type": "array", "items": { "type": "string" } },
    "data": {
      "type": "object",
      "required": [],
      "properties": {
        "resultType": { "type": "array", "items": { "type": "string" } },
        "result": {
          "type": "array",
          "items": {
            "type": "object",
            "required": [],
            "properties": {
              "metric": {
                "type": "object",
                "required": [],
                "properties": {
                  "__name__": { "type": "array", "items": { "type": "string" } },
                  "alias": { "type": "array", "items": { "type": "string" } },
                  "cluster": { "type": "array", "items": { "type": "string" } },
                  "cpu": { "type": "array", "items": { "type": "string" } },
                  "datacenter": { "type": "array", "items": { "type": "string" } },
                  "instance": { "type": "array", "items": { "type": "string" } },
                  "job": { "type": "array", "items": { "type": "string" } },
                  "mode": { "type": "array", "items": { "type": "string" } }
                }
              },
              "value": {
                "type": "array",
                "items": { "type": "array", "items": { "type": "number" } }
              }
            }
          }
        }
      }
    }
  }
}

lemire commented 3 years ago

The underlying library (simdjson) should not crash.

Could you provide some thoughts on how the rcppsimdjson team might be able to identify and fix the issue, given the information you provided?

eddelbuettel commented 3 years ago

It would be helpful to have a minimally reproducible example. R too should not crash, nor should our glue around simdjson introduce one.

GentleGhostCoder commented 3 years ago

Since I cannot provide a sample file, it is difficult for you to analyze the problem in more depth. I was rather hoping that you could tell me what else I could possibly test or how I could narrow down the problem.

What I can do is create a reproducible sample file that includes generated data for you.

lemire commented 3 years ago

@semmjon The fact that is_valid_json does not crash suggests that the parsing (in C++) works fine. Because rcppsimdjson checks validity by producing a full DOM, it knows that the document can be parsed and materialized as a tree; it actually does so, in full. This narrows down the problem somewhat.
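
The two stages described above can be exercised separately. The following is only a sketch, assuming RcppSimdJson is installed; the tiny `json` payload and the `checked` vector are stand-ins invented for illustration, not the reporter's data. It validates first (C++ parse only), then calls fparse once per `max_simplify_lvl`, from least to most simplification, so that a failure can be attributed to a particular conversion level.

```r
# Stand-in payload with the same shape as one Prometheus result entry.
json <- paste0(
  '{"status":"success","data":{"resultType":"vector","result":[',
  '{"metric":{"cpu":"0","mode":"idle"},',
  '"value":[12345656.643,"12345656.643"]}]}}'
)

checked <- character(0)  # simplification levels that completed

if (requireNamespace("RcppSimdJson", quietly = TRUE)) {
  # Stage 1: parse in C++ only; nothing is converted to R objects.
  stopifnot(RcppSimdJson::is_valid_json(json))

  # Stage 2: parse and convert to R, least simplification first.
  for (lvl in c("list", "vector", "matrix", "data_frame")) {
    res <- RcppSimdJson::fparse(json, max_simplify_lvl = lvl)
    checked <- c(checked, lvl)
    cat("max_simplify_lvl =", lvl, "completed\n")
  }
}
```

If stage 1 passes and stage 2 fails only at some levels, the problem is more likely in the conversion to R objects than in simdjson itself.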

GentleGhostCoder commented 3 years ago

Here is an example where it is already crashing. I could try to analyze more precisely at what array size it crashes.

metric <- '{
            "metric":{
               "__name__":"node_cpu_seconds_total",
               "alias":"datacenteraggregation",
               "cluster":"someCluster",
               "cpu":"0",
               "datacenter":"some-Datacenter",
               "instance":"instance.endpoint:1234",
               "job":"clusters",
               "mode":"iowait"
            },
            "value":[
               12345656.643,
               "12345656.643"
            ]
         }'

test <- paste0('{
   "status":"success",
   "data":{
      "resultType":"vector",
      "result":[
         {
            "metric":{
               "__name__":"node_cpu_seconds_total",
               "alias":"someAlias",
               "cluster":"someCluster",
               "cpu":"0",
               "datacenter":"some-Datacenter",
               "instance":"instance.endpoint:1234",
               "job":"clusters",
               "mode":"idle"
            },
            "value":[
               12345656.643,
               "12345656.643"
            ]
         },
         ',paste(lapply(1:200000,function(x) paste(metric)),collapse=","),'
      ]
   }
}')

test <- RcppSimdJson::fparse(test)

lemire commented 3 years ago

@semmjon Is the 200000 parameter minimal? That is, it only crashes when it is 200000, but does not when you have 100000. Is that it? And you confirm that is_valid_json on this input works, right?

GentleGhostCoder commented 3 years ago

I just tested it again, and it starts to crash at >65290. And yes, is_valid_json works.
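
A threshold like this can be located systematically by bisection. The sketch below is generic base R and makes two assumptions: that failure is monotone in the array size, and that a predicate `fails(n)` exists. Both `bisect_first_failure` and `fake_fails` are hypothetical names introduced here; in practice the predicate would have to run each trial in a separate R process, because a crash takes the whole session down.

```r
# Smallest n in (lo, hi] with fails(n) TRUE, assuming fails(lo) is FALSE,
# fails(hi) is TRUE, and failure is monotone in n.
bisect_first_failure <- function(fails, lo, hi) {
  while (hi - lo > 1) {
    mid <- lo + (hi - lo) %/% 2
    if (fails(mid)) {
      hi <- mid   # failure at mid: the threshold is at or below mid
    } else {
      lo <- mid   # success at mid: the threshold is above mid
    }
  }
  hi
}

# Stand-in predicate mimicking the reported behaviour (illustration only).
fake_fails <- function(n) n > 65290

first_bad <- bisect_first_failure(fake_fails, lo = 1, hi = 200000)
```

With about 18 trials this covers the range from 1 to 200000, which is cheaper than stepping through sizes by hand.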

eddelbuettel commented 3 years ago

BTW you can use ``` r (without the space) to open an R code segment and ``` (ditto) to close it. I have my hands full right now but I'll take a look later. Also paging @knapply for good measure.

GentleGhostCoder commented 3 years ago

Hmm, weird: it only seems to crash on the RStudio Server, not in my local RStudio. Possibly a problem with the R version?

The server has R version 3.6.3; locally I have 4.0.3.

eddelbuettel commented 3 years ago

Is this on purpose:

"value":[
               12345656.643,
               "12345656.643"
            ]

?

GentleGhostCoder commented 3 years ago

Yes, I took it from the original data.

eddelbuettel commented 3 years ago

Does not reproduce:

edd@rob:~/git/rcppsimdjson(master)$ head issue63.R
metric <- '{
            "metric":{
               "__name__":"node_cpu_seconds_total",
               "alias":"datacenteraggregation",
               "cluster":"someCluster",
               "cpu":"0",
               "datacenter":"some-Datacenter",
               "instance":"instance.endpoint:1234",
               "job":"clusters",
               "mode":"iowait"
edd@rob:~/git/rcppsimdjson(master)$ 
edd@rob:~/git/rcppsimdjson(master)$ tail issue63.R 
               "12345656.643"
            ]
         },
         ',paste(lapply(1:200000,function(x) paste(metric)),collapse=","),'
      ]
   }
}')

test <- RcppSimdJson::fparse(test)
cat("Still here.\n")
edd@rob:~/git/rcppsimdjson(master)$ 
edd@rob:~/git/rcppsimdjson(master)$ Rscript issue63.R
Still here.
edd@rob:~/git/rcppsimdjson(master)$

knapply commented 3 years ago

"works on my machine"

Can you provide a sessionInfo()?

GentleGhostCoder commented 3 years ago

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.6.3 htmltools_0.5.1 tools_3.6.3 yaml_2.2.1 rmarkdown_2.6 knitr_1.30 xfun_0.20 digest_0.6.27 packrat_0.5.0 rlang_0.4.10 evaluate_0.14 RcppSimdJson_0.1.3

eddelbuettel commented 3 years ago

Can you please do what I did and copy your code to a file, add a cat("Done\n") or the like at the end, and run it in a terminal via Rscript?

If that passes, it means it's not us but some weird interaction or resource starvation happening with your RStudio session. We can try to narrow it down to an RStudio session without xfun, knitr, ... and all those other packages which RcppSimdJson does not need. There should not be an interaction here but one never knows...

In short, we need something reproducible. Which we currently do not have. (Though I appreciate your code snippet. It's a valid first step, but in this case one that allowed us to disprove the claim too.)
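
The terminal check requested above can also be driven from R. This is a minimal sketch in base R; `run_isolated` is a hypothetical helper name, and the `--vanilla` flag is an added assumption meant to keep site and user startup files out of the picture. The code is written to a temporary script with a trailing cat("Done\n"), run in a fresh Rscript process, and the helper reports whether the child exited cleanly and printed the marker.

```r
run_isolated <- function(code) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))
  # Append the marker so that a silent early exit is detectable.
  writeLines(c(code, 'cat("Done\\n")'), script)

  out <- suppressWarnings(
    system2(file.path(R.home("bin"), "Rscript"),
            c("--vanilla", shQuote(script)),
            stdout = TRUE, stderr = TRUE)
  )
  # system2() attaches a "status" attribute only for a non-zero exit code.
  status <- attr(out, "status")
  list(ok = is.null(status) && "Done" %in% out, output = out)
}

res <- run_isolated("x <- 1 + 1")
```

A crash in the child then shows up as `ok = FALSE` while the calling session stays alive.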

GentleGhostCoder commented 3 years ago

sgeist@rstudio:~$ head crash_rcppsimdjson.R

metric <- '{
            "metric":{
               "__name__":"node_cpu_seconds_total",
               "alias":"datacenteraggregation",
               "cluster":"someCluster",
               "cpu":"0",
               "datacenter":"some-Datacenter",
               "instance":"instance.endpoint:1234",
               "job":"clusters",
sgeist@rstudio:~$ Rscript crash_rcppsimdjson.R
still here.

OK, it works from the console.

GentleGhostCoder commented 3 years ago

After I reinstalled RcppSimdJson the bug was solved :roll_eyes: Very mysterious and incomprehensible ... probably not a common case.

knapply commented 3 years ago

Was this the crash?

sessionInfo()
# R version 3.6.3 (2020-02-29)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 18.04.4 LTS
# 
# Matrix products: default
# BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
# LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
# 
# locale:
#   [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
# [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.6.3 tools_3.6.3

RStudio.Version()[c("mode", "version")]
# $mode
# [1] "server"
# 
# $version
# [1] ‘1.2.5001’

[image]

GentleGhostCoder commented 3 years ago

Yes. At this point, thank you for your help and support, and keep it up (I think the package is great :thumbsup:)

eddelbuettel commented 3 years ago

Ok, no worries. I'll close this then -- feel free to reopen if it rears its head again.

lemire commented 3 years ago

Did you guys figure out what caused the crash?

eddelbuettel commented 3 years ago

No. And as I wrote recently on the rcpp-devel list apropos one of the micro-releases to the GitHub-hosted repo, there is some apparent instability in the toolchain. My reverse-dependency universe is now ~2200 packages. I worked on a branch refactoring some internals last year and stopped when an initial run showed ~10% (give or take) breaking. The refactor is important though (an internal how-to-grow-large-objects thing), and @enchufa2 recently picked up the branch and finished it. We once again had ~10% breakage ... but I had just released 1.0.6 and seen that a certain nexus of packages around rstan would fail tests "in an odd way" at run-time. (Failing to compile is more blunt and an obvious API change.) So we did that and, lo and behold, the breakage went away. (There was still some related to one or two other CRAN packages.)

So all this sermon just to say that "something" appears to be binary-toolchain-fragile but I do not know what it is. Recompilation helps, and that was the change here too.

RcppSimdJson is a good testbed as simdjson is clean -- we don't schlepp any other depends in. So in short: not sure what it is, but rebuilding makes it go away.
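
Since rebuilding was the only change that helped, one cheap thing to look at is which R version an installed package was built under. The sketch below is a heuristic only: the thread never identifies the cause, and a version mismatch is not known to be it. `built_under` and `needs_rebuild` are hypothetical helper names; they read the Built field that R records in an installed package's DESCRIPTION.

```r
# R version an installed package was built under, or NA if unavailable.
built_under <- function(pkg) {
  built <- suppressWarnings(utils::packageDescription(pkg, fields = "Built"))
  if (is.na(built)) return(NA_character_)
  # The Built field starts like "R 3.6.3; x86_64-pc-linux-gnu; ..."
  sub("^R ([0-9.]+);.*$", "\\1", built)
}

# TRUE if the package was built under a different R than the running one.
needs_rebuild <- function(pkg) {
  b <- built_under(pkg)
  !is.na(b) && package_version(b) != getRversion()
}

built_under("stats")

# One thing to try if a rebuild looks warranted (not run here):
# install.packages("RcppSimdJson", type = "source")
```

Reinstalling from source recompiles the package against the headers and toolchain currently on the machine, which matches what resolved the problem in this thread.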