JuliaInterop / JuliaCall

Embed Julia in R
https://non-contradiction.github.io/JuliaCall/index.html

Base.Meta.ParseError("extra token after end of expression") #161

Closed phillc73 closed 3 years ago

phillc73 commented 3 years ago

I was trying to benchmark some JuliaCall code against native R code. The JuliaCall standalone code works, but when trying to include this in a microbenchmark an error is returned.

This works:

library(JuliaCall)

julia <- julia_setup(JULIA_HOME = "/home/phillc/bin/julia-1.5.3/bin/")

julia_library("Query")
julia_library("DataFrames")

julia_assign("mtcars", mtcars)

julia_command("
x = @from i in mtcars begin
    @where i.disp > 200
    @select {i.mpg, i.cyl}
    @collect DataFrame
end
")

This fails:

library(microbenchmark)
library(dplyr)
library(data.table)

# Make a data.table
mtcars_data_table <- data.table(mtcars)

test_bench <- microbenchmark(times=500,
                            # Queryjl
                            Queryjl = {julia_command("
                                        x = @from i in mtcars begin
                                            @where i.disp > 200
                                            @select {i.mpg, i.cyl}
                                            @collect DataFrame
                                        end
                                        ")},
                             # data.table library
                             data_table = {mtcars_data_table[disp >= 200, c("mpg", "cyl"),]},
                             # dplyr library
                             dplyr = {mtcars %>%
                               dplyr::filter(disp >= 200) %>%
                               dplyr::select(mpg,cyl)}
                            )

Error is:

Base.Meta.ParseError("extra token after end of expression")
Stacktrace:
 [1] parse(::String; raise::Bool, depwarn::Bool) at ./meta.jl:220
 [2] parse at ./meta.jl:215 [inlined]
 [3] eval_string(::String) at /home/phillc/R/x86_64-pc-linux-gnu-library/4.0/JuliaCall/julia/setup.jl:203
 [4] docall(::Ptr{Nothing}) at /home/phillc/R/x86_64-pc-linux-gnu-library/4.0/JuliaCall/julia/setup.jl:176

R sessionInfo()

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8          LC_NUMERIC=C                  LC_TIME=en_GB.UTF-8           LC_COLLATE=en_GB.UTF-8        LC_MONETARY=en_GB.UTF-8       LC_MESSAGES=en_GB.UTF-8      
 [7] LC_PAPER=en_GB.UTF-8          LC_NAME=en_GB.UTF-8           LC_ADDRESS=en_GB.UTF-8        LC_TELEPHONE=en_GB.UTF-8      LC_MEASUREMENT=en_GB.UTF-8    LC_IDENTIFICATION=en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.13.7     dplyr_1.0.3           microbenchmark_1.4-7  JuliaCall_0.17.2.9000

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6       assertthat_0.2.1 crayon_1.3.4     R6_2.5.0         DBI_1.1.1        lifecycle_0.2.0  magrittr_2.0.1   pillar_1.4.7     rlang_0.4.10     vctrs_0.3.6      generics_0.1.0  
[12] ellipsis_0.3.1   tools_4.0.4      glue_1.4.2       purrr_0.3.4      xfun_0.21        compiler_4.0.4   pkgconfig_2.0.3  tidyselect_1.1.0 knitr_1.31       tibble_3.0.5    

Non-Contradiction commented 3 years ago

Thank you very much for the feedback!

The error message is about parsing, so I suspect the issue is the formatting of the Julia command. It works for me if we define the query in a separate string and then pass that string to the function, like this:

query <- "
x = @from i in mtcars begin
    @where i.disp > 200
    @select {i.mpg, i.cyl}
    @collect DataFrame
end
"

library(microbenchmark)
library(dplyr)
library(data.table)

# Make a data.table
mtcars_data_table <- data.table(mtcars)

test_bench <- microbenchmark(times=500,
                             # Queryjl
                             Queryjl = {julia_command(query)},
                             # data.table library
                             data_table = {mtcars_data_table[disp >= 200, c("mpg", "cyl"),]},
                             # dplyr library
                             dplyr = {mtcars %>%
                                 dplyr::filter(disp >= 200) %>%
                                 dplyr::select(mpg,cyl)}
)

I suggest this approach whenever a multi-line Julia command string is nested inside other R code.

I also see that you are trying to benchmark the code. julia_command is quite inefficient and is designed for interactive use: whatever command it evaluates, it behaves like eval(parse(text = "....")) in R. In addition, julia_command executes in the global scope, which is the first thing the Julia performance tips warn about: https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-tips
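As a rough base-R analogy for the point above (this is only an illustration on the R side, not JuliaCall's actual internals), evaluating a command string means parsing it again on every call, while a function is defined once and then simply called. The names cmd and count_large_disp are made up for this sketch:

```r
# A command held as text must be parsed and then evaluated on every call,
# and it runs against whatever is in the global environment.
cmd <- "sum(mtcars$disp > 200)"
via_string <- eval(parse(text = cmd))

# A function is defined once; later calls skip the parse step and
# receive their data as an argument instead of reading a global.
count_large_disp <- function(df) sum(df$disp > 200)
via_function <- count_large_disp(mtcars)

# Both routes give the same answer; only the overhead differs.
stopifnot(identical(via_string, via_function))
```

The same split between "evaluate a string" and "call a function" is what separates julia_command from a direct function call on the Julia side.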

phillc73 commented 3 years ago

Thanks! I've managed to find a way to make it work.

This works:

query_a <- "
x = @from i in mtcars begin
    @where i.disp > 200
    @select {i.mpg, i.cyl}
    @collect DataFrame
end
"

julia_command(query_a)

However, this does not work:

query_b <- "
         x = @from i in mtcars begin
             @where i.disp > 200
             @select {i.mpg, i.cyl}
             @collect DataFrame
        end
        "

julia_command(query_b)

It seems like white space is somehow very important here when parsing strings. If one looks at the two query strings the difference is clear, but I'm still not really clear on why parsing one works, but not the other.

r$> query_a
[1] "\nx = @from i in mtcars begin\n    @where i.disp > 200\n    @select {i.mpg, i.cyl}\n    @collect DataFrame\nend\n"

r$> query_b
[1] "\n         x = @from i in mtcars begin\n             @where i.disp > 200\n             @select {i.mpg, i.cyl}\n             @collect DataFrame\n        end\n        "

If the problem is known, perhaps specifically the white space at the end of query_b (a guess based on the "extra token after end of expression" error message), could JuliaCall deal with this before passing strings to Julia?

Edit

The issue is the trailing white space.

This works:

library(stringi)

query <- stri_trim_both("
                        x = @from i in mtcars begin
                          @where i.disp > 200
                          @select {i.mpg, i.cyl}
                          @collect DataFrame
                       end
                        ")

julia_eval(query)

I have a way forward to not have to worry too much about this now. However, it still might be useful for others if JuliaCall could handle this string parsing issue.
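For anyone who would rather not add a dependency, base R's trimws() should be usable for the same trimming step. This is a minimal sketch that only shows the string handling; it does not call Julia, and query_trimmed is just an illustrative name:

```r
# The indented query from above, including the trailing newline and spaces.
query_b <- "
         x = @from i in mtcars begin
             @where i.disp > 200
             @select {i.mpg, i.cyl}
             @collect DataFrame
        end
        "

# trimws() strips leading and trailing white space (spaces, tabs, newlines)
# by default; indentation inside the query is left untouched.
query_trimmed <- trimws(query_b)

# The string now starts at the assignment and ends exactly at "end".
stopifnot(startsWith(query_trimmed, "x ="), endsWith(query_trimmed, "end"))

# The trimmed string can then be passed on, e.g. julia_command(query_trimmed)
```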


On a related note, is there a faster way to do this than using julia_command? I tried julia_eval but results were pretty much identical.

Non-Contradiction commented 3 years ago

Thank you very much for exploring this! I will trim the string before julia_command sends it to Julia for evaluation.

julia_eval is almost identical to julia_command. The only differences are whether the evaluation result is transferred from Julia to R, and whether the Julia display mechanism is invoked.

As the performance tips linked above show, Julia really emphasizes writing functions, and the same holds for JuliaCall.

  1. So I would first consider writing a Julia function like this:
    julia_command("
    function queryjl(mtcars)
        x = @from i in mtcars begin
            @where i.disp > 200
            @select {i.mpg, i.cyl}
            @collect DataFrame
        end
        x
    end")

    When you evaluate the above, you should see queryjl (generic function with 1 method), which says the function was defined successfully on the Julia side. The function is then ready to use through calls such as julia_command("queryjl(mtcars)"). However, this only addresses the second point I mentioned, that code executed in the global scope is slow. julia_command still needs to invoke eval and parse in Julia.

  2. So we could further consider using the julia_call interface, which calls the Julia function directly instead of evaluating and parsing a string command. In this case, you can just use julia_call("queryjl", mtcars). Note that this way we do not need to julia_assign("mtcars", mtcars) first, because JuliaCall looks up the R object mtcars, converts it to a Julia object, and then calls the Julia function.
  3. The R->Julia conversion actually takes some time. Similar to mtcars_data_table <- data.table(mtcars), we could further consider using JuliaObject to do the R->Julia conversion beforehand:
    # Make a JuliaObject for JuliaCall to use in julia_call function 
    ## without type conversion over and over again
    mtcars_julia_object <- JuliaObject(mtcars)

    JuliaObject does the R->Julia conversion, and the result is a wrapper on the R side that points to the object on the Julia side. When JuliaCall sees the JuliaObject, it knows that the conversion is already done and just grabs the Julia object it points to. We can now use it like this: julia_call("queryjl", mtcars_julia_object).

In summary, we can benchmark the different methods roughly like this:

test_bench <- microbenchmark(times=500,
                             # Queryjl
                             Queryjlcommand = {julia_command("queryjl(mtcars);")},
                             Queryjlcall_withoutconversionbefore = {julia_call("queryjl", mtcars)},
                             Queryjlcall = {julia_call("queryjl", mtcars_julia_object)},
                             # data.table library
                             data_table = {mtcars_data_table[disp >= 200, c("mpg", "cyl"),]},
                             # dplyr library
                             dplyr = {mtcars %>%
                                 dplyr::filter(disp >= 200) %>%
                                 dplyr::select(mpg,cyl)}
)

A few notes on the different methods: julia_command does not convert the result on the Julia side back into R, but julia_call and julia_eval do (by default). Both julia_eval and julia_call have an argument called need_return, which controls whether and how the result is returned to R.
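To see how much of the timing comes from moving the result back into R, need_return can be varied in the benchmark. The sketch below assumes JuliaCall is already set up, that Query and DataFrames are loaded, and that queryjl has been defined on the Julia side as in step 1. The wrapper name bench_need_return is made up for this example, and the values "R", "Julia" and "None" are what I recall from the julia_call help page, so please check ?julia_call before relying on them:

```r
# Hypothetical helper: nothing touches Julia until the function is called.
# Assumes queryjl() already exists on the Julia side (see step 1 above).
bench_need_return <- function(times = 100) {
  # Convert once, so the R->Julia conversion is not part of the timings.
  mtcars_julia_object <- JuliaCall::JuliaObject(mtcars)

  microbenchmark::microbenchmark(
    times = times,
    # Default: the result is converted back into an R object.
    to_R = JuliaCall::julia_call("queryjl", mtcars_julia_object),
    # The result stays in Julia; R only receives a wrapper.
    to_Julia = JuliaCall::julia_call("queryjl", mtcars_julia_object,
                                     need_return = "Julia"),
    # Nothing is handed back to R at all.
    no_return = JuliaCall::julia_call("queryjl", mtcars_julia_object,
                                      need_return = "None")
  )
}
```

Calling bench_need_return() after the setup above should then give one timing row per return mode.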

Hope the information is helpful. I think I will write this up as a vignette or something similar.

phillc73 commented 3 years ago

Thanks! That's a tonne of good information. I really appreciate it.

Everything looks good, apart from one thing.

This sequence does not work:

julia_command("
function queryjl(mtcars)
    x = @from i in mtcars begin
        @where i.disp > 200
        @select {i.mpg, i.cyl}
        @collect DataFrame
    x
    end
end")

julia_command("queryjl(mtcars);")

Results in:

UndefVarError: mtcars not defined
Stacktrace:
 [1] top-level scope at none:1
 [2] eval(::Module, ::Any) at ./boot.jl:331
 [3] eval_string(::String) at /home/phillc/R/x86_64-pc-linux-gnu-library/4.0/JuliaCall/julia/setup.jl:203
 [4] docall(::Ptr{Nothing}) at /home/phillc/R/x86_64-pc-linux-gnu-library/4.0/JuliaCall/julia/setup.jl:176

However, both of these do return the correct result:

julia_call("queryjl", mtcars)

mtcars_julia_object <- JuliaObject(mtcars)
julia_call("queryjl", mtcars_julia_object)

Non-Contradiction commented 3 years ago

In julia_command("queryjl(mtcars);"), mtcars refers to a variable on the Julia side, so we need to call julia_assign("mtcars", mtcars) first, as in your original code. In both julia_call("queryjl", mtcars_julia_object) and julia_call("queryjl", mtcars), mtcars_julia_object and mtcars refer to variables on the R side, so no julia_assign is needed.