e-kotov / rJavaEnv

Java Environments for R Projects
http://www.ekotov.pro/rJavaEnv/

Consider the `R CMD javareconf` #3

Open e-kotov opened 1 month ago

e-kotov commented 1 month ago

"To configure R to point to the desired Java installation, use the R CMD javareconf command. " Source: https://solutions.posit.co/envs-pkgs/using-rjava/#reconfigure-r

How to address that in the package? Do we even care about that, as long as we are not compiling any Java software or Java-dependent packages from source? Is the package going to meet the needs of 99% of its potential users without managing this R CMD javareconf issue, which will probably require an R session restart to take effect?

e-kotov commented 2 weeks ago

It does seem to be an issue on certain systems. See #8 and https://github.com/ipeaGIT/r5r/issues/393 . Testing in https://github.com/e-kotov/r5r-containerized

e-kotov commented 2 weeks ago

@s-u, hello, may I please have your opinion on what is happening?

Testing the issue here: https://github.com/e-kotov/r5r-containerized

  1. https://github.com/e-kotov/r5r-containerized/tree/experiment-no-javareconf

First, we just install and link Java (by setting the environment variables JAVA_HOME and PATH) and try to check the Java version using rJava::.jinit(); rJava::.jcall('java.lang.System', 'S', 'getProperty', 'java.version'). We do this first in a child process, and then from the current R session, where the Java environment is already set.

Here's how this goes:

> java_distr <- rJavaEnv::java_download(21)
Detected platform: linux
Detected architecture: x64
You can change the platform and architecture by specifying the `platform` and `arch` arguments.
Downloading Java 21 (Corretto) for linux x64 to
/home/rstudio/.cache/R/rJavaEnv/distrib/amazon-corretto-21-x64-linux-jdk.tar.gz
File already exists. Skipping download.
> java_home <- rJavaEnv::java_install(java_distr)
Java distribution amazon-corretto-21-x64-linux-jdk.tar.gz already unpacked at
/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
✔ Current R Session: JAVA_HOME and PATH set to /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
✔ Current R Project/Working Directory: JAVA_HOME and PATH set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21' in .Rprofile in '/home/rstudio/.Rprofile'
Java 21 (amazon-corretto-21-x64-linux-jdk.tar.gz) for linux x64 installed at
/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21 and symlinked to
/home/rstudio/rjavaenv/linux/x64/21
> 
> print(java_home)
[1] "/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21"
> 
> rJavaEnv::java_check_version_cmd() #
JAVA_HOME: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
Java path: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21/bin/java
Java version: "openjdk version \"21.0.3\" 2024-04-16 LTS OpenJDK Runtime Environment Corretto-21.0.3.9.1
(build 21.0.3+9-LTS) OpenJDK 64-Bit Server VM Corretto-21.0.3.9.1 (build 21.0.3+9-LTS, mixed mode,
sharing)"
[1] TRUE
> rJavaEnv::java_check_version_rjava() # the internals of the function are basically below
Using current session's JAVA_HOME: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
With the user-specified JAVA_HOME rJava and other rJava/Java-based packages will use Java version:
"21.0.3"
[1] TRUE
> 
> Sys.getenv("JAVA_HOME")
[1] "/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21"
> 
> # Check Java version if JAVA_HOME
> # is set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
> # in a separate (but child) R process
> r_script <- "
+ tryCatch({
+     java_home <- '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
+     Sys.setenv(JAVA_HOME = java_home)
+     old_path <- Sys.getenv('PATH')
+     new_path <- file.path(java_home, 'bin')
+     Sys.setenv(PATH = paste(new_path, old_path, sep = .Platform$path.sep))
+     suppressWarnings(rJava::.jinit())
+     suppressWarnings(java_version <- rJava::.jcall('java.lang.System', 
+       'S', 'getProperty', 'java.version'))
+     message <- cli::format_message('rJava and other rJava/Java-based packages will use Java version: {.val {java_version}}')
+     print(message)
+ }, error = function(e) {
+     message <- cli::format_message('An error occurred: {.val {e$message}}')
+     print(message)
+ })
+ "
> 
> 
> script_file <- tempfile(fileext = ".R")
> writeLines(r_script, script_file)
> 
> output <- system2("Rscript", args = script_file, stdout = TRUE, stderr = TRUE)
> 
> cat(output, sep = "\n")
[1] "rJava and other rJava/Java-based packages will use Java version: \"21.0.3\""
> file.remove(script_file)
[1] TRUE
> 
> 
> # Check Java version if JAVA_HOME
> # is set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
> # in the current R session
> rJava::.jinit()
> rJava::.jcall("java.lang.System", "S", "getProperty", "java.version")
[1] "11.0.23"
> 

So if we do not run sudo R CMD javareconf when building the container, the child process detects Java 21, and the current process detects Java 11 (see the last three commands and their output). The question is: why? I am especially surprised that the child process gets it right, while in the current R session, despite correctly set JAVA_HOME and PATH, R still picks up the system Java 11...

  2. https://github.com/e-kotov/r5r-containerized/tree/experiment-with-javareconf

Now we are getting smarter: right after installing {rJavaEnv}, we set the environment in the container to the new Java 21 and run R CMD javareconf.

> java_distr <- rJavaEnv::java_download(21)
Detected platform: linux
Detected architecture: x64
You can change the platform and architecture by specifying the `platform` and `arch` arguments.
Downloading Java 21 (Corretto) for linux x64 to
/home/rstudio/.cache/R/rJavaEnv/distrib/amazon-corretto-21-x64-linux-jdk.tar.gz
File already exists. Skipping download.
> java_home <- rJavaEnv::java_install(java_distr)
Java distribution amazon-corretto-21-x64-linux-jdk.tar.gz already unpacked at
/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
✔ Current R Session: JAVA_HOME and PATH set to /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
✔ Current R Project/Working Directory: JAVA_HOME and PATH set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21' in .Rprofile in '/home/rstudio/.Rprofile'
Java 21 (amazon-corretto-21-x64-linux-jdk.tar.gz) for linux x64 installed at
/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21 and symlinked to
/home/rstudio/rjavaenv/linux/x64/21
> 
> print(java_home)
[1] "/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21"
> 
> rJavaEnv::java_check_version_cmd() #
JAVA_HOME: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
Java path: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21/bin/java
Java version: "openjdk version \"21.0.3\" 2024-04-16 LTS OpenJDK Runtime Environment
Corretto-21.0.3.9.1 (build 21.0.3+9-LTS) OpenJDK 64-Bit Server VM Corretto-21.0.3.9.1 (build
21.0.3+9-LTS, mixed mode, sharing)"
[1] TRUE
> rJavaEnv::java_check_version_rjava() # the internals of the function are basically below
Using current session's JAVA_HOME: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
With the user-specified JAVA_HOME rJava and other rJava/Java-based packages will use Java version:
"21.0.3"
[1] TRUE
> 
> Sys.getenv("JAVA_HOME")
[1] "/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21"
> 
> # Check Java version if JAVA_HOME
> # is set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
> # in a separate (but child) R process
> r_script <- "
+ tryCatch({
+     java_home <- '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
+     Sys.setenv(JAVA_HOME = java_home)
+     old_path <- Sys.getenv('PATH')
+     new_path <- file.path(java_home, 'bin')
+     Sys.setenv(PATH = paste(new_path, old_path, sep = .Platform$path.sep))
+     suppressWarnings(rJava::.jinit())
+     suppressWarnings(java_version <- rJava::.jcall('java.lang.System', 
+       'S', 'getProperty', 'java.version'))
+     message <- cli::format_message('rJava and other rJava/Java-based packages will use Java version: {.val {java_version}}')
+     print(message)
+ }, error = function(e) {
+     message <- cli::format_message('An error occurred: {.val {e$message}}')
+     print(message)
+ })
+ "
> 
> 
> script_file <- tempfile(fileext = ".R")
> writeLines(r_script, script_file)
> 
> output <- system2("Rscript", args = script_file, stdout = TRUE, stderr = TRUE)
> 
> cat(output, sep = "\n")
[1] "rJava and other rJava/Java-based packages will use Java version: \"21.0.3\""
> file.remove(script_file)
[1] TRUE
> 
> 
> # Check Java version if JAVA_HOME
> # is set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
> # in the current R session
> rJava::.jinit()
> rJava::.jcall("java.lang.System", "S", "getProperty", "java.version")
[1] "21.0.3"
> 

Now the Java versions match.

This problem does not happen on my aarch64 macOS machine, but it did happen to a person with an Intel-based Mac, as documented here: https://github.com/ipeaGIT/r5r/issues/393 . Also, this does not seem to be an issue on Windows.

s-u commented 2 weeks ago

Changing env vars like PATH or JAVA_HOME manually is not supported and will lead to chaos, as it will break the entire setup of other variables. Also note that Rscript cannot be invoked directly without R since it bypasses the Java configuration, so use R CMD Rscript if you want to use Java inside Rscript. I have no idea what rJavaEnv is doing, but the code above looks highly illegal.

The whole point of javareconf is to automatically detect all settings and set all env vars correctly during R start-up such that they work with the JDK/JRE present at the time of invocation (for details see $R_HOME/etc/javaconf and $R_HOME/etc/ldpaths). If you want to manually override the R behavior, you're on your own and you have to make sure you set all the necessary environment variables (they are OS-dependent and so is the whole Java start-up procedure: macOS and Windows use dynamic loading of JRI, while other Unix systems use linking). Some variables must be set before the R process is started, so they cannot be changed at run-time (which is why javareconf is used).
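
For readers who want to see what javareconf recorded for their own R installation, the two files mentioned above can be inspected from R. This is only a minimal sketch, not part of rJava or rJavaEnv: the helper name `read_java_settings()` is hypothetical, and it assumes a Unix-alike installation where `etc/javaconf` and `etc/ldpaths` exist (where they are absent, it simply returns an empty vector).

```r
# Hypothetical helper: list the Java-related lines that R keeps in its own
# configuration files (the ones referred to above as javaconf and ldpaths).
read_java_settings <- function(etc_dir = R.home("etc")) {
  files <- file.path(etc_dir, c("javaconf", "ldpaths"))
  files <- files[file.exists(files)]   # the files may be absent, e.g. on Windows
  lines <- unlist(lapply(files, readLines, warn = FALSE))
  grep("JAVA", lines, value = TRUE, fixed = TRUE)   # keep lines mentioning JAVA
}

# Usage: show the settings of the R installation running this code.
print(read_java_settings())
```

Comparing this output with the value of JAVA_HOME in the session is one way to spot a mismatch between what R was configured with and what the session variables claim.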

e-kotov commented 2 weeks ago

@s-u thank you for having a look! {rJavaEnv} aims to provide users with a Java installation in one simple command, allowing them to focus on the Java-dependent R package they are using for their project without dealing with Java itself.

I understand that changing environment variables like PATH and JAVA_HOME inside the R session is a bit of a hack, but it's necessary to simplify things for the user. Behind the scenes, {rJavaEnv} overrides these environment variables (both in the current session and by adding code to do so in .Rprofile). This approach works well enough on most setups I have tested, allows users to run the Java-dependent packages, and spares them from having to worry about javareconf.

It seems that system linking versus dynamic loading is what makes things work on macOS and Windows but causes issues in a Linux container. I will also check for other environment variable differences between the setups that work and those that do not.

s-u commented 1 week ago

@e-kotov Please read the above comment - I would strongly suggest that you familiarise yourself with the concepts involved here first - you should really understand how rJava works before you can start thinking about hacking it to make it do things that it was not designed to do. You cannot change the JDK from a running process, because the link paths are set by the executable at launch time (see LD_LIBRARY_PATH on Linux) - that's what the R shell script does - it sets up the environment such that the actual exec/R process will load JNI libraries from the location specified by javareconf (see the files quoted above to understand how it works). You can only change the linking path for future R processes that you start after any change to the environment vars, but not for the current process (there is a "hacky" way involving masking dynamic libraries, but that is apparently way beyond the scope here).
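
The launch-time point can be illustrated with a small check of whether the library path seen by the session contains a directory under the current JAVA_HOME. This is a rough heuristic sketch under stated assumptions: the function name `ld_path_uses_java_home()` is hypothetical, it only looks at LD_LIBRARY_PATH (so it is meaningful on Linux only), it does not resolve symlinks, and it reflects the launch-time value only if the variable has not been modified inside the session.

```r
# Hypothetical check: does the library path contain a directory that lies
# inside the given JAVA_HOME?
ld_path_uses_java_home <- function(java_home = Sys.getenv("JAVA_HOME"),
                                   ld_path = Sys.getenv("LD_LIBRARY_PATH")) {
  if (!nzchar(java_home) || !nzchar(ld_path)) return(FALSE)
  entries <- strsplit(ld_path, .Platform$path.sep, fixed = TRUE)[[1]]
  # Compare with a trailing slash so that ".../21" does not match ".../211".
  prefix <- paste0(sub("/+$", "", java_home), "/")
  any(startsWith(paste0(entries, "/"), prefix))
}

# Usage: FALSE suggests that JAVA_HOME was changed after R was launched,
# or that the library path was set up for a different JDK.
print(ld_path_uses_java_home())
```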

In addition, modifying .Rprofile is a really, really bad idea and is illegal on CRAN (for very good reason). I hope you are not trying to change PATH or JAVA_HOME in .Rprofile as that will result in a broken R installation.

Finally, there is a very good reason why R has a global Java setting that is used at start time: the JVM can only be loaded once, so you cannot have two packages loading different JVM versions.

e-kotov commented 1 week ago

@s-u thank you for more detailed explanations of why my current approach raises concerns. I did not have time yet to read more on the technical details you are referring to, but that is definitely on my to-do list, as I want the package to succeed.

I am not changing the global .Rprofile in the home user directory (even though there is a request to add this feature #6 ), but only adding lines to the current working/project directory .Rprofile, in the same way renv does. So I am not sure if you meant only the global one, or also the .Rprofile in the current working/project directory. If you referred to the latter too, then I am surprised to hear that it is "illegal on CRAN", as renv does that. So I assume you are only referring to the global .Rprofile.

I know very well about the issue of not being able to re-initialise Java without restarting the R session. I have experienced it myself multiple times, and this is one of the key reasons I started developing rJavaEnv. I have to (in my opinion) resort to some hacky approaches to make things simpler for the user while not breaking their system. This is why if you read rJavaEnv code (I'm not suggesting that you do, I'm sure you don't have time for this), you will see that it tries to do things with as little intervention on the user's system as possible:

And from my point of view, however hacky all that seems, if it works, solves the user's problem, and does not break things (which so far it does not seem to do), it is worth doing.

To conclude, I will look into the issue that is coming up with my current approach on Linux and see what ways there are to address it. Perhaps instead of more hacking, it will be a simple console message to the user, informing them to run R CMD javareconf.

I suspect you see this project as amateur (which it is, I'm not a software engineer) and scary ("illegal on CRAN") from the perspective of a person with much more experience than me, but I hope you can also appreciate the pain of the users who try to configure Java to just get to the analysis part, instead of having to deal with technical issues. And if rJavaEnv helps with that, I would say it is worth continuing to work on. You could argue that users just need to read the documentation. But we all know that it is hard to read all the documentation on every little bit of the software you are using. I will definitely read more about how Java interacts with R before I proceed.

s-u commented 1 week ago

@e-kotov Yes, I was referring to ~/.Rprofile - if you are talking about a separate directory for a separate process then that's fine. However, for reasons explained above .Rprofile is way too late - you have to set the env vars before the R process is started so you'd be better off creating a startup-script.
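
One way to picture the startup-script idea from within R is to launch a fresh R process with JAVA_HOME already in its environment, rather than changing the variable in the running session. The sketch below is an illustration only: `run_with_java_home()` is a hypothetical helper, it assumes a Unix-alike system, and it goes through `R CMD Rscript` as recommended earlier in this thread. Whether JAVA_HOME alone is sufficient for a given JVM is exactly the open question discussed here.

```r
# Hypothetical helper: run an R script in a new R process that is started
# with JAVA_HOME already set in its environment.
run_with_java_home <- function(script, java_home) {
  r_bin <- file.path(R.home("bin"), "R")
  system2(
    r_bin,
    args = c("CMD", "Rscript", shQuote(script)),
    env = paste0("JAVA_HOME=", shQuote(java_home)),  # set before R starts
    stdout = TRUE, stderr = TRUE
  )
}

# Usage: the child process reports the JAVA_HOME it was started with.
# The path below is a placeholder, not a real JDK location.
script_file <- tempfile(fileext = ".R")
writeLines("cat(Sys.getenv('JAVA_HOME'))", script_file)
print(run_with_java_home(script_file, "/opt/example-jdk"))
file.remove(script_file)
```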

As for your last part - I have seen far too many projects that try to be "helpful" while all they do is cause much bigger problems due to a basic lack of understanding of the pieces involved. I'm not saying it is necessarily the case here, but it is something that is highly frustrating as such "features" are typically far more easily addressed upstream only if someone bothered to signal the desire for them.

To get back to the issue here - it is technically trivial for rJava to load a specific desired libjvm. However, the true complexity lies in determining the correct locations, because it is not standardized across JVMs. Some JVMs require pre-loading of additional libraries or additional library locations. As you noted, on macOS rJava is already using dynamic detection, because of the way JVMs work on macOS. Linux lacks some of the features, but if one could rely on a correct value of JAVA_HOME then it would be possible to use a similar logic there as well.

The main point I was trying to get across here is that the logic is far more complex than you think: you have to either a) use javareconf or b) replicate the logic and setup from javareconf or c) have rJava support dynamic JVM load. In my opinion b) is the least sensible solution as any changes in R will require you to follow suit and update your code. I would go for c) as it is the most reliable and easiest for the 3rd party package - at least as long as we are concerned mainly with JRE (for JDK it gets more complicated).

e-kotov commented 1 week ago

@s-u

you have to set the env vars before the R process is started so you'd be better off creating a startup-script.

I fully understand why that would be the preferred way, but I am sure you can also see how that would affect the experience of an ordinary R user.

I have seen far too many projects that try to be "helpful" while all they do is cause much bigger problems

Thankfully, so far there were no problems and the project is actually helpful (to quote https://github.com/e-kotov/rJavaEnv/issues/6#issuecomment-2183566955 ):

The workshop was very successful with many attendees and almost every one of them was able to reproduce all the code of the workshop. The {rJavaEnv} was really really important. Installing Java is the most "difficult" part of learning {r5r} and many people did not have Java installed. {rJavaEnv} was super handy! There were only two or three people who couldn't get Java installed with {rJavaEnv}

only if someone bothered to signal the desire for them.

I have followed many questions on StackOverflow and issues in {rJava}'s repository that discuss this problem of not being able to reinitialise the JVM in the same R session and, in general, having trouble with the JAVA_HOME path... So I am not sure I can agree that there was no signal. And I think my project can be seen as a form of signalling too 😉. Just in case, this is not sarcasm or me suggesting that {rJava} is doing a bad job. As I said, I followed the conversations and I read about the technical issues that prevent the desired behaviour (such as being able to .jinit() more than once without restart). If my project somehow eventually leads to some upstream changes, and {rJavaEnv} itself becomes obsolete, I would be fine with that too. But how far upstream do you think the change should be? With my limited understanding, it has to be in Java itself, and I do not see that happening just because of these relatively small (compared to the "Java world") issues with Java in our R community. Maybe I am wrong.

Thank you again for pointing out the intricacies of JVM loading in R and suggesting the options. This is very valuable and helpful for changing {rJavaEnv}'s internals in the safest way possible.

I have already taken so much of your time, but I would appreciate if you could elaborate a bit more on option c:

c) have rJava support dynamic JVM load

Do I get it right that this is already supported on macOS and Windows? To quote you from previous replies:

macOS and Windows use dynamic loading of JRI, while other unix system use linking

on macOS rJava is already using dynamic detection, because of the way JVMs work on macOS

From that, I suspect "dynamic JVM load" is what is already happening on macOS and Windows, as my approach works on macOS (mostly, with a few exceptions https://github.com/ipeaGIT/r5r/issues/393#, seemingly, currently on Intel Macs) and Windows systems. But please correct me if I am wrong.

And then, is that something that you as the author of {rJava} could implement for Linux? Or because of:

Linux lacks some of the features, but if one could rely on a correct value of JAVA_HOME then it would be possible to use a similar logic there as well.

it is not possible to do anything about it in {rJava} on Linux?

Meanwhile, I will need some time to process the new information you pointed me to and have a thorough read on the process of JVM loading before I proceed.