Open e-kotov opened 1 month ago
It does seem to be an issue on certain systems. See #8 and https://github.com/ipeaGIT/r5r/issues/393. Testing in https://github.com/e-kotov/r5r-containerized
@s-u, hello, may I please have your opinion on what is happening?
Testing the issue here: https://github.com/e-kotov/r5r-containerized
First, we just install and link Java (by setting environment variables JAVA_HOME and PATH) and try to check Java version using rJava::.jinit(); rJava::.jcall('java.lang.System', 'S', 'getProperty', 'java.version'). We do this by first running a child process, then we also try the same from the current R session where Java environment is already preset.
Here's how this goes:
> java_distr <- rJavaEnv::java_download(21)
Detected platform: linux
Detected architecture: x64
You can change the platform and architecture by specifying the `platform` and `arch` arguments.
Downloading Java 21 (Corretto) for linux x64 to
/home/rstudio/.cache/R/rJavaEnv/distrib/amazon-corretto-21-x64-linux-jdk.tar.gz
File already exists. Skipping download.
> java_home <- rJavaEnv::java_install(java_distr)
Java distribution amazon-corretto-21-x64-linux-jdk.tar.gz already unpacked at
/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
✔ Current R Session: JAVA_HOME and PATH set to /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
✔ Current R Project/Working Directory: JAVA_HOME and PATH set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21' in .Rprofile in '/home/rstudio/.Rprofile'
Java 21 (amazon-corretto-21-x64-linux-jdk.tar.gz) for linux x64 installed at
/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21 and symlinked to
/home/rstudio/rjavaenv/linux/x64/21
>
> print(java_home)
[1] "/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21"
>
> rJavaEnv::java_check_version_cmd() #
JAVA_HOME: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
Java path: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21/bin/java
Java version: "openjdk version \"21.0.3\" 2024-04-16 LTS OpenJDK Runtime Environment Corretto-21.0.3.9.1
(build 21.0.3+9-LTS) OpenJDK 64-Bit Server VM Corretto-21.0.3.9.1 (build 21.0.3+9-LTS, mixed mode,
sharing)"
[1] TRUE
> rJavaEnv::java_check_version_rjava() # the internals of the function are basically below
Using current session's JAVA_HOME: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
With the user-specified JAVA_HOME rJava and other rJava/Java-based packages will use Java version:
"21.0.3"
[1] TRUE
>
> Sys.getenv("JAVA_HOME")
[1] "/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21"
>
> # Check Java version if JAVA_HOME
> # is set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
> # in a separate (but child) R process
> r_script <- "
+ tryCatch({
+ java_home <- '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
+ Sys.setenv(JAVA_HOME = java_home)
+ old_path <- Sys.getenv('PATH')
+ new_path <- file.path(java_home, 'bin')
+ Sys.setenv(PATH = paste(new_path, old_path, sep = .Platform$path.sep))
+ suppressWarnings(rJava::.jinit())
+ suppressWarnings(java_version <- rJava::.jcall('java.lang.System',
+ 'S', 'getProperty', 'java.version'))
+ message <- cli::format_message('rJava and other rJava/Java-based packages will use Java version: {.val {java_version}}')
+ print(message)
+ }, error = function(e) {
+ message <- cli::format_message('An error occurred: {.val {e$message}}')
+ print(message)
+ })
+ "
>
>
> script_file <- tempfile(fileext = ".R")
> writeLines(r_script, script_file)
>
> output <- system2("Rscript", args = script_file, stdout = TRUE, stderr = TRUE)
>
> cat(output, sep = "\n")
[1] "rJava and other rJava/Java-based packages will use Java version: \"21.0.3\""
> file.remove(script_file)
[1] TRUE
>
>
> # Check Java version if JAVA_HOME
> # is set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
> # in the current R session
> rJava::.jinit()
> rJava::.jcall("java.lang.System", "S", "getProperty", "java.version")
[1] "11.0.23"
>
So if we do not run sudo R CMD javareconf when building the container, the child process detects Java 21, and the current process detects Java 11 (see the last three commands and their output). The question is - why? I am especially surprised that the child process gets it right while the current R session, despite correctly set JAVA_HOME and PATH, still picks up the system Java 11...
Now we are getting smarter and right after installing {rJavaEnv} we set the environment in the container to the correct new v21 Java and run R CMD javareconf.
> java_distr <- rJavaEnv::java_download(21)
Detected platform: linux
Detected architecture: x64
You can change the platform and architecture by specifying the `platform` and `arch` arguments.
Downloading Java 21 (Corretto) for linux x64 to
/home/rstudio/.cache/R/rJavaEnv/distrib/amazon-corretto-21-x64-linux-jdk.tar.gz
File already exists. Skipping download.
> java_home <- rJavaEnv::java_install(java_distr)
Java distribution amazon-corretto-21-x64-linux-jdk.tar.gz already unpacked at
/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
✔ Current R Session: JAVA_HOME and PATH set to /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
✔ Current R Project/Working Directory: JAVA_HOME and PATH set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21' in .Rprofile in '/home/rstudio/.Rprofile'
Java 21 (amazon-corretto-21-x64-linux-jdk.tar.gz) for linux x64 installed at
/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21 and symlinked to
/home/rstudio/rjavaenv/linux/x64/21
>
> print(java_home)
[1] "/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21"
>
> rJavaEnv::java_check_version_cmd() #
JAVA_HOME: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
Java path: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21/bin/java
Java version: "openjdk version \"21.0.3\" 2024-04-16 LTS OpenJDK Runtime Environment
Corretto-21.0.3.9.1 (build 21.0.3+9-LTS) OpenJDK 64-Bit Server VM Corretto-21.0.3.9.1 (build
21.0.3+9-LTS, mixed mode, sharing)"
[1] TRUE
> rJavaEnv::java_check_version_rjava() # the internals of the function are basically below
Using current session's JAVA_HOME: /home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21
With the user-specified JAVA_HOME rJava and other rJava/Java-based packages will use Java version:
"21.0.3"
[1] TRUE
>
> Sys.getenv("JAVA_HOME")
[1] "/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21"
>
> # Check Java version if JAVA_HOME
> # is set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
> # in a separate (but child) R process
> r_script <- "
+ tryCatch({
+ java_home <- '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
+ Sys.setenv(JAVA_HOME = java_home)
+ old_path <- Sys.getenv('PATH')
+ new_path <- file.path(java_home, 'bin')
+ Sys.setenv(PATH = paste(new_path, old_path, sep = .Platform$path.sep))
+ suppressWarnings(rJava::.jinit())
+ suppressWarnings(java_version <- rJava::.jcall('java.lang.System',
+ 'S', 'getProperty', 'java.version'))
+ message <- cli::format_message('rJava and other rJava/Java-based packages will use Java version: {.val {java_version}}')
+ print(message)
+ }, error = function(e) {
+ message <- cli::format_message('An error occurred: {.val {e$message}}')
+ print(message)
+ })
+ "
>
>
> script_file <- tempfile(fileext = ".R")
> writeLines(r_script, script_file)
>
> output <- system2("Rscript", args = script_file, stdout = TRUE, stderr = TRUE)
>
> cat(output, sep = "\n")
[1] "rJava and other rJava/Java-based packages will use Java version: \"21.0.3\""
> file.remove(script_file)
[1] TRUE
>
>
> # Check Java version if JAVA_HOME
> # is set to '/home/rstudio/.cache/R/rJavaEnv/installed/linux/x64/21'
> # in the current R session
> rJava::.jinit()
> rJava::.jcall("java.lang.System", "S", "getProperty", "java.version")
[1] "21.0.3"
>
Now the Java versions match.
This problem does not happen on my aarch64 macOS, but it did happen to a person with an Intel-based Mac, documented here: https://github.com/ipeaGIT/r5r/issues/393. Also, this does not seem to be an issue on Windows.
Changing env vars like PATH or JAVA_HOME manually is not supported and will lead to chaos as it will break the entire setup of other variables. Also note that Rscript cannot be invoked directly without R since it bypasses the Java configuration, so use R CMD Rscript if you want to use Java inside Rscript. I have no idea what rJavaEnv is doing, but the code above looks highly illegal.
The whole point of javareconf is to automatically detect all settings and set all env vars correctly during R start-up such that they work with the JDK/JRE present at the time of invocation (for details see $R_HOME/etc/javaconf and $R_HOME/etc/ldpaths). If you want to manually override the R behavior, you're on your own and you have to make sure you set all the necessary environment variables (they are OS-dependent and so is the whole Java start-up procedure: macOS and Windows use dynamic loading of JRI, while other Unix systems use linking). Some variables must be set before the R process is started, so they cannot be changed at run-time (which is why javareconf is used).
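To see what javareconf recorded on a given system, the two files mentioned above can be inspected directly. The following is a minimal shell sketch, assuming a Unix-like R installation where those files exist; show_java_conf is a hypothetical helper name, not part of R or rJava.

```shell
#!/bin/sh
# Hypothetical helper: print the Java-related lines found in the two files
# named above (etc/javaconf and etc/ldpaths) of an R installation.
# The single argument is the R home directory.
show_java_conf() {
  r_home="$1"
  for f in "$r_home/etc/javaconf" "$r_home/etc/ldpaths"; do
    if [ -f "$f" ]; then
      echo "== $f"
      # Keep only lines that mention JAVA; do not fail if there are none.
      grep 'JAVA' "$f" || true
    fi
  done
}

# Usage (assumes R is on PATH; `R RHOME` prints the R home directory):
#   show_java_conf "$(R RHOME)"
```

Comparing this output with the JAVA_HOME reported inside a running session is one way to spot a mismatch like the Java 11 versus Java 21 case in the transcript.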
@s-u thank you for having a look! {rJavaEnv} aims to provide users with a Java installation in one simple command, allowing them to focus on the Java-dependent R package they are using for their project without dealing with Java itself.
I understand that changing environment variables like PATH and JAVA_HOME inside the R session is a bit of a hack, but it's necessary to simplify things for the user. Behind the scenes, {rJavaEnv} overrides these environment variables (both in the current session and by adding code to do so in .Rprofile). This approach works well enough on most setups I have tested, allows users to run the Java-dependent packages, and spares them from having to worry about javareconf.
It seems that system linking versus dynamic loading is what makes things work on macOS and Windows but causes issues in a Linux container. I will also check for other environment variable differences between the setups that work and those that do not.
@e-kotov Please read the above comment - I would strongly suggest that you familiarise yourself with the concepts involved here first - you should really understand how rJava works before you can start thinking about hacking it to make it do things that it was not designed to do. You cannot change the JDK from a running process, because the link paths are set by the executable at launch time (see LD_LIBRARY_PATH on Linux) - that's what the R shell script does - it sets up the environment such that the actual exec/R process will load JNI libraries from the location specified by javareconf (see the files quoted above to understand how it works). You can only change the linking path for future R processes that you start after any change to the environment vars, but not for the current process (there is a "hacky" way involving masking dynamic libraries, but that is apparently way beyond the scope here).
In addition, modifying .Rprofile is a really, really bad idea and is illegal on CRAN (for very good reason). I hope you are not trying to change PATH or JAVA_HOME in .Rprofile as that will result in a broken R installation.
Finally, there is a very good reason why R has a global Java setting that is used at start time: the JVM can only be loaded once, so you cannot have two packages loading different JVM versions.
@s-u thank you for more detailed explanations of why my current approach raises concerns. I did not have time yet to read more on the technical details you are referring to, but that is definitely on my to-do list, as I want the package to succeed.
I am not changing the global .Rprofile in the home user directory (even though there is a request to add this feature, #6), but only adding lines to the current working/project directory .Rprofile, in the same way renv does. So I am not sure if you meant only the global one, or also the .Rprofile in the current working/project directory. If you referred to the latter too, then I am surprised to hear that it is "illegal on CRAN", as renv does that. So I assume you are only referring to the global .Rprofile.
I know very well about the issue of not being able to re-initialise Java without restarting the R session. I have experienced it myself multiple times, and this is one of the key reasons I started developing rJavaEnv. I have to (in my opinion) resort to some hacky approaches to make things simpler for the user while not breaking their system. This is why if you read the rJavaEnv code (I'm not suggesting that you do, I'm sure you don't have time for this), you will see that it tries to do things with as little intervention on the user's system as possible:
- java_download() puts Java distributions into the package cache folder, so that it is cleared if the user deletes the package: https://github.com/e-kotov/rJavaEnv/blob/25f28e93eb4498028cd03436a1aa2098e39a9a31/R/java_download.R#L21
- java_install() sets up Java also in the package cache: https://github.com/e-kotov/rJavaEnv/blob/25f28e93eb4498028cd03436a1aa2098e39a9a31/R/java_install.R#L36-L37
- java_install() links the Java bin folder to the current project subdirectory (like renv does): https://github.com/e-kotov/rJavaEnv/blob/25f28e93eb4498028cd03436a1aa2098e39a9a31/R/java_install.R#L86
- java_env_set() by default writes to the project/current directory .Rprofile to ensure JAVA_HOME and PATH are set at startup before a user has a chance to run .jinit().
- java_env_set() clearly marks all lines it adds to the .Rprofile with a #rJavaEnv comment, so that (1) if something goes wrong, a user can find these lines and delete them manually, (2) java_env_unset() can reliably remove only rJavaEnv's own lines and keep the rest of the .Rprofile intact. A better approach could be to only add one line, like renv does, that calls another R script that does the rest of the job of setting the environment variables. https://github.com/e-kotov/rJavaEnv/blob/25f28e93eb4498028cd03436a1aa2098e39a9a31/R/java_env.R#L76-L84
- java_check_version_rjava() uses a separate child R process to check if JAVA_HOME set in the current session (or whichever path is specified as an argument) would make R pick up the Java version required by a particular package, such as r5r. This is done exactly because any other way would force the user to restart R if they fail to install the correct Java version and do .jinit().
- Using java_env_set() with the where = "session" argument allows using it in the same project in different R scripts, if these are executed in independent R processes, for example by targets.
And from my point of view, however hacky all that seems, if it works, solves the user's problem, and does not break things (which so far it does not seem to do), it is worth doing.
To conclude, I will look into the issue that is coming up with my current approach on Linux and see what the ways to address it are. Perhaps instead of more hacking, it will be a simple console message to the user, informing them to run R CMD javareconf.
I suspect you see this project as amateur (which it is, I'm not a software engineer) and scary ("illegal on CRAN") from the perspective of a person with much more experience than me, but I hope you can also appreciate the pain of the users who try to configure Java to just get to the analysis part, instead of having to deal with technical issues. And if rJavaEnv helps with that, I would say it is worth continuing to work on. You could argue that users just need to read the documentation. But we all know that it is hard to read all the documentation on every little bit of the software you are using. I will definitely read more about how Java interacts with R before I proceed.
@e-kotov Yes, I was referring to ~/.Rprofile - if you are talking about a separate directory for a separate process then that's fine. However, for reasons explained above .Rprofile is way too late - you have to set the env vars before the R process is started so you'd be better off creating a startup-script.
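As an illustration of the startup-script idea, here is a minimal shell sketch that sets JAVA_HOME and PATH in the environment of a new process before it starts. run_with_java is a hypothetical name and the path in the usage line is only an example. Whether rJava then loads the intended JVM still depends on what javareconf recorded for that R installation, so treat this as a sketch rather than a complete solution.

```shell
#!/bin/sh
# Hypothetical launcher sketch: run a command (normally R) with JAVA_HOME and
# PATH already present in its environment at launch time, instead of changing
# them later from .Rprofile.
run_with_java() {
  java_home="$1"
  shift
  # `env` modifies the environment of the child process only; the calling
  # shell keeps its own JAVA_HOME and PATH.
  env JAVA_HOME="$java_home" PATH="$java_home/bin:$PATH" "$@"
}

# Usage (illustrative path):
#   run_with_java "$HOME/.cache/R/rJavaEnv/installed/linux/x64/21" R
```

Because the variables are set for the child only, nothing changes in the calling shell or in any R session that is already running.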
As for your last part - I have seen far too many projects that try to be "helpful" while all they do is cause much bigger problems due to a basic lack of understanding of the pieces involved. I'm not saying it is necessarily the case here, but it is something that is highly frustrating, as such "features" are typically far more easily addressed upstream, if only someone bothered to signal the desire for them.
To get back to the issue here - it is technically trivial for rJava to load a specific desired libjvm. However, the true complexity lies in determining the correct locations, because it is not standardized across JVMs. Some JVMs require pre-loading of additional libraries or additional library locations. As you noted, on macOS rJava is already using dynamic detection, because of the way JVMs work on macOS. Linux lacks some of the features, but if one could rely on a correct value of JAVA_HOME then it would be possible to use a similar logic there as well.
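To make the "correct locations" problem concrete, here is a small shell sketch that probes a JAVA_HOME for libjvm. find_libjvm is a hypothetical helper, and the candidate list is an assumption covering a few common layouts (newer JDKs under lib/server, older ones under jre/lib); it is not the logic rJava or javareconf uses, and it ignores JVMs that need extra libraries pre-loaded.

```shell
#!/bin/sh
# Hypothetical helper: look for libjvm under a given JAVA_HOME by trying a few
# directory layouts. The list is illustrative and deliberately incomplete.
find_libjvm() {
  java_home="$1"
  for rel in \
    lib/server/libjvm.so \
    lib/server/libjvm.dylib \
    jre/lib/amd64/server/libjvm.so \
    jre/lib/server/libjvm.dylib
  do
    if [ -f "$java_home/$rel" ]; then
      # Print the first match and stop.
      echo "$java_home/$rel"
      return 0
    fi
  done
  # Nothing found in any of the known layouts.
  return 1
}

# Usage: find_libjvm "$JAVA_HOME" || echo "libjvm not found in known layouts"
```

Even this short list shows why a fixed path cannot be assumed: the location depends on the JDK version, the platform and, for older releases, the architecture.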
The main point I was trying to get across here is that the logic is far more complex than you think: you have to either a) use javareconf or b) replicate the logic and setup from javareconf or c) have rJava support dynamic JVM load. In my opinion b) is the least sensible solution as any changes in R will require you to follow suit and update your code. I would go for c) as it is the most reliable and easiest for the 3rd party package - at least as long as we are concerned mainly with JRE (for JDK it gets more complicated).
@s-u
you have to set the env vars before the R process is started so you'd be better off creating a startup-script.
I fully understand why that would be the preferred way, but I am sure you can also see how that would affect the experience of an ordinary R user.
I have seen far too many projects that try to be "helpful" while all they do is cause a much bigger problems
Thankfully, so far there were no problems and the project is actually helpful (to quote https://github.com/e-kotov/rJavaEnv/issues/6#issuecomment-2183566955 ):
The workshop was very successful with many attendees and almost everyone of them were able reproduce all the code of the workshop. The {rJavaEnv} as really really important. Installing Java is the most "difficult" part of learning {r5r} and many people did not have Java installed. {rJavaEnv} was super handy! There were only two or three people who couldn't get Java installed with {rJavaEnv}
only if someone bothered to signal the desire for them.
I have followed many issues on StackOverflow and Issues in {rJava}'s repository that discuss this issue of not being able to reinitialise JVM twice in the same R session and in general having issues with JAVA_HOME path... So I am not sure I can agree that there was no signal. And I think my project can be seen as a form of signalling too 😉. Just in case, this is not sarcasm or me suggesting that {rJava} is doing a bad job. As I said, I followed the conversations and I read about the technical issues that prevent the desired behaviour (such as being able to .jinit() more than once without restart). If my project somehow eventually leads to some upstream changes, and {rJavaEnv} itself becomes obsolete, I would be fine with that too. But how upstream do you think the change should be? With my limited understanding, it has to be in Java itself, and I do not see that happening just because of these relatively small (compared to the "Java world") issues with Java in our R community. Maybe I am wrong.
Thank you again for pointing out to me the intricacies of JVM loading in R and suggesting the options, this is very valuable and helpful for changing {rJavaEnv}'s internals in the safest way possible.
I have already taken so much of your time, but I would appreciate if you could elaborate a bit more on option c:
c) have rJava support dynamic JVM load
Do I get it right that this is already supported on macOS and Windows? To quote you from previous replies:
macOS and Windows use dynamic loading of JRI, while other unix system use linking
on macOS rJava is already using dynamic detection, because of the way JVMs work on macOS
From that, I suspect "dynamic JVM load" is what is already happening on macOS and Windows, as my approach works on macOS (mostly, with a few exceptions, currently seemingly on Intel Macs: https://github.com/ipeaGIT/r5r/issues/393) and Windows systems. But please correct me if I am wrong.
And then, is that something that you as the author of {rJava} could implement for Linux? Or because of:
Linux lacks some of the features, but if one could rely on a correct value of JAVA_HOME then it would be possible to use a similar logic there as well.
it is not possible to do anything about it in {rJava} on Linux?
Meanwhile, I will need some time to process the new information you pointed me to and have a thorough read on the process of JVM loading before I proceed.
"To configure R to point to the desired Java installation, use the
R CMD javareconf command." Source: https://solutions.posit.co/envs-pkgs/using-rjava/#reconfigure-r
How to address that in the package? Do we even care about that, as long as we are not compiling any Java software or Java-dependent packages from source? Is the package going to meet the needs of 99% of its potential users without managing this R CMD javareconf issue, which will probably require an R session restart to take effect?