conda-forge / maven-feedstock

A conda-smithy repository for maven.
BSD 3-Clause "New" or "Revised" License

Set maven local repository to be within CONDA_PREFIX #22

Open mkitti opened 1 year ago

mkitti commented 1 year ago

Solution to issue cannot be found in the documentation.

Issue

Currently, if a user runs maven, the default local repository is <localRepository>${user.home}/.m2/repository</localRepository>

https://maven.apache.org/settings.html#settings-details

Rather, the repository should live in the CONDA_PREFIX. I propose the following location:

<localRepository>${env.CONDA_PREFIX}/opt/maven/repository</localRepository>

This could be added to opt/maven/conf/settings.xml.

Having the local repository in the CONDA_PREFIX would allow us to create packages that populate the repository with maven packages.

We could add the user's ${user.home}/.m2/repository as an internal repository.

Installed packages

# packages in environment at /Users/kittisopikulm/miniforge3-arm64/envs/mvntest2:
#
# Name                    Version                   Build  Channel
libcxx                    16.0.3               h4653b0c_0    conda-forge
libzlib                   1.2.13               h03a7124_4    conda-forge
maven                     3.9.2                hce30654_1    <unknown>
tree                      2.1.0                h1a8c8d9_0    conda-forge
zstd                      1.5.2                hf913c23_6    conda-forge

Environment info

active environment : mvntest2
    active env location : /Users/kittisopikulm/miniforge3-arm64/envs/mvntest2
            shell level : 2
       user config file : /Users/kittisopikulm/.condarc
 populated config files : /Users/kittisopikulm/miniforge3-arm64/.condarc
          conda version : 4.14.0
    conda-build version : not installed
         python version : 3.10.6.final.0
       virtual packages : __osx=13.3.1=0
                          __unix=0=0
                          __archspec=1=arm64
       base environment : /Users/kittisopikulm/miniforge3-arm64  (writable)
      conda av data dir : /Users/kittisopikulm/miniforge3-arm64/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /Users/kittisopikulm/miniforge3-arm64/pkgs
                          /Users/kittisopikulm/.conda/pkgs
       envs directories : /Users/kittisopikulm/miniforge3-arm64/envs
                          /Users/kittisopikulm/.conda/envs
               platform : osx-arm64
             user-agent : conda/4.14.0 requests/2.28.1 CPython/3.10.6 Darwin/22.4.0 OSX/13.3.1
                UID:GID : 503:20
             netrc file : None
           offline mode : False
mkitti commented 1 year ago

I propose we install this in $CONDA_PREFIX/opt/maven/conf/settings.xml:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
  <localRepository>${env.CONDA_PREFIX}/opt/maven/repository</localRepository>
  <profiles>
    <profile>
      <id>conda-user-home</id>
      <activation>
        <activeByDefault>true</activeByDefault>
      </activation>
      <repositories>
        <repository>
          <id>userHome</id>
          <name>User Home Repository</name>
          <releases>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
            <checksumPolicy>warn</checksumPolicy>
          </releases>
          <snapshots>
            <enabled>true</enabled>
            <updatePolicy>never</updatePolicy>
            <checksumPolicy>warn</checksumPolicy>
          </snapshots>
          <url>file://${user.home}/.m2/repository</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>
mkitti commented 1 year ago

@conda-forge/maven Is anyone closely invested in maven keeping its settings.localRepository at ${user.home}/.m2/repository, or in anything else about the package, before I charge ahead?

mkitti commented 1 year ago

You can override this and change it back to your HOME local repository by putting this in your ~/.m2/settings.xml:

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 https://maven.apache.org/xsd/settings-1.0.0.xsd">
  <localRepository>${user.home}/.m2/repository</localRepository>
</settings>
kephale commented 1 year ago

@mkitti most of my usage these days would really benefit from defaulting to keeping the m2 repo within the conda environment, so this is awesome. I suspect @frauzufall will be interested in this as well.

I guess most folks who are doing day-to-day Java aren't using conda's maven to drive their IDE builds, so they just have to either deal with adjusting the path or duplicate files if they need to use this packaging of maven.

Will this help make it trivial to package conda environments with pre-loaded local maven repos?

mkitti commented 1 year ago

@kephale thanks for the feedback. What I'm really missing at the moment is the ability to declare dependencies between non-Java and Java components.

One of the biggest non-Java components is the OpenJDK itself. Other components are things like compression codecs, Python, HDF5, etc.

I am also thinking of still using JavaCPP to build Java Native Interface bindings for Java before the foreign API.

ctrueden commented 1 year ago

I have mixed feelings about this change. It is probably correct, in that: A) conda environments are supposed to stay as encapsulated as possible; and B) the mvn command is not considered multi-process-safe when using the same local repo cache.

However, there is a major downside: huge network traffic and wait time and disk usage increase when using multiple conda environments. And we will use multiple environments in our community: my plan is for Appose + conda to serve as connecting tissue between Fiji plugins that leverage otherwise-incompatible codebases. If you make this change, users will be waiting a lot more, needlessly IMHO, to download the same JARs repeatedly.

Asking users to configure their settings.xml is IMHO not acceptable, since 99+% of people will use the defaults we provide in this context.

I am considering enhancing jgo to: A) use cjdk or install-jdk to download JDKs on demand; and B) use mvnw or some other Maven-bootstrapper to get Maven installed. Once that works, scyjava wouldn't need to depend on the openjdk nor maven packages in conda-forge anymore. So, if you need to make this change, I won't fight too hard, but I will change jgo so the change becomes irrelevant to Fiji/ImageJ's Appose+conda-based logic. Unless I am missing something here...?

mkitti commented 1 year ago

However, there is a major downside: huge network traffic and wait time and disk usage increase when using multiple conda environments. And we will use multiple environments in our community: my plan is for Appose + conda to serve as connecting tissue between Fiji plugins that leverage otherwise-incompatible codebases. If you make this change, users will be waiting a lot more, needlessly IMHO, to download the same JARs repeatedly.

Why would multiple conda environments create huge network traffic?

  1. In the use case where one has installed maven into a conda environment, then one likely is only pulling in conda packages which are populating the conda environment. Conda will use the conda pkgs cache to populate the maven repository via hard links. There would be no additional disk usage or network usage in this case. The primary need here is when someone is trying to use conda to install a conda packaged Java dependency that has a non-Java dependency.
  2. As configured above, I've also added the user's home maven repository as an "internal" remote repository. In this case, maven could populate the conda maven repository from the user's home maven repository if the packages already exist there.
  3. jgo points its own maven repository at the user home repository and has its own configuration file at the moment. https://github.com/scijava/jgo/blob/2d98c803a3d30cc286876a6cb750b3fbc73dfb83/src/jgo/jgo.py#L433
mkitti commented 1 year ago

Will this help make it trivial to package conda environments with pre-loaded local maven repos?

Yes, this is the primary advantage of doing this. We can use conda packages to populate a maven repository as well as their needed external dependencies simultaneously.

ctrueden commented 1 year ago

My concern was with each environment's copy of Maven pulling in its own copy of all JAR files requested. Unlike conda packages, these would not be hard linked. This seems wasteful, especially compared to how conda packages are handled by conda itself. Would it not be ideal if release versions of Maven artifacts were downloaded and cached in one single place independently of environment, analogously to how conda uses pkgs for its package cache?

That said, you are right that for the Appose use case I outlined above—Java main process + Python-inside-conda child processes—this should in fact not be an issue, because in the typical case, Fiji would not be creating environments that include openjdk nor maven.

Unfortunately, at the moment, since we haven't finished solving named shared memory from Java yet, the demo I made back in February uses a Python parent process with embedded Java via PyImageJ, and a Python child process with embedded Java via PyImageJ, and each of these embedded Javas uses jgo to load ImageJ2. With the change you are proposing, I was concerned that multiple copies of ImageJ2 would be downloaded, which would be suboptimal. However, you are also correct that jgo explicitly sets M2_REPO to ~/.m2/repository by default, which I forgot about, so maybe there are no problematic cases for my applications after all.

Appose also supports Java main + Java children, as well as Python main + Java children, but for these cases the issue may also be moot: the way Appose is currently coded, Java children are invoked via Groovy, and dependencies are pulled down by Groovy @Grab annotations, which typically stores JARs into ~/.groovy/grapes IIRC. These Java children would not even need to live inside conda environments, so this change would also not affect these cases.

My only remaining concern then is that it sounds like you are wanting to move toward packaging Java JAR files as conda packages? I think I already aired my opinion on this, but I think that is a big can of worms that should probably not be opened if it can be avoided. I haven't seen a case where you need to do that. For things like libblosc, you can ship the native libs via conda, and then load them from Java using System.loadLibrary without needing the Java part to be packaged in conda. I can appreciate the elegance of having that Java code packaged in conda and depending on blosc, but it seems like way more trouble than it's worth, given that it is technically feasible to address the dependency without doing the packaging of Java code in this way. Just ship an environment.yml with your Java code and let Appose (or whatever) construct the environment for the needed natives and call it good.

TL;DR: Sorry for the (mostly) noise.

mkitti commented 1 year ago

There is some chance that we may need to patch some Java packages at build time to function properly within a Conda environment depending on their native library loading mechanism.

For example, consider the case of JBlosc and the need for an HDF5 Blosc plugin for JHDF5 and JavaCPP-HDF5. Preferably, these would all need access to a common Blosc library and a common HDF5 library. Having more than one of these loaded into a single process can be problematic. Rather, in some cases we might want to embed a configuration for all of these to use the libraries installed by conda instead of vendoring the libraries from within the JAR files. However, each uses an independent mechanism to locate these libraries. Within FIJI, we do have a mechanism to point them at a common library.

In this case, where we have Java code specialized for a conda environment, we would want those packages isolated within the conda prefix. We may also want those to be accessible to maven within that same conda environment.

mkitti commented 1 year ago

Would it not be ideal if release versions of Maven artifacts were downloaded and cached in one single place independently of environment, analogously to how conda uses pkgs for its package cache?

Note that there are two pieces of XML I have posted above.

The first configures the local repository where maven will look for local packages.

<localRepository>${env.CONDA_PREFIX}/opt/maven/repository</localRepository>

The second part is to tell maven to look at the user's home repository as well.

      <repositories>
        <repository>
          <id>userHome</id>
          <name>User Home Repository</name>
          <releases>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
            <checksumPolicy>warn</checksumPolicy>
          </releases>
          <snapshots>
            <enabled>true</enabled>
            <updatePolicy>never</updatePolicy>
            <checksumPolicy>warn</checksumPolicy>
          </snapshots>
          <url>file://${user.home}/.m2/repository</url>
        </repository>
      </repositories>
ctrueden commented 1 year ago

The second part is to tell maven to look at the user's home repository as well.

This is a clever hack, but it does not cause the conda mvn to actually store newly downloaded things into a common location. It will only prevent re-download of things that the user already has in their user directory. Most users will not have things there, so this will not save download bandwidth except for developers who are doing builds outside of conda.

The first configures the local repository where maven will look for local packages.

Not only where maven will look (read), but also where it will cache (write) them.

mkitti commented 1 year ago

I'm starting to wonder whether it would make sense to invert the two repositories.

  1. The only thing that writes to $CONDA_PREFIX/opt/maven/repository is conda.
  2. If mvn is installed by conda, then it will use $CONDA_PREFIX/opt/maven/repository as a repository for reading.
  3. If conda-forge package builders use mvn during build, they should configure mvn to write to $CONDA_PREFIX/opt/maven/repository.
  4. If the end user uses mvn, then it will by default use ${user.home}/.m2/repository.
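
A minimal sketch of what this inverted configuration might look like in $CONDA_PREFIX/opt/maven/conf/settings.xml. It reuses the elements from the settings.xml posted earlier in this thread but swaps the roles: localRepository is left unset, so mvn keeps reading from and writing to ${user.home}/.m2/repository, and the directory inside the conda prefix is added as a file:// repository that mvn only reads from. The profile id conda-prefix and the repository id condaPrefix are hypothetical names chosen for illustration, and the sketch has not been tested against a real conda-forge build.

```xml
<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
  <!-- No localRepository element: Maven falls back to its default,
       ${user.home}/.m2/repository, for both reads and writes. -->
  <profiles>
    <profile>
      <!-- Hypothetical profile id, not taken from the feedstock. -->
      <id>conda-prefix</id>
      <activation>
        <activeByDefault>true</activeByDefault>
      </activation>
      <repositories>
        <repository>
          <!-- Hypothetical repository id, not taken from the feedstock. -->
          <id>condaPrefix</id>
          <name>Conda Prefix Repository</name>
          <releases>
            <enabled>true</enabled>
            <!-- Assumption: only conda changes this directory, so Maven
                 does not need to re-check it for updated releases. -->
            <updatePolicy>never</updatePolicy>
            <checksumPolicy>warn</checksumPolicy>
          </releases>
          <snapshots>
            <!-- Assumption: conda packages ship release artifacts only. -->
            <enabled>false</enabled>
          </snapshots>
          <!-- Maven treats this like any other remote repository and
               copies resolved artifacts into the local repository. -->
          <url>file://${env.CONDA_PREFIX}/opt/maven/repository</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>
```

Under this sketch, a recipe that runs mvn at build time (point 3) would have to opt in to writing into the prefix, for example by pointing Maven's maven.repo.local user property at the build prefix. How the feedstock would wire that up is left open here.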

ctrueden commented 1 year ago

@mkitti Nice idea! That could be really effective at achieving your goal of shipping some JARs with conda packages, while minimizing download duplication during normal Maven usage.