JetBrains / intellij-platform-gradle-plugin

Gradle plugin for building plugins for IntelliJ-based IDEs
https://plugins.jetbrains.com/docs/intellij/gradle-prerequisites.html
Apache License 2.0

Multiple copies of IDE distribution in Gradle cache + reindexing problem #1601

Open chashnikov opened 6 months ago

chashnikov commented 6 months ago

What happened?

I migrated a project to IntelliJ Platform Gradle Plugin 2.0 and specified intellijIdeaUltimate("2024.1") in build.gradle.kts and didn't change it. However, after a few hours of work, I found that the Gradle cache contains six copies of IDEA Ultimate distribution:

nik@nik-workstation:~/.gradle/caches/transforms-3$ find -name ideaIU-2024.1
./e0d45b10a0ea56c67b8eef1a3248a586/transformed/ideaIU-2024.1
./d89f9fc4990ceec8ab9fc252d20afe96/transformed/ideaIU-2024.1
./80050e3b4d7632fe5e49c5042913887a/transformed/ideaIU-2024.1
./7167bc64f2f585d0ea26534a74c92e9a/transformed/ideaIU-2024.1
./b91bb87b6350df68a77c91dc42a6215e/transformed/ideaIU-2024.1
./172cec470fbf4c2fa3c0a0f57e6b823c/transformed/ideaIU-2024.1

Since each distribution occupies 3.3 GB, this wastes quite a lot of space.
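A quick way to quantify the problem is to walk the transforms cache and group the extracted distributions by name. The sketch below assumes the [HASH]/transformed/<name> layout shown in the find output above; the transforms-3 path in main() is only an example and may differ between Gradle versions.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class TransformsCacheScanner {

    /** Maps a distribution name (e.g. "ideaIU-2024.1") to the hash dirs containing a copy. */
    public static Map<String, List<Path>> findCopies(Path transformsRoot) throws IOException {
        Map<String, List<Path>> copies = new TreeMap<>();
        try (DirectoryStream<Path> hashDirs = Files.newDirectoryStream(transformsRoot)) {
            for (Path hashDir : hashDirs) {
                // each transform output lives in [HASH]/transformed/<name>
                Path transformed = hashDir.resolve("transformed");
                if (!Files.isDirectory(transformed)) continue;
                try (DirectoryStream<Path> contents = Files.newDirectoryStream(transformed)) {
                    for (Path dist : contents) {
                        copies.computeIfAbsent(dist.getFileName().toString(), k -> new ArrayList<>())
                              .add(hashDir);
                    }
                }
            }
        }
        return copies;
    }

    public static void main(String[] args) throws IOException {
        // example location; adjust transforms-3/transforms-4 to your Gradle version
        Path root = Paths.get(System.getProperty("user.home"), ".gradle", "caches", "transforms-3");
        findCopies(root).forEach((name, dirs) ->
                System.out.println(name + ": " + dirs.size() + " copies"));
    }
}
```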

Steps to reproduce

Not sure when exactly the new copies are created.

Gradle IntelliJ Plugin version

2.0.0-SNAPSHOT

Gradle version

8.5

Operating System

Linux

hsz commented 6 months ago

I'm aware of this problem and am investigating the root cause. The expected behavior is that Gradle reuses the already transformed artifact even if the build configuration changes.

JojOatXGME commented 5 months ago

I encountered the same problem. I ended up with 197 GiB worth of JetBrains IDEs in ~/.gradle/caches/transforms-4 after only a few days of experimenting with 2.0.

PS: #1639 looks like a duplicate of this issue.

I am using convention plugins; maybe that is part of the problem? I could imagine that Gradle uses the classpath of the plugin which defines the transformer as part of the hash. After all, the transformer might have changed between two versions of the plugin. With convention plugins, the transformer effectively becomes part of the convention plugin, which would mean Gradle has to re-run the transformer every time I change the build logic in my convention plugin. But that is just a guess.

Undin commented 4 months ago

@hsz any updates on this (I've read https://github.com/JetBrains/intellij-platform-gradle-plugin/issues/1639#issuecomment-2149097095)?

hsz commented 4 months ago

The Kotlin script I've created for purging extracted IntelliJ Platform archives from the Gradle transforms cache:

https://gist.github.com/hsz/0fc45e1a6fc9ef73d4e4f5960058bded

hsz commented 3 months ago

Hello, Jonathan! I am very well aware of this problem. Let me explain to you the state of the current implementation:

Adding the IntelliJ Platform dependency to the project resolves it from the IntelliJ Maven repository or the CDN (download.jetbrains.com). Both sources provide the IntelliJ Platform as an archive: Maven serves a ZIP, while the CDN serves a DMG, ZIP, or TAR.GZ depending on your OS. Gradle fetches the archive into its cache directory, like ~/.gradle/caches/modules-2/files-2.1/....

The Gradle IntelliJ Plugin 1.x extracts the content next to the archive, polluting the cache, but it works. With the IntelliJ Platform Gradle Plugin 2.0, I decided to do it correctly and use the artifact transform mechanism provided by Gradle. The extracted content goes into a dedicated ~/.gradle/caches/transforms-4/[HASH] location, and the dependency can be correctly formed using native Gradle features.

After the implementation, it turned out that Gradle calculates [HASH] from the project's build classpath. This means the hash changes whenever you update the IntelliJ Platform Gradle Plugin or any other Gradle plugin in your project (or have a local buildSrc setup that changes). As a side effect, the IntelliJ Platform archive gets extracted again, the cache grows, and the IDE reindexes the IntelliJ Platform dependency again.

I already had a chat with the Gradle folks about this issue, but there's no solution that lets us keep relying on the artifact transform feature. I'm currently trying to figure out another approach; most likely, we'll introduce a custom cache location, such as ~/.intellijPlatform/ides/, to keep just a single copy of the extracted IDE. Keep your fingers crossed!


hsz commented 3 months ago

Unfortunately, it is impossible to extract the resolved IntelliJ Platform artifact (no matter whether it is an installer or a ZIP archive from the IntelliJ Maven repository) to a custom directory and reuse it later with the Gradle dependencies mechanism without breaking some of Gradle's foundations.

I consider a fix in the Gradle build system itself to be the only possible solution.

Vanco commented 3 months ago

Maybe use the hash of the file (ZIP, DMG, TAR.GZ) instead of the one Gradle calculates? @hsz

hsz commented 3 months ago

I have no influence over what is taken as input:

https://github.com/gradle/gradle/blob/7457e89eed50aa0b9dbab3a521141f6b8ce4a073/platforms/software/dependency-management/src/main/java/org/gradle/api/internal/artifacts/transform/DefaultTransform.java#L683

Gradle considers the whole implementation classpath of the build script. The idea is to limit the input classpath and isolate it by specifying only the dependencies that affect the transformer output, for example:

dependencies {
    registerTransform(MyTransform::class.java) {
        from...
        to...
        isolated {
             classpath("transform:dependency-1:1.0")
             classpath("transform:dependency-2:1.0")
        }
    }
}

With that implemented, I could pass the IntelliJ Platform dependency as the input, and Gradle would calculate the hash from the archive you mentioned. But that requires changes in the build system.

JojOatXGME commented 3 months ago

The idea is to limit the input classpath and isolate it by specifying dependencies that affect the transformer output

Not sure if this is the right place to discuss that, but I have a few ideas about that.

  1. When the transformer is part of a JPMS module, the module information could be used to discover all relevant modules. This way, Gradle could avoid using too broad inputs for the hash without any additional configuration. However, as soon as even one dependency of the transformer lacks a module-info, this stops working. So it is probably not the best solution right now. I'm not even sure whether Gradle's API itself has a module-info yet.

  2. Technically, a similar solution would be to recursively scan and hash all the classes used by the transformer. I'm not sure how well that would perform. Extracting the class dependencies from a class file should be fast since they are all listed in the constant pool; you don't have to scan the whole class file. Gradle already reads a lot of class files for different purposes, so it is not a completely new concept. However, scanning all the classes recursively may still take more time than desired. The implementation would also not be 100% reliable when the transformer (indirectly) uses reflection or service discovery to resolve classes. On the other side, the current implementation of calculating the hash from the classpath effectively means that Gradle already scans all the class files (in compressed form) once. So calculating the hash for one transform has the same (or maybe even better) scaling properties, but a worse constant factor, and the hash cannot be reused for multiple transforms in one module.

  3. Alternatively, in contrast to your suggestion above, I am wondering if it would make sense to define transformers as modules.

    dependencies {
        registerTransform("transform:transform:1.0") {
            from...
            to...
        }
    }

    The module would then have to provide the transformer class using service discovery or some config file in the JAR. This way, you avoid the risk of having a mismatch between the transformer modules and the modules declared in the isolated block.
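The constant-pool scan described in point 2 can be made concrete: a .class file lists every class it references in its constant pool, so a scanner only needs to parse the pool, never the method bodies. The following is a minimal, illustrative parser (not Gradle code); it returns internal names such as java/lang/Object.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.*;

public class ConstantPoolClasses {

    /** Returns the internal names of all classes referenced by the constant pool. */
    public static Set<String> referencedClasses(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) throw new IOException("not a class file");
        data.readInt(); // minor + major version
        int count = data.readUnsignedShort();
        String[] utf8 = new String[count];
        List<Integer> classNameIndexes = new ArrayList<>();
        for (int i = 1; i < count; i++) {
            int tag = data.readUnsignedByte();
            switch (tag) {
                case 1: utf8[i] = data.readUTF(); break;                       // CONSTANT_Utf8
                case 7: classNameIndexes.add(data.readUnsignedShort()); break; // CONSTANT_Class
                case 8: case 16: case 19: case 20: data.skipBytes(2); break;   // String, MethodType, Module, Package
                case 15: data.skipBytes(3); break;                             // MethodHandle
                case 3: case 4: case 9: case 10: case 11:
                case 12: case 17: case 18: data.skipBytes(4); break;           // int/float, refs, NameAndType, (Invoke)Dynamic
                case 5: case 6: data.skipBytes(8); i++; break;                 // long/double occupy two slots
                default: throw new IOException("unknown constant pool tag " + tag);
            }
        }
        Set<String> names = new TreeSet<>();
        for (int index : classNameIndexes) names.add(utf8[index]);
        return names;
    }
}
```

The caveats above still apply: reflection and ServiceLoader usage are invisible to this scan, and it would have to be applied transitively to the classes it discovers.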

hsz commented 3 months ago

Thank you for your input, @JojOatXGME!

  1. Unfortunately, IntelliJ Platform dependencies have no module-info at all. I could try creating something on-the-fly, but...
  2. I'm not sure if we're on the same page here. Gradle creates a transformer output directory using the hash of the build script classpath — the transformed artifact itself is completely ignored at this point. And by the build script classpath I mean, for example, all plugins you have applied in the build script (java, org.jetbrains.kotlin.jvm, org.jetbrains.intellij.platform, etc.). If you update build script dependencies (2.0.0-rc1 -> 2.0.0-rc2), the hash of the classpath changes, which leads to running the dependency transformer again. In my post above, I incorrectly stated that the IntelliJ Platform dependency archive is used for hashing.
  3. This sounds intriguing — but how could we create such a module?

JojOatXGME commented 3 months ago

2. I'm not sure if we're on the same page here. Gradle creates a transformer output directory using the hash of the build script classpath

Yes, I understood that. My thought was that, technically, Gradle could hash the classes recursively referenced by the transform instead of hashing the whole classpath. If someone uses registerTransform(MyTransform::class.java), Gradle could hash only the classes used by MyTransform. Anyway, there are a few caveats, as mentioned in my previous comment, so I am not yet convinced by this solution myself. However, it might be worth investigating its feasibility further.

3. This sounds intriguing — but how could we create such a module?

That is left to Gradle to define. (My ideas are about how Gradle could refine its transform feature, similar to the isolated block you mentioned as an example.) One option is that such a module must contain exactly one service provider of type TransformAction; Gradle would then load the module and use ServiceLoader to load the transform. Alternatively, Gradle could require that modules used with registerTransform put the name of the transform class into META-INF/MANIFEST.MF. My biggest concern about this solution is the potential for a cycle: transforms are used while resolving modules, but now transforms are modules themselves. However, it is probably better to resolve transforms from the plugin repositories anyway and ignore local transforms while doing so, in which case there would be no cycle.
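On the loading side, the MANIFEST.MF variant could look roughly like the sketch below. Transform-Class is a made-up attribute name used only for illustration; Gradle defines no such attribute today.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class ManifestTransformLoader {

    /** Reads the (hypothetical) Transform-Class attribute from a transform JAR's manifest. */
    public static String transformClassName(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Manifest manifest = jarFile.getManifest();
            if (manifest == null) throw new IOException("no manifest in " + jar);
            Attributes attrs = manifest.getMainAttributes();
            String name = attrs.getValue("Transform-Class");
            if (name == null) throw new IOException("no Transform-Class attribute in " + jar);
            return name;
        }
    }
}
```

The build tool would then load the named class from a classloader containing only that JAR and its declared dependencies, which is exactly the isolation the isolated block tries to achieve.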

JojOatXGME commented 3 months ago

@hsz Just out of curiosity, and because I would like to receive updates on this topic: is there an issue for this on the Gradle side?

hsz commented 3 months ago

@JojOatXGME There's no input on that from the Gradle side. I'll start working on this story in September. All the progress will be communicated in this thread.

AlexanderBartash commented 4 weeks ago

How to fix the issue with the transforms cache

Solution 1

The IDE should be split into parts and published as separate artifacts to a real repository, so that we can declare conventional Gradle dependencies on them without having to download a 1-2 GB archive and gut it during the build into separate sub-artifacts.

There already seem to be quite a few artifacts published: https://mvnrepository.com/artifact/com.jetbrains.intellij.platform

Solution 2

Gradle faces the same dilemma when downloading JDKs, and it does not use transforms for that. So maybe we should not either, and the problem won't exist. https://github.com/gradle/gradle/blob/c8b35809943a669b5a25a84117c60b2ddaf81bfc/platforms/jvm/toolchains-jvm-shared/src/main/java/org/gradle/jvm/toolchain/internal/install/DefaultJdkCacheDirectory.java#L142

Solution 3

The problem seems to originate from this constraint: https://docs.gradle.org/current/dsl/org.gradle.api.artifacts.transform.TransformOutputs.html#org.gradle.api.artifacts.transform.TransformOutputs:file(java.lang.Object)

For an absolute path, the location is registered as an output file of the TransformAction. The path is required to point to the InputArtifact or be inside it if the input artifact is a directory.

For a relative path, a location for the output file is provided by Gradle, so that the TransformAction can produce its outputs at that location. The parent directory of the provided location is created automatically when calling the method.

It may not work, but the above does not say that we cannot read outside of the input artifact path. When we extract an IDE, we can create an additional file in the directory containing the hash of the original ZIP file. Later, when a new transform runs, we can search the other transforms for an already extracted copy and create a symlink to it; in the code we can call .absolute().normalize() on the path, so the link disappears from the path and the other transform's copy is used directly.

If symlinks for some reason do not work, we can just write a special marker file, which would let anyone trying to use this directory as a platform path know that they should look at another location.

And even if that does not work, we can still skip the extraction when we find an already extracted copy, and create a dependency on the empty directory created by the transform. Then we can rewrite the artifact path using a ComponentMetadataRule, just like I did in this PR:

https://github.com/JetBrains/intellij-platform-gradle-plugin/pull/1785/files#diff-550b94ce513e8351a885e962947d583d45c59187d91264e2f062f583b5cd5bce
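The symlink variant above can be sketched as follows: key a shared extraction directory by a hash of the archive itself (so the key depends only on the file, never on a classpath), extract once, and make every further transform output a symlink to that single copy. All paths and the unpack step are placeholders; whether Gradle accepts a symlink as a transform output is exactly the open question here.

```java
import java.io.IOException;
import java.nio.file.*;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SharedExtractionCache {

    /** Hashes the archive bytes; unlike Gradle's transform hash, this depends only on the file. */
    public static String archiveKey(Path archive) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        // reading the whole archive into memory is fine for a sketch; a real
        // implementation would stream the file through the digest
        for (byte b : sha.digest(Files.readAllBytes(archive))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Extracts once per archive hash and links every further output to the shared copy. */
    public static Path resolveOutput(Path archive, Path sharedRoot, Path transformOutput)
            throws IOException, NoSuchAlgorithmException {
        Path shared = sharedRoot.resolve(archiveKey(archive));
        if (!Files.isDirectory(shared)) {
            Files.createDirectories(shared);
            // placeholder: unzip/untar the archive into `shared` here
        }
        Files.createSymbolicLink(transformOutput, shared);
        // resolving the link yields the single shared copy
        return transformOutput.toRealPath();
    }
}
```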

Solution 4

The reason why this transform runs so often is this: https://github.com/gradle/gradle/blob/7457e89eed50aa0b9dbab3a521141f6b8ce4a073/platforms/software/dependency-management/src/main/java/org/gradle/api/internal/artifacts/transform/DefaultTransform.java#L683

Whenever the build classpath changes, Gradle does not reuse the old transform cache.

To work around this, we could create a sub-plugin just for downloading and unzipping the IDE and almost never change it. But it may not help if changes in the build classpath of the "IntelliJ plugin project" also cause the transform to run.

In that case, a sub-plugin could be created with the sole purpose of downloading the ZIP, splitting it into artifacts, and publishing them to Maven Local. The main plugin would then declare dependencies on those artifacts.

Somewhat related

There is also a somewhat related issue: we have a fake Ivy repository in .intellijPlatform/localPlatformArtifacts. It is a problem because, in a project with many sub-projects, all XML files are written into a single directory, and that directory is never cleared. There may be issues when different sub-projects depend on different IDE versions and overwrite XML files created by other sub-projects, or worse, resolve artifacts they were not supposed to resolve just because another sub-project created an XML file for them.

How to get rid of Ivy XMLs in ".intellijPlatform/localPlatformArtifacts"

Solution 5

bundledPlugins & modules are not really separate from the IDE; they are part of it. But depending on which are required, we may need to create different "variants" of the IDE.

So instead of creating dependencies like "bundledPlugin:Tomcat:242-EAP-SNAPSHOT", we would register a transformer parameterized with the list of requested plugins & modules, which would select the proper bundled plugins & libs, like CollectorTransformer does.

https://github.com/JetBrains/intellij-platform-gradle-plugin/blob/b9b6699ebb95c6c88ee188b9d528506744f24ee5/src/main/kotlin/org/jetbrains/intellij/platform/gradle/artifacts/transform/CollectorTransformer.kt#L4

Nothing changes in the current API for declaring dependencies, except that bundledPlugins won't create any real dependencies, but will just communicate parameters to the CollectorTransformer registered on the fly.

Solution 6

I have also explored whether we could use Gradle's capabilities. In theory they seem to apply well here, because the IDE is a platform with many capabilities, like modules & plugins, which we want to request separately from the IDE. With capabilities, we could declare the IDE dependency as in the example below.

This can be done either internally (when we process the added bundled plugins & modules) in this Gradle plugin, or exposed to developers so that they write something like the following in their build scripts.

dependencies {
    // https://plugins.jetbrains.com/docs/intellij/tools-intellij-platform-gradle-plugin-dependencies-extension.html
    intellijPlatform {
        create(
            // To avoid hard coding these names this plugin can create a catalog on the fly
            // https://docs.gradle.org/current/userguide/platforms.html#sec:importing-published-catalog
            // so that we use it just like the catalog from libs.versions.toml but it may be a bit too much magic.
            properties("intellijPlatform.type"),
            properties("intellijPlatform.version"),
            properties("intellijPlatform.useInstaller").map { it.toBoolean() }
        ) {
            capabilities {
                // These too could be referenced using the generated catalog
                requireCapability("Tomcat")
                requireCapability("Java")
            }
        }
    }
}

Capabilities here are not "real" capabilities per se: we use the capabilities API only as a means to communicate to the Gradle plugin which sub-artifacts we want included. The plugin then, depending on which capabilities were requested, dynamically registers a variant with exactly the same capabilities (so that Gradle does not complain) and the corresponding JARs included. It is somewhat the other way around compared to how Gradle intends it, i.e. an artifact declares capabilities and consumers request them; in our case we can only generate them in reverse, or we would have to register an infinite number of variants.

It is very similar to solution 5, but here we use the capabilities API instead of a custom bundledPlugins(...). It kind of looks nice, since it is a very intuitive API.

The only question now is what other corner cases I do not know about.

In solutions 5 and 6, Gradle's dependency tree will contain just idea:ideaIU:version and nothing else; in the IDE it will show a list of sub-nodes, i.e. the selected JARs. This has a downside: since it's just one dependency, you can no longer fine-tune its sub-dependencies, unless such functionality is built into the design. From another point of view, this downside is also an upside, because this is how the real "production" environment for the IDE looks: plugins do not run with just 2-3 other plugins on their classpath; they live inside an IDE which has some "capabilities" or "plugins" enabled or not.

This may also improve IDE performance because we will not duplicate transitive JARs many times in the dependency tree.

Another positive: we can control the order of all JARs.

JojOatXGME commented 4 weeks ago

Solution 1 IDE should be split into parts and published as separate artifacts [...]

I also made this suggestion as part of #1696. I can imagine that making this work for all cases could take some time. For example, multiple teams at JetBrains might be affected as build pipelines of the different IDEs might need to be adjusted. Anyway, I think this would be the right direction for the long term.

AlexanderBartash commented 4 weeks ago

Solution 1 IDE should be split into parts and published as separate artifacts [...]

I also made this suggestion as part of #1696. I can imagine that making this work for all cases could take some time. For example, multiple teams at JetBrains might be affected as build pipelines of the different IDEs might need to be adjusted. Anyway, I think this would be the right direction for the long term.

Yeah, but technically anyone can probably use this plugin to generate separate artifacts from an IDE and, e.g., publish them to Maven Local, then depend on them. But this plugin provides a few other features related to running the IDE, testing, etc., which we are still going to need.

AlexanderBartash commented 4 weeks ago


Interesting: this actually suggests another solution, similar to 4. A sub-plugin could be created with the sole purpose of downloading the ZIP, splitting it into artifacts, and publishing to Maven Local. The main plugin would then declare dependencies on that. Problem solved.

JojOatXGME commented 4 weeks ago

FYI, I just created an issue in the Gradle project, since all the straightforward solutions would require changes on their side.

AlexanderBartash commented 4 weeks ago

FYI, I just created an issue in the Gradle project, since all the straightforward solutions would require changes on their side.

Good idea. Solutions 1-3 could be done without changes to Gradle. 4 is possible too, but probably only if that other plugin is used separately to prepare the environment before the build.