CycloneDX / cyclonedx-maven-plugin

Creates CycloneDX Software Bill of Materials (SBOM) from Maven projects
https://cyclonedx.org/
Apache License 2.0

Calculating hashes vs using from maven cache #493

Open prabhu opened 4 months ago

prabhu commented 4 months ago

Noticed that the hashes are calculated from the actual file instead of reusing the checksum information already present in the Maven cache.

https://github.com/CycloneDX/cyclonedx-maven-plugin/blob/51bde9e135ded7248e629ec6ebc609dc5d397a20/src/main/java/org/cyclonedx/maven/DefaultModelConverter.java#L164

Further, I believe there might be a few problems with the hashing logic used.

    try (InputStream fis = new BufferedInputStream(Files.newInputStream(file.toPath()), bufSize)) {
        final byte[] buf = new byte[bufSize];
        while (fis.available() > 0) {
            final int read = fis.read(buf);
            digests.stream().parallel().forEach(d -> d.update(buf, 0, read));
        }
    }
  1. There is no encoding specified, so I am not sure whether this affects the data that gets read.
  2. Files.newInputStream appears to be non-locking as per the docs.
  3. The parallel() method is used, which is an intermediate operation as per the documentation.

Is it possible that incorrect data might get read if multiple builds and file copy operations also happen in parallel?

I am not very knowledgeable in Java, so please correct me if I am wrong.
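As an aside, the `available()` check in the quoted loop is also fragile: `InputStream.available()` reports how many bytes can be read without blocking, not whether the stream is exhausted, and `read()` may return -1. A minimal sketch of a more defensive version (class and method names here are mine, not the plugin's):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

public class HashSketch {

    // Read until read() returns -1 (end of stream) instead of polling
    // available(). The digests are updated sequentially because
    // MessageDigest is not thread-safe, so a parallel stream over shared
    // digest instances is avoided.
    static List<MessageDigest> hashFile(Path file, List<String> algorithms)
            throws IOException, NoSuchAlgorithmException {
        List<MessageDigest> digests = new ArrayList<>();
        for (String algorithm : algorithms) {
            digests.add(MessageDigest.getInstance(algorithm));
        }
        byte[] buf = new byte[8192];
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int read;
            while ((read = in.read(buf)) != -1) {
                for (MessageDigest d : digests) {
                    d.update(buf, 0, read);
                }
            }
        }
        return digests;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("hash-demo", ".bin");
        Files.write(tmp, "hello".getBytes(StandardCharsets.US_ASCII));
        List<MessageDigest> digests = hashFile(tmp, List.of("MD5", "SHA-1"));
        StringBuilder sha1Hex = new StringBuilder();
        for (byte b : digests.get(1).digest()) {
            sha1Hex.append(String.format("%02x", b));
        }
        System.out.println(sha1Hex);  // SHA-1 of "hello"
        Files.delete(tmp);
    }
}
```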

ppkarwasz commented 4 months ago

@prabhu,

Maven Central mostly relies on MD5 and SHA1 hashes. These are the only two algorithms required by OSSRH (cf. OSSRH Requirements) and in practice they are the only ones published to Maven Central. This is why the CycloneDX Maven plugin computes the hashes itself.

If you observe wrong results in a multi-module Maven build, these are most probably due to #410, not concurrency problems.

A long-standing problem in multi-module Maven builds is that it is really hard to execute some task at the end of the build, so usually the aggregate SBOM is generated based on the previous snapshots.

VinodAnandan commented 4 months ago

I recommend against using the Maven Central/external hash. Several Java frameworks modify the .class files. To accurately identify these modified jars, it is advisable to use locally computed hashes instead.

prabhu commented 4 months ago

@VinodAnandan, in that case the purl must be updated to reflect the fact that such jars might differ from the ones published in Maven Central.

I have tried to handle this case using evidence by comparing the generated hash with the one in the maven cache. Could you kindly test with my PR and let me know how it looks?

https://github.com/CycloneDX/cyclonedx-maven-plugin/pull/494
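A rough sketch of the comparison described above (hypothetical names, not the PR's actual code): the local Maven repository stores a `.sha1` sidecar file next to each downloaded artifact, which can be checked against a locally computed digest.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChecksumCompare {

    // Hypothetical helper: Maven stores the repository checksum next to the
    // artifact as "<name>.sha1". The file may carry trailing whitespace or a
    // filename after the hex digest, so only the first whitespace-delimited
    // token is compared.
    static boolean matchesRepositoryChecksum(Path artifact, String computedSha1)
            throws IOException {
        Path sidecar = artifact.resolveSibling(artifact.getFileName() + ".sha1");
        if (!Files.exists(sidecar)) {
            return false;  // no recorded checksum to compare against
        }
        String recorded = new String(Files.readAllBytes(sidecar), StandardCharsets.US_ASCII)
                .trim().split("\\s+")[0];
        return recorded.equalsIgnoreCase(computedSha1);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("m2-demo");
        Path jar = dir.resolve("lib-1.0.jar");
        Files.write(jar, new byte[] {1, 2, 3});
        // Placeholder hex value standing in for a recorded repository checksum
        Files.write(dir.resolve("lib-1.0.jar.sha1"),
                "7037807198c22a7d2b0807371d763779a84fdfcf\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(matchesRepositoryChecksum(jar, "7037807198c22a7d2b0807371d763779a84fdfcf"));
        System.out.println(matchesRepositoryChecksum(jar, "deadbeef"));
    }
}
```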

VinodAnandan commented 3 months ago

@prabhu Do you mean capturing that information with something like repository_url=localhost/.m2/ ? I believe Maven's default repository is expected to be https://repo.maven.apache.org/maven2, but in many enterprise setups the build tool resolves artifacts from an internal repository (Nexus or JFrog). @stevespringett, @pombredanne please correct me if I misunderstood. Perhaps the tool should compare the local JAR hashes with those at the actual download location, and if they differ, update the PURL repository URL to repository_url=localhost/.m2/ ?

@prabhu @stevespringett What are your thoughts on capturing modified change information in the pedigree part of the CycloneDX (https://cyclonedx.org/use-cases/#pedigree)? In a recent discussion with the Quarkus team (@aloubyansky), we (@nscuro and I) also discussed Quarkus modifying their JARs. I'm not sure if they have found a solution to capture these changes yet.
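For illustration, the repository_url idea above maps onto a purl qualifier: per the purl spec, qualifiers are appended after `?` with percent-encoded values. The names and package coordinates below are made up for the example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PurlQualifier {

    // Hypothetical helper: append a repository_url qualifier to a base purl.
    // URLEncoder performs form-encoding (space becomes '+'), which differs
    // slightly from purl's percent-encoding, but is adequate for values
    // without spaces.
    static String withRepositoryUrl(String basePurl, String repoUrl) {
        return basePurl + "?repository_url="
                + URLEncoder.encode(repoUrl, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(withRepositoryUrl("pkg:maven/org.example/lib@1.0.0", "localhost/.m2"));
    }
}
```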

aloubyansky commented 3 months ago

Yes, we have all the info, of course. It's a matter of properly manifesting it. I'll share some examples soon.

aloubyansky commented 3 months ago

> A long-standing problem in multi-module Maven builds is that it is really hard to execute some task at the end of the build, so usually the aggregate SBOM is generated based on the previous snapshots.

This can be done by implementing https://maven.apache.org/ref/3.5.0/apidocs/org/apache/maven/AbstractMavenLifecycleParticipant.html and configuring the plugin as containing extensions.
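A minimal sketch of such a participant (illustrative names, not the plugin's code; it requires maven-core on the classpath and the plugin declared with `<extensions>true</extensions>`):

```java
import org.apache.maven.AbstractMavenLifecycleParticipant;
import org.apache.maven.execution.MavenSession;
import org.codehaus.plexus.component.annotations.Component;

@Component(role = AbstractMavenLifecycleParticipant.class, hint = "sbom-aggregator")
public class SbomLifecycleParticipant extends AbstractMavenLifecycleParticipant {

    @Override
    public void afterSessionEnd(MavenSession session) {
        // All modules have been built at this point, so an aggregate SBOM
        // could be generated here from the complete reactor state.
    }
}
```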

hboutemy commented 3 months ago

> There is no encoding specified, so not sure if this affects the data that gets read

A hash is always computed over binary content; no character encoding is ever involved in hashing.
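A small illustration of this point: the digest is a function of the raw bytes, and a charset only matters when a String is first converted to bytes, not when a file's bytes are hashed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        // Two raw bytes, as they might appear in a file on disk.
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        String fromBytes = HexFormat.of().formatHex(md.digest(bytes));
        // The same two bytes happen to be "é" in UTF-8; re-encoding the
        // string as UTF-8 reproduces the identical bytes and hash.
        String fromText = HexFormat.of().formatHex(
                md.digest("é".getBytes(StandardCharsets.UTF_8)));
        System.out.println(fromBytes.equals(fromText));  // true
    }
}
```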