aboutcode-org / vulnerablecode

A free and open vulnerabilities database and the packages they impact. And the tools to aggregate and correlate these vulnerabilities. Sponsored by NLnet https://nlnet.nl/project/vulnerabilitydatabase/ for https://www.aboutcode.org/ Chat at https://gitter.im/aboutcode-org/vulnerablecode Docs at https://vulnerablecode.readthedocs.org/
https://public.vulnerablecode.io
Apache License 2.0

Track Java "shaded"/uberjarred/jarjared hidden deps vulnerabilities #1266

Open pombredanne opened 1 year ago

pombredanne commented 1 year ago

See:

@jensdietrich I read "Those projects were detected with a research tool our team has developed" ... is this open source?

See also https://github.com/ctcpip/java-shaded-example as an example by @ctcpip

jensdietrich commented 1 year ago

@pombredanne thanks for reaching out -- the software is not yet open source but will be once our paper has been published. There is a preview here: https://arxiv.org/abs/2306.05534 . Some of the vulnerabilities found have led to changes in the GHSA, such as https://github.com/github/advisory-database/pull/2258 (has references to a few more).

I am pretty open-minded about making the tool available (before officially open-sourcing it, or perhaps even fast-tracking this) and working with your project. It would be useful for us (there are two more team members) if you could describe how you would integrate and use this.

pombredanne commented 1 year ago

@jensdietrich And thank you for taking the time to reply! At a high level, we are providing an open source SCA solution backed by open data.

There are two sides to the problem at hand of shaded JARs: detecting that a JAR shades other packages (and which ones), and more generally when a package embeds other packages beyond JARs; and reporting the corresponding vulnerabilities, if any.

  1. For the detection side, we want to be able to detect shaded JARs in ScanCode, using the PurlDB index as needed. One approach is to map the binaries of a JAR to its corresponding source code as we do in this pipeline https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/deploy_to_develop.py : the parts that are not mapped to the corresponding sources are therefore from another origin that can be matched in the PurlDB. Another approach is matching to find embedded code and, when this implies rewritten bytecode, using various "features" that we could extract from it, such as symbols, code graphs, or a decompilation that abstracts away the class paths. Or your approach!

  2. For the vulnerability side, this would likely be resolved with a VulnerableCode improver that would use the results of 1. above, do a matching or lookup by PURL in PurlDB from VulnerableCode, and update the VulnerableCode DB so that the embedding package also reports the vulnerabilities affecting its embedded package(s).

So 1. would be integrated in ScanCode, ScanCode.io and PurlDB, and 2. in VulnerableCode.
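To illustrate what 2. could look like, here is a hypothetical sketch of the propagation step only: the PURLs, the `lookup_vulnerabilities` stand-in and the data shapes are invented for illustration and are not the actual VulnerableCode improver or PurlDB APIs.

```python
# Hypothetical sketch: propagate vulnerabilities from embedded (shaded)
# packages to the embedding package. The PURLs, the lookup function and the
# data shapes are illustrative stand-ins, not the actual VulnerableCode
# improver or PurlDB APIs.
from dataclasses import dataclass, field


@dataclass
class PackageVulnerabilities:
    purl: str
    vulnerability_ids: set[str] = field(default_factory=set)


# Stand-in for a VulnerableCode/PurlDB lookup by PURL (illustrative data only).
KNOWN_VULNERABILITIES = {
    "pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.10": {"GHSA-example-1"},
}


def lookup_vulnerabilities(purl: str) -> set[str]:
    """Return the vulnerability ids known to affect `purl` (illustrative only)."""
    return set(KNOWN_VULNERABILITIES.get(purl, set()))


def improve_embedding_package(embedding_purl: str, embedded_purls: list[str]) -> PackageVulnerabilities:
    """
    Given an embedding (shading) package and the PURLs detected as embedded in
    it (the output of step 1.), collect the vulnerabilities of the embedded
    packages and report them against the embedding package.
    """
    result = PackageVulnerabilities(purl=embedding_purl)
    for purl in embedded_purls:
        result.vulnerability_ids |= lookup_vulnerabilities(purl)
    return result


if __name__ == "__main__":
    report = improve_embedding_package(
        embedding_purl="pkg:maven/org.example/uber-app@1.0.0",
        embedded_purls=["pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.9.10"],
    )
    print(report)
```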

jensdietrich commented 1 year ago

@pombredanne Thanks for the clarification. In our approach, we associate bytecode with sources using the Maven REST API. This enables us to perform matching at the source level, using a custom AST analysis (that ignores package names, as they usually change during shading). But to precisely detect vulnerabilities (i.e. avoid false positives) we rely on a proof-of-vulnerability (POV) project for each CVE that makes it testable. Those projects then get instantiated for clones, and the respective tests are executed to check whether the vulnerability is still present. There are a few POVs we collected here for evaluation purposes: https://github.com/jensdietrich/xshady/. POVs can often be created from patches, but the process cannot be completely automated IMO and may therefore not scale sufficiently for you. Also, our approach does miss some clones; for instance, we limit the number of REST queries for performance reasons.
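To make the package-name-agnostic matching idea concrete, here is a minimal, hypothetical sketch (not the tool described above): it simply strips `package` and `import` declarations and whitespace before hashing, whereas the real analysis works on ASTs rather than regexes.

```python
# Minimal illustration of matching Java sources while ignoring package names
# (which typically change during shading/relocation). This is not the tool
# described in this thread; it is a naive normalize-then-hash comparison.
import hashlib
import re

PACKAGE_OR_IMPORT = re.compile(r"^\s*(package|import)\s+[\w.]+\s*;\s*$", re.MULTILINE)


def normalized_fingerprint(java_source: str) -> str:
    """Strip package/import declarations and whitespace, then hash the rest."""
    stripped = PACKAGE_OR_IMPORT.sub("", java_source)
    stripped = re.sub(r"\s+", " ", stripped).strip()
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


original = """
package com.fasterxml.jackson.databind;
import java.io.IOException;
public class ObjectMapper { void readValue() {} }
"""

shaded = """
package org.example.shaded.jackson.databind;
import java.io.IOException;
public class ObjectMapper { void readValue() {} }
"""

# Same fingerprint despite the relocated package name.
print(normalized_fingerprint(original) == normalized_fingerprint(shaded))
```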

We are also working on matching binaries -- a dual setup with an old-fashioned engineered solution, and a dataset that might be suitable to train a classifier to match binaries. The focus at the moment is on the dataset. Work is in progress; we hope to have something by the end of the year. Being an academic and teaching makes things slow. Our main use case here is not vulnerability detection, but reproducible builds.

I read that NLnet sponsors some projects in this space; perhaps we could apply to hire students to help with this.

Re making our tool available -- I will discuss this with my collaborators. It will take a few days though.