DependencyTrack / dependency-track

Dependency-Track is an intelligent Component Analysis platform that allows organizations to identify and reduce risk in the software supply chain.
https://dependencytrack.org/
Apache License 2.0

The MavenMetaAnalyzer task fails due to invalid URLs #3566

Open mikael-carneholm-2-wcar opened 7 months ago

mikael-carneholm-2-wcar commented 7 months ago

Current Behavior

In the logs, I can see that the MavenMetaAnalyzer task fails due to invalid URLs formatted with parts of the PURL of a component:

compose-dtrack-apiserver-1  | 2024-03-18 08:51:29,133 INFO [InternalAnalysisTask] Starting internal analysis task
compose-dtrack-apiserver-1  | 2024-03-18 08:51:29,133 INFO [InternalAnalysisTask] Analyzing 171 component(s)
compose-dtrack-apiserver-1  | [Fatal Error] :1:10: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
compose-dtrack-apiserver-1  | 2024-03-18 08:51:31,639 ERROR [MavenMetaAnalyzer] Request failure
compose-dtrack-apiserver-1  | org.xml.sax.SAXParseException: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
compose-dtrack-apiserver-1  |   at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
compose-dtrack-apiserver-1  |   at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
compose-dtrack-apiserver-1  |   at java.xml/javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
compose-dtrack-apiserver-1  |   at org.dependencytrack.tasks.repositories.MavenMetaAnalyzer.analyze(MavenMetaAnalyzer.java:86)
compose-dtrack-apiserver-1  |   at org.dependencytrack.tasks.repositories.RepositoryMetaAnalyzerTask.analyze(RepositoryMetaAnalyzerTask.java:177)
compose-dtrack-apiserver-1  |   at org.dependencytrack.tasks.repositories.RepositoryMetaAnalyzerTask.lambda$analyze$0(RepositoryMetaAnalyzerTask.java:121)
compose-dtrack-apiserver-1  |   at io.github.resilience4j.retry.Retry.lambda$decorateCallable$5(Retry.java:237)
compose-dtrack-apiserver-1  |   at io.github.resilience4j.retry.Retry.executeCallable(Retry.java:373)
compose-dtrack-apiserver-1  |   at org.dependencytrack.util.CacheStampedeBlocker.readThroughOrPopulateCache(CacheStampedeBlocker.java:201)
compose-dtrack-apiserver-1  |   at org.dependencytrack.tasks.repositories.RepositoryMetaAnalyzerTask.analyze(RepositoryMetaAnalyzerTask.java:126)
compose-dtrack-apiserver-1  |   at org.dependencytrack.tasks.repositories.RepositoryMetaAnalyzerTask.inform(RepositoryMetaAnalyzerTask.java:91)
compose-dtrack-apiserver-1  |   at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:110)
compose-dtrack-apiserver-1  |   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
compose-dtrack-apiserver-1  |   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
compose-dtrack-apiserver-1  |   at java.base/java.lang.Thread.run(Unknown Source)

(NB: The DOCTYPE probably stems from an HTML response for a 404 page, but this is just a guess since the URL isn't logged.)
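To illustrate that guess, here is a minimal sketch using only the JDK's built-in XML parser. It sets the same `disallow-doctype-decl` feature named in the log and feeds the parser first an HTML-style error page, then a small Maven-metadata-like document. The class name, method name, and sample bodies are made up for illustration; this is not the code in MavenMetaAnalyzer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

final class DoctypeRejectionSketch {

    /** Returns the parser's error message, or null if the body parsed cleanly. */
    static String tryParse(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The same feature flag that shows up in the log output.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            // Raised as soon as the parser encounters a DOCTYPE declaration.
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException("Unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical bodies: an HTML error page and a metadata-like XML document.
        String htmlErrorPage = "<!DOCTYPE html><html><body>Not Found</body></html>";
        String metadata = "<metadata><versioning><latest>1.0.0</latest></versioning></metadata>";
        System.out.println("HTML page: " + tryParse(htmlErrorPage));
        System.out.println("Metadata:  " + tryParse(metadata));
    }
}
```

If the guess is right, any repository response that is an HTML page rather than XML would trigger the same exception, whatever the component.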

It is, however, impossible to know which component(s) cause this, since the component name isn't logged in the analyze() method. Had it been logged, one could have inspected and corrected the PURL of the component in the DB and traced the chain that led to the invalid PURL.

My suggestion is that:

  1. Something along the lines of "Analyzing component " + component gets logged in the analyze() method, for traceability
  2. The URL is validated before it gets passed to processHttpRequest
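The two suggestions above could be sketched roughly as follows. This is only an illustration: it uses `java.util.logging` and `java.net.URI` from the JDK rather than Dependency-Track's own logger, and the class and method names (`UrlValidationSketch`, `validate`, `analyze`) are hypothetical, not the actual MavenMetaAnalyzer code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
import java.util.logging.Logger;

final class UrlValidationSketch {

    private static final Logger LOGGER = Logger.getLogger(UrlValidationSketch.class.getName());

    /** Returns the parsed URI if it is an absolute http(s) URL with a host, otherwise empty. */
    static Optional<URI> validate(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            if (httpLike && uri.getHost() != null) {
                return Optional.of(uri);
            }
        } catch (URISyntaxException e) {
            // Syntactically invalid (e.g. contains a space); treated as invalid below.
        }
        return Optional.empty();
    }

    /** Logs the component, validates the URL, and reports whether a request would be made. */
    static boolean analyze(String componentPurl, String url) {
        LOGGER.info("Analyzing component " + componentPurl);
        if (!validate(url).isPresent()) {
            LOGGER.warning("Invalid url: " + url + " (component: " + componentPurl + ")");
            return false;
        }
        // A real analyzer would hand the URL to its HTTP request method at this point.
        return true;
    }
}
```

Note that a syntax check like this only catches malformed URLs. A well-formed URL that points at a non-existent path would still pass and could still return a 404 page, so logging the component and URL is what actually makes such cases traceable.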

Steps to Reproduce

Hard to specify, since DTrack doesn't log which component is the root cause.

Expected Behavior

  1. The generated URL gets validated before it gets used. If invalid, a warning along the lines of "Invalid url: " + url gets logged
  2. Each time the MavenMetaAnalyzer.analyze() method is called, "Analyzing " + component is logged for traceability

Dependency-Track Version

4.10.1

Dependency-Track Distribution

Container Image

Database Server

PostgreSQL

Database Server Version

13.13

Browser

N/A

nscuro commented 7 months ago

Related to #3234.

I already added MDC usage to the new BomUploadProcessingTaskV2, we merely need to continue adding MDC wherever it makes sense.

https://github.com/DependencyTrack/dependency-track/blob/333c56d44a7db3447bb1e7126a05b8df6ea717b1/src/main/java/org/dependencytrack/tasks/BomUploadProcessingTaskV2.java#L148-L151

The benefit of using MDC is that it will attach the context variables to all log statements within its scope.

I'm thinking that, specifically for the repository meta analysis, we also want to include the name of the repository to which the request is made.
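To make the scoping idea concrete, here is a small JDK-only stand-in. It is not SLF4J's MDC and not the code linked above: it is a hypothetical `ThreadLocal`-based context that only mimics the behaviour described, where keys set for a scope are attached to every message formatted inside it and removed afterwards. The key names `component` and `repository` are assumptions taken from this discussion. In real code the logging framework's MDC would be used instead; SLF4J offers a comparable try-with-resources form via `MDC.putCloseable`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ScopedLogContextSketch {

    // Per-thread context map, analogous in spirit to an MDC.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    /** A closeable scope whose close() does not throw checked exceptions. */
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    /** Adds a context variable that is removed again when the returned scope is closed. */
    static Scope put(String key, String value) {
        CONTEXT.get().put(key, value);
        return () -> CONTEXT.get().remove(key);
    }

    /** Appends the current context variables, if any, to a log message. */
    static String format(String message) {
        Map<String, String> ctx = CONTEXT.get();
        return ctx.isEmpty() ? message : message + " " + ctx;
    }

    public static void main(String[] args) {
        try (Scope component = put("component", "pkg:maven/com.example/lib@1.0.0");
             Scope repository = put("repository", "central")) {
            // Every message formatted inside the scope carries both variables.
            System.out.println(format("Request failure"));
        }
        // Outside the scope, the context is empty again.
        System.out.println(format("Analysis finished"));
    }
}
```

With context like this in place, an error such as the "Request failure" above would identify the component and repository without each log statement having to concatenate them by hand.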