fasten-project / vulnerability-producer

Gathers, enriches and publishes vulnerability information to a Kafka topic.
https://www.fasten-project.eu/
Apache License 2.0

Use Multi-threading in Vulnerability Producer #121

Closed. mir-am closed this 2 years ago

mir-am commented 2 years ago

This PR addresses #120 by using Executors and ParallelStream for gathering and parsing vulnerability data from multiple sources. The speed gain is up to 20-30 times, which is quite significant.
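A minimal sketch of the general pattern described here, not the PR's actual code: one task per source runs on an executor, and each task parses its records with a parallel stream. The class name `ParallelGatherSketch`, the methods `gatherAll` and `parseAll`, and the string records are all hypothetical stand-ins for the real parsers and vulnerability objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Hypothetical sketch; names are illustrative and not taken from the PR.
final class ParallelGatherSketch {

    // Runs one task per source concurrently and concatenates the results in source order.
    static List<String> gatherAll(List<Callable<List<String>>> sources) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        try {
            List<String> all = new ArrayList<>();
            // invokeAll blocks until every task is done; futures keep the input order.
            for (Future<List<String>> result : pool.invokeAll(sources)) {
                all.addAll(result.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while gathering", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a source task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for parsing raw records; the parallel stream spreads the work over several threads.
    static List<String> parseAll(List<String> rawRecords) {
        return rawRecords.parallelStream()
                .map(String::trim)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Callable<List<String>>> sources = List.of(
                () -> parseAll(List.of(" cve-a ", "cve-b")),
                () -> parseAll(List.of("ghsa-c")));
        System.out.println(gatherAll(sources));
    }
}
```

Sizing the pool to the number of sources is only one possible choice; a smaller fixed size would cap how many sources are fetched at once.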

mir-am commented 2 years ago

There are still some possible improvements that I need to push before this is ready for review.

MagielBruntink commented 2 years ago

Amir, thanks. Some parts really could use the speedup. However, we need to be careful about hitting rate limits in the APIs we use. Let me know when you need a full code review.

mir-am commented 2 years ago

For GitHub's GraphQL API, which is rate-limited, the producer still uses one thread.

MagielBruntink commented 2 years ago

The main rate limit we will run into is the NVD one; if we use a parallel stream there to fetch CVEs, we will be rate-limited within 1 minute. I have tried this before. We could also try another main source of CVEs, like CIRCL?

mir-am commented 2 years ago

For NVD, we can use a single thread. Their dataset files are pretty small. Besides, the GHParser and ExtraParser threads take longer to process.
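To illustrate why a single thread is enough to keep a rate-limited source calm, here is a small sketch that measures how many simulated requests are in flight at once for a given pool size. `PerSourcePoolSketch`, `maxInFlight`, and the 5 ms sleep standing in for an HTTP request are assumptions made for the example; none of them come from the producer's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not code from the PR.
final class PerSourcePoolSketch {

    // Submits `requests` simulated fetches to a pool with `threads` workers and
    // returns the highest number of fetches that overlapped in time.
    static int maxInFlight(int threads, int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                pending.add(pool.submit(() -> {
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5); // stand-in for one HTTP request
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every simulated request to finish
            }
            return peak.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("simulated request failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("rate-limited source, 1 thread: peak = " + maxInFlight(1, 8));
        System.out.println("other source, 4 threads: peak = " + maxInFlight(4, 8));
    }
}
```

With one worker the peak can never exceed 1, so requests to that source are strictly sequential, while other sources can be given larger pools.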

mir-am commented 2 years ago

The PR is ready for review. I don't think there is room for further optimizations considering API rate limits. Now, the producer gathers and processes about 200K vulnerability statements in an hour or two, which is much faster than the single-threaded version.

mir-am commented 2 years ago

@MagielBruntink I would highly appreciate it if you could review this PR.

MagielBruntink commented 2 years ago

I will, looking for time...

MagielBruntink commented 2 years ago

I get a test failure:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running eu.fasten.vulnerabilityproducer.db.NitriteControllerTest
09:44:12,986 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
09:44:12,986 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [file:/c:/Users/Magiel%20Bruntink/repos/vulnerability-producer/target/classes/logback.xml]
09:44:13,018 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.StatusListenerAction - Added status listener of type [ch.qos.logback.core.status.OnConsoleStatusListener]
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.TimestampAction - Using current interpretation time, i.e. now, as time reference.
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.TimestampAction - Adding property to the context with key="byDay" and value="20220321T094413" to the LOCAL scope
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT-ERROR]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.FileAppender]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [FILE]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
09:44:13,049 |-INFO in ch.qos.logback.core.FileAppender[FILE] - File property is set to [/mnt/fasten/vulnerabilities/producer_logs/out-20220321T094413.log]
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to DEBUG
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT-ERROR] to Logger[ROOT]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [FILE] to Logger[ROOT]
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse.jgit.internal.storage.file.FileSnapshot] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse.jgit.transport.PacketLineIn] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse.jgit.transport.PacketLineOut] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse.jgit.util.FS] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.mongodb.driver.cluster] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.mongodb.driver.connection] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.mongodb.driver.protocol.command] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.dizitart.no2.internals.DataService] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.kafka.clients.Metadata] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.kafka.clients.NetworkClient] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.kafka.common.metrics.Metrics] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.kafka.common.network.Selector] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@4b41dd5c - Registering current configuration as safe fallback point
[2022-03-21 09:44:13,205] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Manufacturing class java.lang.String with parameters []
[2022-03-21 09:44:13,205] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Populating pojo class java.lang.String
[2022-03-21 09:44:13,221] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Manufacturing class java.lang.String with parameters []
[2022-03-21 09:44:13,221] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Populating pojo class java.lang.String
[2022-03-21 09:44:13,221] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Manufacturing class java.lang.String with parameters []
[2022-03-21 09:44:13,236] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Populating pojo class java.lang.String
[2022-03-21 09:44:13,236] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Manufacturing class java.lang.String with parameters []
[2022-03-21 09:44:13,236] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Populating pojo class java.lang.String
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.516 s - in eu.fasten.vulnerabilityproducer.db.NitriteControllerTest
[INFO] Running eu.fasten.vulnerabilityproducer.mappers.PoolTest
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s - in eu.fasten.vulnerabilityproducer.mappers.PoolTest
[INFO] Running eu.fasten.vulnerabilityproducer.mappers.PurlMapperTest
[INFO] Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.503 s - in eu.fasten.vulnerabilityproducer.mappers.PurlMapperTest
[INFO] Running eu.fasten.vulnerabilityproducer.mappers.VersionRangerTest
[INFO] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.235 s - in eu.fasten.vulnerabilityproducer.mappers.VersionRangerTest
[INFO] Running eu.fasten.vulnerabilityproducer.parsers.ExtraParserTest
[2022-03-21 09:44:14,099] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Found MSR2020 CPP dataset from memory
[2022-03-21 09:44:14,099] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2009-1194
[2022-03-21 09:44:14,115] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Found a total of 1 vulnerability in the CPP dataset from MSR2020
[2022-03-21 09:44:14,115] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2017-4971
[2022-03-21 09:44:14,115] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2018-1000134
[2022-03-21 09:44:14,131] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsing statements for GEM: curl
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2013-2617
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed 1 from ExtraSources
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Downloading raw CSV file from MSR2020
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Writing dataset to memory
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2009-1194
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Found a total of 1 vulnerability in the CPP dataset from MSR2020
[2022-03-21 09:44:14,193] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-TEST-2
[2022-03-21 09:44:14,193] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-TEST-3
[2022-03-21 09:44:14,224] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsing java statement file 0001.yaml from year 2020
[2022-03-21 09:44:14,224] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2020-0001
[2022-03-21 09:44:14,224] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsing python statement file 0002.yaml from year 2020
[2022-03-21 09:44:14,240] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2020-0002
[2022-03-21 09:44:14,240] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsing statement file for CVE-2020-11989
[2022-03-21 09:44:14,240] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2020-11989
[ERROR] Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.141 s <<< FAILURE! - in eu.fasten.vulnerabilityproducer.parsers.ExtraParserTest
[ERROR] testInjectFromSafetyDB  Time elapsed: 0.047 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <Vulnerability{id='mock-id', purls=[pkg:pypi/mock@0.9, pkg:pypi/mock@1.0], first_patched=[], scoreCVSS2=null, scoreCVSS3=null, severity=null, published_date='null', last_modified_date='null', description='mock description', references=[], patch_links=[], exploits=[], patches=[]}> but was: <null>
    at eu.fasten.vulnerabilityproducer.parsers.ExtraParserTest.testInjectFromSafetyDB(ExtraParserTest.java:109)

mir-am commented 2 years ago

I've fixed the issue. The unit tests should pass now.

MagielBruntink commented 2 years ago

Amir, I ran the multi-threaded vulnerability-producer locally, and it ran into the GitHub API rate limit within 10 minutes... I cannot get into GitHub at the moment, actually. The number of workers really needs to be toned down, otherwise this will get all our computers blacklisted.

mir-am commented 2 years ago

For GH advisory data, we still use one thread like before. However, I believe that you might have been blocked for sending too many requests while parsing GH patch commits. @MagielBruntink, how many CPU cores does your machine have? Mine has 6 cores/12 threads, and it has not been blocked after testing the producer dozens of times. Apparently, 6 cores is the sweet spot for parallelStream, and that level of parallelism can be set in the code.
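One commonly used way to set the parallelism of a parallel stream in code is to start the stream from inside a dedicated `ForkJoinPool`. The sketch below assumes that approach; the PR may do it differently. `BoundedParallelStreamSketch` and `mapWithParallelism` are hypothetical names, and the fact that the stream's tasks stay in the submitting pool is observed JDK behaviour rather than something the Stream API documentation guarantees.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; not code from the PR.
final class BoundedParallelStreamSketch {

    // Maps `fn` over `items` with a parallel stream that is started from inside
    // a dedicated ForkJoinPool created with the requested parallelism.
    static <T, R> List<R> mapWithParallelism(List<T> items, Function<T, R> fn, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.submit(
                    () -> items.parallelStream().map(fn).collect(Collectors.toList())
            ).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the stream", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("stream task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // The lambda is a stand-in for "fetch and parse one patch commit".
        List<String> parsed = mapWithParallelism(
                List.of("commit-1", "commit-2", "commit-3"), s -> "parsed:" + s, 6);
        System.out.println(parsed);
    }
}
```

A JVM-wide alternative is the `java.util.concurrent.ForkJoinPool.common.parallelism` system property, which sizes the common pool that parallel streams use by default.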

MagielBruntink commented 2 years ago

Amir, not sure what the difference is then. I have an 8-core/16-thread machine. The GH API is not only used for the GHSA data though, but also for parsing issues & commits.

mir-am commented 2 years ago

Fair point. I have added a CLI argument, -nt, to set the number of threads used for finding and parsing patch information. This makes it possible to limit the number of requests sent to GitHub or other services.
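As a rough illustration of how a thread-count flag can bound the work, the sketch below reads `-nt` by hand and sizes a fixed thread pool with it. Only the flag name comes from this thread; `ThreadCountArgSketch`, `parseThreadCount`, the default value, and the manual argument scanning are assumptions, and the producer's real argument handling may look quite different.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; the real producer may parse its arguments differently.
final class ThreadCountArgSketch {

    // Returns the value that follows "-nt", or `defaultThreads` when the flag
    // is absent or has no value after it. Values below 1 are rejected.
    static int parseThreadCount(String[] args, int defaultThreads) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-nt".equals(args[i])) {
                int n = Integer.parseInt(args[i + 1]);
                if (n < 1) {
                    throw new IllegalArgumentException("-nt must be at least 1, got " + n);
                }
                return n;
            }
        }
        return defaultThreads;
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = parseThreadCount(args, Runtime.getRuntime().availableProcessors());
        // At most `threads` patch lookups run at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < 5; i++) {
            int id = i;
            pool.submit(() -> System.out.println(
                    "patch lookup " + id + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Running the sketch with `-nt 1` processes the lookups one at a time on a single worker thread.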

MagielBruntink commented 2 years ago

I'm doing a run now with -nt 1.