Closed mir-am closed 2 years ago
There are still some possible improvements that I need to push before reviewing.
Amir, thanks. Some parts really could use the speed-up. However, we need to be careful about hitting rate limits in the APIs we use. Let me know when you need a full code review.
For GitHub's GraphQL API, which is rate limited, it still uses one thread.
The main rate limit we will run into is the NVD one; if we do a parallel stream there to fetch CVEs it will rate limit within 1 minute. I have tried this before. We could also try another main source of CVEs, like CIRCL?
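One way to stay under a rate limit such as the NVD one described above is to space requests out so that callers share a single budget. The following is a minimal sketch under assumptions: the `Throttle` class and the interval value are hypothetical and not part of the producer, and the actual limits of NVD or any other service are not stated in this thread.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: enforces a minimum delay between consecutive calls.
final class Throttle {
    private final long minIntervalNanos;
    private long nextAllowed = System.nanoTime();

    Throttle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    // Blocks until the next request slot is free. Synchronized so that
    // several worker threads share one request budget.
    synchronized void acquire() {
        long now = System.nanoTime();
        long wait = nextAllowed - now;
        if (wait > 0) {
            try {
                TimeUnit.NANOSECONDS.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while throttling", e);
            }
            now = System.nanoTime();
        }
        // Schedule the following slot one interval after this one.
        nextAllowed = Math.max(now, nextAllowed) + minIntervalNanos;
    }
}
```

A fetch loop would call `acquire()` before each request; the interval has to be chosen from the service's published limits.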
For NVD, we can set one thread. Their dataset files are pretty small. Besides, the GHParser and ExtraParser threads take longer to process.
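The split discussed here (one thread for rate-limited sources, more threads for the slower parsers) could be sketched with `java.util.concurrent.Executors` roughly as follows. This is an illustration, not the PR's code: `SourceScheduler`, `runAll`, and the way tasks are grouped are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical scheduler: rate-limited tasks share one thread,
// heavier tasks run on a fixed-size pool.
final class SourceScheduler {
    static <T> List<T> runAll(List<Callable<T>> rateLimited,
                              List<Callable<T>> heavy,
                              int heavyThreads) {
        ExecutorService serial = Executors.newSingleThreadExecutor();
        ExecutorService pool = Executors.newFixedThreadPool(heavyThreads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            // Rate-limited sources are executed strictly one after another.
            for (Callable<T> task : rateLimited) {
                futures.add(serial.submit(task));
            }
            // Slower parsers run concurrently, up to heavyThreads at a time.
            for (Callable<T> task : heavy) {
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // results keep submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            serial.shutdownNow();
            pool.shutdownNow();
        }
    }
}
```

Both groups start immediately, so the single-threaded sources do not delay the pool; only the tasks within the rate-limited group are serialized.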
The PR is ready for review. I don't think there is room for further optimizations considering API rate limits. Now, the producer gathers and processes around 200K vulnerability statements in an hour or two, which is much faster than the single-thread version.
@MagielBruntink I would highly appreciate it if you could review this PR.
I will, looking for time...
I get a test failure:
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running eu.fasten.vulnerabilityproducer.db.NitriteControllerTest
09:44:12,986 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
09:44:12,986 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [file:/c:/Users/Magiel%20Bruntink/repos/vulnerability-producer/target/classes/logback.xml]
09:44:13,018 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.StatusListenerAction - Added status listener of type [ch.qos.logback.core.status.OnConsoleStatusListener]
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.TimestampAction - Using current interpretation time, i.e. now, as time reference.
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.TimestampAction - Adding property to the context with key="byDay" and value="20220321T094413" to the LOCAL scope
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
09:44:13,033 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT-ERROR]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.FileAppender]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [FILE]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
09:44:13,049 |-INFO in ch.qos.logback.core.FileAppender[FILE] - File property is set to [/mnt/fasten/vulnerabilities/producer_logs/out-20220321T094413.log]
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to DEBUG
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT-ERROR] to Logger[ROOT]
09:44:13,049 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [FILE] to Logger[ROOT]
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse.jgit.internal.storage.file.FileSnapshot] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse.jgit.transport.PacketLineIn] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse.jgit.transport.PacketLineOut] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.eclipse.jgit.util.FS] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.mongodb.driver.cluster] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.mongodb.driver.connection] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.mongodb.driver.protocol.command] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.dizitart.no2.internals.DataService] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.kafka.clients.Metadata] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.kafka.clients.NetworkClient] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.kafka.common.metrics.Metrics] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [org.apache.kafka.common.network.Selector] to OFF
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
09:44:13,049 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@4b41dd5c - Registering current configuration as safe fallback point
[2022-03-21 09:44:13,205] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Manufacturing class java.lang.String with parameters []
[2022-03-21 09:44:13,205] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Populating pojo class java.lang.String
[2022-03-21 09:44:13,221] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Manufacturing class java.lang.String with parameters []
[2022-03-21 09:44:13,221] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Populating pojo class java.lang.String
[2022-03-21 09:44:13,221] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Manufacturing class java.lang.String with parameters []
[2022-03-21 09:44:13,236] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Populating pojo class java.lang.String
[2022-03-21 09:44:13,236] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Manufacturing class java.lang.String with parameters []
[2022-03-21 09:44:13,236] [DEBUG] [main] [u.c.j.p.a.PodamFactoryImpl] - Populating pojo class java.lang.String
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.516 s - in eu.fasten.vulnerabilityproducer.db.NitriteControllerTest
[INFO] Running eu.fasten.vulnerabilityproducer.mappers.PoolTest
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s - in eu.fasten.vulnerabilityproducer.mappers.PoolTest
[INFO] Running eu.fasten.vulnerabilityproducer.mappers.PurlMapperTest
[INFO] Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.503 s - in eu.fasten.vulnerabilityproducer.mappers.PurlMapperTest
[INFO] Running eu.fasten.vulnerabilityproducer.mappers.VersionRangerTest
[INFO] Tests run: 27, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.235 s - in eu.fasten.vulnerabilityproducer.mappers.VersionRangerTest
[INFO] Running eu.fasten.vulnerabilityproducer.parsers.ExtraParserTest
[2022-03-21 09:44:14,099] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Found MSR2020 CPP dataset from memory
[2022-03-21 09:44:14,099] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2009-1194
[2022-03-21 09:44:14,115] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Found a total of 1 vulnerability in the CPP dataset from MSR2020
[2022-03-21 09:44:14,115] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2017-4971
[2022-03-21 09:44:14,115] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2018-1000134
[2022-03-21 09:44:14,131] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsing statements for GEM: curl
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2013-2617
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed 1 from ExtraSources
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Downloading raw CSV file from MSR2020
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Writing dataset to memory
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2009-1194
[2022-03-21 09:44:14,177] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Found a total of 1 vulnerability in the CPP dataset from MSR2020
[2022-03-21 09:44:14,193] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-TEST-2
[2022-03-21 09:44:14,193] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-TEST-3
[2022-03-21 09:44:14,224] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsing java statement file 0001.yaml from year 2020
[2022-03-21 09:44:14,224] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2020-0001
[2022-03-21 09:44:14,224] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsing python statement file 0002.yaml from year 2020
[2022-03-21 09:44:14,240] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2020-0002
[2022-03-21 09:44:14,240] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsing statement file for CVE-2020-11989
[2022-03-21 09:44:14,240] [INFO ] [main] [e.f.v.u.p.ExtraParser] - Parsed Vulnerability with ID - CVE-2020-11989
[ERROR] Tests run: 9, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.141 s <<< FAILURE! - in eu.fasten.vulnerabilityproducer.parsers.ExtraParserTest
[ERROR] testInjectFromSafetyDB Time elapsed: 0.047 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <Vulnerability{id='mock-id', purls=[pkg:pypi/mock@0.9, pkg:pypi/mock@1.0], first_patched=[], scoreCVSS2=null, scoreCVSS3=null, severity=null, published_date='null', last_modified_date='null', description='mock description', references=[], patch_links=[], exploits=[], patches=[]}> but was: <null>
at eu.fasten.vulnerabilityproducer.parsers.ExtraParserTest.testInjectFromSafetyDB(ExtraParserTest.java:109)
I've fixed the issue. The unit tests should pass now.
Amir, I ran the multi-threaded vulnerability-producer locally, and it ran into the GitHub API rate limit within 10 minutes... I cannot get into GitHub at the moment actually. The number of workers really needs to be toned down, otherwise this will get all our computers blacklisted.
For GH advisory data, we still use one thread like before. However, I believe that you might be blocked due to sending too many requests for parsing GH patch commits.
@MagielBruntink, how many CPU cores does your machine have?
Mine has 6 cores/12 threads and my machine has not been blocked after testing the producer dozens of times.
Apparently, 6 cores is the sweet spot for parallelStream, and that can be set in the code.
Amir, not sure what the difference is then. I have an 8-core/16-thread machine. Usage of the GH API is not only for the GHSA data though, but also for parsing issues & commits.
Fair point. I have added a CLI arg -nt to set the number of threads to use for finding and parsing patch information. This allows limiting the number of requests sent to GitHub or other services.
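For illustration, a thread-count option like -nt could cap a parallel stream by running it inside a dedicated `ForkJoinPool`. This is a sketch under assumptions, not the PR's implementation: `BoundedParallel`, `parseThreads`, and `mapWithLimit` are hypothetical names, and the way the producer actually parses -nt is not shown in this thread. Note also that a parallel stream using the pool it was started from is long-standing OpenJDK behaviour rather than a documented guarantee.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Function;
import java.util.stream.Collectors;

final class BoundedParallel {
    // Hypothetical parser for "-nt <n>"; defaults to 1 thread when the
    // flag is missing or its value is not a positive integer.
    static int parseThreads(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-nt".equals(args[i])) {
                try {
                    return Math.max(1, Integer.parseInt(args[i + 1]));
                } catch (NumberFormatException e) {
                    return 1;
                }
            }
        }
        return 1;
    }

    // Runs the parallel stream from inside a pool of the requested size
    // instead of the JVM-wide common pool.
    static <T, R> List<R> mapWithLimit(List<T> items, Function<T, R> work, int threads) {
        ForkJoinPool pool = new ForkJoinPool(threads);
        try {
            return pool.submit(
                    () -> items.parallelStream().map(work).collect(Collectors.toList())
            ).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With one thread, the work is effectively sequential, which keeps the request rate to external services at its lowest.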
I'm doing a run now with -nt 1.
This PR addresses #120 by using Executors and ParallelStream for gathering and parsing vulnerability data from multiple sources. The speed gain is up to 20-30 times, which is quite significant.