crocs-muni / sec-certs

Tool for analysis of security certificates and their security targets (Common Criteria, NIST FIPS140-2...).
https://sec-certs.org
MIT License

Add rudimentary profiling to processing pipeline #288

Closed J08nY closed 11 months ago

J08nY commented 1 year ago

From https://github.com/crocs-muni/sec-certs/pull/275#discussion_r1038820309:

logger.info("Extracting report keywords")

Log entries like this could be replaced with some elegant way of tracking how long these stages and steps of processing take. Like a context manager that:
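A minimal sketch of what such a context manager could look like, using only the standard library. The name `log_stage` and the message wording are hypothetical and not taken from the sec-certs codebase; the idea is simply that the existing `logger.info(...)` call is replaced by a `with` block that logs both the start of the stage and its duration.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def log_stage(name, log=logger):
    """Log the start of a stage and, on exit, how long it took.

    `log_stage` is a hypothetical helper name used for illustration only.
    """
    log.info("Starting: %s", name)
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so failed stages are timed too.
        elapsed = time.perf_counter() - start
        log.info("Finished: %s (took %.2f s)", name, elapsed)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with log_stage("Extracting report keywords"):
        time.sleep(0.1)  # stand-in for the real processing step
```

Using `time.perf_counter()` rather than wall-clock time keeps the measurement monotonic, and the `finally` clause means a stage that fails still reports how long it ran.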

J08nY commented 1 year ago

Here is some manually extracted data from a full CC run on the server.

Commit: 6448911bb5872feb281b0151d63c54eeeb887cc7
Total duration: 9h:36m:15s

| What | When | Length |
| --- | --- | --- |
| Initial CSV/HTML download + process | 2023-04-26 13:43:03,860 | 0h0m |
| CPEDataset from JSON | 2023-04-26 13:43:43,700 | 0h1m |
| CVEDataset from JSON | 2023-04-26 13:44:01,253 | 0h0m |
| PPDataset | 2023-04-26 13:44:32,354 | 0h0m |
| MU dataset - download reports | 2023-04-26 13:44:32,637 | 0h3m |
| MU dataset - download targets | 2023-04-26 13:47:36,796 | 0h4m |
| MU dataset - convert reports | 2023-04-26 13:51:33,456 | 0h5m |
| MU dataset - convert targets | 2023-04-26 13:56:41,531 | 0h11m |
| MU dataset - extract report meta | 2023-04-26 14:07:21,226 | 0h0m |
| MU dataset - extract target meta | 2023-04-26 14:07:23,571 | 0h0m |
| MU dataset - extract report frontpage | 2023-04-26 14:07:54,051 | 0h0m |
| MU dataset - extract target frontpage | 2023-04-26 14:07:56,402 | 0h0m |
| MU dataset - extract report keywords | 2023-04-26 14:08:03,717 | 0h0m |
| MU dataset - extract target keywords | 2023-04-26 14:08:29,043 | 0h6m |
| CC scheme pages | 2023-04-26 14:14:40,720 | 0h15m |
| download reports | 2023-04-26 14:29:18,751 | 0h33m |
| download targets | 2023-04-26 15:02:33,414 | 0h38m |
| convert reports | 2023-04-26 15:40:24,238 | 2h27m |
| convert targets | 2023-04-26 18:07:29,028 | 3h9m |
| extract report meta | 2023-04-26 21:18:53,521 | 0h3m |
| extract target meta | 2023-04-26 21:21:46,177 | 0h7m |
| extract report frontpage | 2023-04-26 21:28:41,745 | 0h1m |
| extract target frontpage | 2023-04-26 21:29:45,351 | 0h2m |
| extract report keywords | 2023-04-26 21:31:30,754 | 0h16m |
| extract target keywords | 2023-04-26 21:47:04,355 | 1h1m |
| heuristics - cert_id | 2023-04-26 22:48:02,540 | 0h0m |
| heuristics - cpe match | 2023-04-26 22:48:02,729 | 0h6m |
| heuristics - cve | 2023-04-26 22:54:21,816 | 0h2m |
| heuristics - references | 2023-04-26 22:56:16,638 | 0h0m |
| heuristics - transitive vulns | 2023-04-26 22:56:18,026 | 0h0m |
| heuristics - cert labs | 2023-04-26 22:56:38,557 | 0h0m |
| heuristics - SARs | 2023-04-26 22:56:38,622 | 0h23m |
| End | 2023-04-26 23:19:19,853 | |
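The lengths above were extracted by hand. As a rough sketch of how they could be derived instead, the snippet below treats each stage's length as the gap to the next stage's start timestamp. It assumes the timestamps use the default `logging` layout seen in the table; the function names `stage_lengths` and `format_length` are made up for this example.

```python
from datetime import datetime

# Timestamp layout as it appears in the table (milliseconds after the comma).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def stage_lengths(entries):
    """Return (stage, timedelta) pairs from consecutive (stage, start) entries.

    The last entry only marks where the previous stage ends, so it gets no length.
    """
    parsed = [(name, datetime.strptime(ts, LOG_TIME_FORMAT)) for name, ts in entries]
    return [
        (name, next_start - start)
        for (name, start), (_, next_start) in zip(parsed, parsed[1:])
    ]


def format_length(delta):
    """Format a timedelta like the table does, rounded to whole minutes."""
    minutes = round(delta.total_seconds() / 60)
    return f"{minutes // 60}h{minutes % 60}m"


# Three consecutive rows copied from the table.
entries = [
    ("download targets", "2023-04-26 15:02:33,414"),
    ("convert reports", "2023-04-26 15:40:24,238"),
    ("convert targets", "2023-04-26 18:07:29,028"),
]
for name, delta in stage_lengths(entries):
    print(name, format_length(delta))
# download targets 0h38m
# convert reports 2h27m
```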

Some numbers:

The resulting dataset has 5326 certificates.
In total, we identified 22546 vulnerabilities in 367 vulnerable certificates.
There were total of 151 certificates skipped due to duplicity

The biggest culprits in the runtime are the OCR in our PDF-to-text conversion (the two convert steps account for 5h36m of the total in the table above) and the downloads from the CC pages.
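To put rough numbers on that, here is a small sketch that sums the listed lengths for the main convert and download steps and relates them to the total duration. The values are copied from the table above; the helper names `parse_length` and `share` are hypothetical, and the total is truncated to whole minutes.

```python
import re


def parse_length(text):
    """Convert a length such as '2h27m' into minutes."""
    match = re.fullmatch(r"(\d+)h(\d+)m", text)
    if match is None:
        raise ValueError(f"unexpected length: {text!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


# Lengths of the main (non-MU) download and convert steps, as listed in the table.
lengths = {
    "download reports": "0h33m",
    "download targets": "0h38m",
    "convert reports": "2h27m",
    "convert targets": "3h9m",
}
TOTAL_MINUTES = 9 * 60 + 36  # total duration 9h:36m:15s, seconds dropped


def share(prefix):
    """Fraction of the whole run spent in steps whose name starts with prefix."""
    minutes = sum(parse_length(v) for k, v in lengths.items() if k.startswith(prefix))
    return minutes / TOTAL_MINUTES


for prefix in ("convert", "download"):
    print(f"{prefix}: {share(prefix):.0%} of the run")
# convert: 58% of the run
# download: 12% of the run
```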