Switch from NVD json feeds to API

adamjanovsky commented 1 year ago

Closes #324

TODO

[x] Resolve matching to CVEs with configurations -- currently, CVEDataset only contains matching criteria for such configurations. A special lookup dictionary must be built to address this.
[x] Check that the workflow for fetching the datasets, for CPE and CVE matching actually works
[x] ~~Start using cached CPEs again~~ (makes no sense since CVEDataset no longer works with them)
[ ] Do one full run as a sanity check. Compare the number of detected CPEs and CVEs
- Just download certs, process auxiliary datasets, compute_cpe_heuristics, compute_related_cves
[x] Unify logging during downloads and heuristics processing
[x] Rewrite old tests to account for new fields in the objects
[x] Write new tests to test the dataset builder
[x] Run all notebooks -- update necessary methods
[x] Update docs with the NVD API key description. Describe how the data is being pulled, etc.
[x] Investigate RoCA CVEs and also other cases, see note below.
[x] Write docstrings
[x] Profile import times. Can we improve?
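
The lookup dictionary from the first TODO item could be built roughly as follows. This is a minimal sketch, not the sec-certs implementation: it assumes that each CPEMatch API page carries a `matchStrings` list whose entries hold a `matchCriteriaId` and an optional `matches` list of `cpeName` values. The function name `build_cpe_match_dict` and the sample page are hypothetical.

```python
def build_cpe_match_dict(pages):
    """Map each matchCriteriaId to the list of CPE names it expands to.

    `pages` is an iterable of decoded JSON responses. The field names used
    here are an assumption about the CPEMatch API 2.0 response layout.
    """
    lookup = {}
    for page in pages:
        for entry in page.get("matchStrings", []):
            match_string = entry["matchString"]
            names = lookup.setdefault(match_string["matchCriteriaId"], [])
            # Criteria that expand to no CPE are assumed to lack a "matches" key.
            names.extend(m["cpeName"] for m in match_string.get("matches", []))
    return lookup


# Hand-written sample page, not real NVD data.
sample_page = {
    "matchStrings": [
        {
            "matchString": {
                "matchCriteriaId": "criteria-1",
                "matches": [
                    {"cpeName": "cpe:2.3:a:vendor:product:1.0:*:*:*:*:*:*:*"},
                    {"cpeName": "cpe:2.3:a:vendor:product:1.1:*:*:*:*:*:*:*"},
                ],
            }
        },
        {"matchString": {"matchCriteriaId": "criteria-2"}},
    ]
}

cpe_match_dict = build_cpe_match_dict([sample_page])
```

Keeping criteria with an empty list (instead of dropping them) makes it possible to tell "known criteria with no CPEs" apart from "unknown criteria" at lookup time.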

Also, it may be valuable to put up a list of expected CVEs and their matches. Maybe we could collect it on Trello. I don't think that we want to run these tests on each commit (so I'll disable them in CI/CD), but it may be a good idea to run them when touching CVE/CPE matching.

Endpoints to use:

CPE API to recover all CPEs: https://services.nvd.nist.gov/rest/json/cpes/2.0
CVE API to recover all CVEs: https://services.nvd.nist.gov/rest/json/cves/2.0
CPEMatch API to map CPE criteria to CPE names: https://services.nvd.nist.gov/rest/json/cpematch/2.0
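
All three endpoints are paginated, so recovering a full dataset means walking through the pages. A minimal sketch of that loop, assuming the `startIndex` and `resultsPerPage` query parameters, the `totalResults` response field and the `apiKey` request header of the NVD API 2.0. The helper names (`fetch_cve_page`, `iter_items`) and the default page size are hypothetical, and this is not the sec-certs dataset builder.

```python
import json
import urllib.parse
import urllib.request

CVE_ENDPOINT = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def fetch_cve_page(start_index, page_size, api_key=None):
    """Fetch one page from the CVE API (performs a real HTTP request)."""
    query = urllib.parse.urlencode({"startIndex": start_index, "resultsPerPage": page_size})
    request = urllib.request.Request(f"{CVE_ENDPOINT}?{query}")
    if api_key:
        request.add_header("apiKey", api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def iter_items(fetch, items_key, page_size=2000):
    """Yield every item by advancing startIndex until totalResults is reached."""
    start_index = 0
    while True:
        page = fetch(start_index, page_size)
        items = page.get(items_key, [])
        yield from items
        start_index += len(items)
        if not items or start_index >= page["totalResults"]:
            break


# Offline demonstration with a fake fetcher instead of the real endpoint.
def fake_fetch(start_index, page_size):
    data = ["CVE-A", "CVE-B", "CVE-C", "CVE-D", "CVE-E"]  # placeholder identifiers
    return {"totalResults": len(data), "vulnerabilities": data[start_index:start_index + page_size]}


collected = list(iter_items(fake_fetch, "vulnerabilities", page_size=2))
```

Against the real service, `iter_items(fetch_cve_page, "vulnerabilities")` would be the corresponding call; rate limiting and retries are left out of the sketch.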

New tests

[x] Some requests with API handler
[x] Matching of complex criteria
[x] Pruning to CPEs of interest
[ ] Parsing dictionary of vulnerable configurations
[ ] CVEDataset correctly handles criteria configuration
[x] Datasets can be downloaded from seccerts.org
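
To illustrate what "matching of complex criteria" involves, here is a rough sketch of evaluating one vulnerable configuration against the CPEs assigned to a certificate. It assumes that a configuration is a dictionary with an `operator` and a list of `nodes`, and that each node has an `operator`, a `negate` flag and a `cpeMatch` list referencing `matchCriteriaId` values. The function names are hypothetical, the `vulnerable` flag is ignored for brevity, and the actual logic in sec-certs may differ.

```python
def _combine(results, operator):
    """AND requires every result to hold; anything else is treated as OR."""
    if operator == "AND":
        return bool(results) and all(results)
    return any(results)


def node_matches(node, cert_cpes, cpe_match_dict):
    """Check one node: does the certificate hold a CPE for its criteria?"""
    hits = [
        not cert_cpes.isdisjoint(cpe_match_dict.get(match["matchCriteriaId"], []))
        for match in node.get("cpeMatch", [])
    ]
    result = _combine(hits, node.get("operator", "OR"))
    return not result if node.get("negate", False) else result


def configuration_matches(configuration, cert_cpes, cpe_match_dict):
    """Combine the node results with the configuration-level operator."""
    node_results = [node_matches(n, cert_cpes, cpe_match_dict) for n in configuration.get("nodes", [])]
    return _combine(node_results, configuration.get("operator", "OR"))


# Hand-written example: an application that is vulnerable only on a given OS.
APP = "cpe:2.3:a:vendor:app:1.0:*:*:*:*:*:*:*"
OS = "cpe:2.3:o:vendor:os:2.0:*:*:*:*:*:*:*"
cpe_match_dict = {"app-criteria": [APP], "os-criteria": [OS]}
configuration = {
    "operator": "AND",
    "nodes": [
        {"operator": "OR", "negate": False, "cpeMatch": [{"matchCriteriaId": "app-criteria"}]},
        {"operator": "OR", "negate": False, "cpeMatch": [{"matchCriteriaId": "os-criteria"}]},
    ],
}

both_present = configuration_matches(configuration, {APP, OS}, cpe_match_dict)
only_app = configuration_matches(configuration, {APP}, cpe_match_dict)
```

In this example the certificate is flagged only when both the application and the platform CPE are present, which is the case the lookup dictionary is meant to support.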

Notes:

This doesn't seem to hurt peak RAM usage; we still peak at ~8GB during CVE matching.
Serialized datasets can take up to 2GB uncompressed. The compression ratio is approx. 10.
In total, we identified 21060 vulnerabilities in 348 vulnerable certificates.
The following snippet takes approx. 26 minutes on my laptop (with everything already downloaded):

from sec_certs.dataset import CCDataset

cc_dset = CCDataset(root_dir="/Users/adam/phd/projects/certificates/sec-certs/datasets/cc")
cc_dset.get_certs_from_web()
cc_dset._prepare_cpe_dataset()
cc_dset._prepare_cve_dataset()
cc_dset._prepare_cpe_match_dict()
cc_dset.compute_cpe_heuristics()
cc_dset.compute_related_cves()

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 74.42% and project coverage change: +0.84 :tada:

Comparison is base (5893352) 76.61% compared to head (d4825d1) 77.44%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #328      +/-   ##
==========================================
+ Coverage   76.61%   77.44%   +0.84%
==========================================
  Files          51       52       +1
  Lines        6372     6572     +200
==========================================
+ Hits         4881     5089     +208
+ Misses       1491     1483       -8
```

| [Impacted Files](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni) | Coverage Δ | |
|---|---|---|
| [src/sec\_certs/sample/fips.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy9zYW1wbGUvZmlwcy5weQ==) | `86.34% <0.00%> (-0.27%)` | :arrow_down: |
| [src/sec\_certs/utils/pandas.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy91dGlscy9wYW5kYXMucHk=) | `0.00% <ø> (ø)` | |
| [src/sec\_certs/dataset/dataset.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy9kYXRhc2V0L2RhdGFzZXQucHk=) | `52.22% <21.43%> (-9.34%)` | :arrow_down: |
| [src/sec\_certs/dataset/cpe.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy9kYXRhc2V0L2NwZS5weQ==) | `73.98% <64.11%> (+18.71%)` | :arrow_up: |
| [src/sec\_certs/utils/nvd\_dataset\_builder.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy91dGlscy9udmRfZGF0YXNldF9idWlsZGVyLnB5) | `82.68% <82.68%> (ø)` | |
| [src/sec\_certs/sample/cve.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy9zYW1wbGUvY3ZlLnB5) | `84.04% <85.30%> (+32.25%)` | :arrow_up: |
| [src/sec\_certs/dataset/cve.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy9kYXRhc2V0L2N2ZS5weQ==) | `91.09% <85.49%> (+6.58%)` | :arrow_up: |
| [src/sec\_certs/sample/cpe.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy9zYW1wbGUvY3BlLnB5) | `91.51% <89.84%> (-1.08%)` | :arrow_down: |
| [src/sec\_certs/serialization/json.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy9zZXJpYWxpemF0aW9uL2pzb24ucHk=) | `84.91% <90.48%> (+0.70%)` | :arrow_up: |
| [src/sec\_certs/configuration.py](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni#diff-c3JjL3NlY19jZXJ0cy9jb25maWd1cmF0aW9uLnB5) | `92.46% <100.00%> (+0.79%)` | :arrow_up: |
| ... and [6 more](https://codecov.io/gh/crocs-muni/sec-certs/pull/328?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni) | | |

... and [6 files with indirect coverage changes](https://codecov.io/gh/crocs-muni/sec-certs/pull/328/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=crocs-muni)


adamjanovsky commented 1 year ago

@J08nY Could you please expose CPEDataset, CVEDataset and the JSON of the CPE Match feed somewhere on seccerts.org in compressed form?

The URLs are in the settings: https://github.com/crocs-muni/sec-certs/blob/fc638a859741a6cd8096c23789fc8ddff5236272/src/sec_certs/configuration.py#L72-L80, feel free to change them as you see fit.

The CPEDataset and CVEDataset instances can be compressed with to_json(compress=True). The CPE Match feed is just a JSON file, so it has to be handled separately.

Basically, now we just have to decide on the URLs. Can you do that and change the settings keys accordingly?

J08nY commented 1 year ago

Where do I get the CPE match feed? What processing do I need to do to obtain it?

adamjanovsky commented 1 year ago

> Where do I get the CPE match feed? What processing do I need to do to obtain it?

If you have a processed dataset available, the JSON should sit in the auxiliary_datasets directory. Otherwise, you can obtain it with _prepare_cpe_match_dict(): https://github.com/crocs-muni/sec-certs/blob/d3d470ed408fde3638d26b49a7f6a403fe57c7e9/src/sec_certs/dataset/dataset.py#L400

You can either copy the contents of the method, or just create a new dataset at some path and call the method right away. E.g.,

import gzip
import json

from sec_certs.dataset import CCDataset

# Placeholder path for the dataset root.
cc_dset = CCDataset(root_dir="/whatever/path")
cpe_match_dict = cc_dset._prepare_cpe_match_dict()

# gzip.open in "w" mode is binary, so the JSON string is encoded before writing.
with gzip.open("/path/to/store/cpe_match_dict.json", "w") as handle:
    json_str = json.dumps(cpe_match_dict, indent=4)
    handle.write(json_str.encode("utf-8"))
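
To check that the compressed file written as in the snippet above is usable, it can be read back with the standard library alone. A small round-trip sketch; the dictionary content and the temporary `.json.gz` path are placeholders, and opening the file in text mode is just one option that avoids encoding the string by hand.

```python
import gzip
import json
import os
import tempfile

# Placeholder content standing in for the real CPE match dictionary.
cpe_match_dict = {"example-criteria-id": ["cpe:2.3:a:vendor:product:1.0:*:*:*:*:*:*:*"]}
path = os.path.join(tempfile.mkdtemp(), "cpe_match_dict.json.gz")

# Text mode ("wt") lets json.dump write str objects directly.
with gzip.open(path, "wt", encoding="utf-8") as handle:
    json.dump(cpe_match_dict, handle, indent=4)

# Reading back: gzip decompresses transparently, json.load parses the text.
with gzip.open(path, "rt", encoding="utf-8") as handle:
    restored = json.load(handle)
```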

To get the datasets from NVD, you need to obtain the NVD API key and set the following two keys in your yaml settings:

nvd_api_key: <actual-api-key>
preferred_source_nvd_datasets: "api"

adamjanovsky commented 1 year ago

@J08nY

Regarding import time optimization, this post has a nice summary of different approaches that you can use to address this: https://adamj.eu/tech/2023/03/02/django-profile-and-improve-import-time/

I did some profiling. As of now:

(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.dataset'
python -c 'import sec_certs.dataset'  3.28s user 0.54s system 111% cpu 3.413 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.sample' 
python -c 'import sec_certs.sample'  1.79s user 0.34s system 125% cpu 1.700 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.model' 
python -c 'import sec_certs.model'  3.38s user 0.53s system 111% cpu 3.493 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.utils'
python -c 'import sec_certs.utils'  0.03s user 0.01s system 93% cpu 0.041 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs'      
python -c 'import sec_certs'  0.03s user 0.01s system 93% cpu 0.043 total

I deferred a few imports, see: 88f4630

Profiling after:

(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.dataset'
python -c 'import sec_certs.dataset'  1.48s user 0.28s system 131% cpu 1.343 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.sample'
python -c 'import sec_certs.sample'  1.47s user 0.29s system 131% cpu 1.336 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.model'
python -c 'import sec_certs.model'  1.50s user 0.29s system 131% cpu 1.365 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python -c 'import sec_certs.utils'
python -c 'import sec_certs.utils'  0.03s user 0.01s system 92% cpu 0.044 total
(venv) ~/phd/projects/certificates/sec-certs  $ time python  -c 'import sec_certs'    
python -c 'import sec_certs'  0.03s user 0.01s system 93% cpu 0.041 total

So, from 3.3 seconds we go to 1.5. Any further reduction would require:

Deferring numpy (doable, but needs some thinking, saves 0.1s or 10%)
Deferring BS4 (doable, but needs some thinking, saves 0.1s or 10%)
Deferring pandas (not feasible, it is used in typing quite a bit; would save 0.4s or 30%)
Ditching imports from __init__.py files; they eat approx. 35% on their own, i.e., without pandas etc.

I did the profiling with python -X importtime yourfile.py 2> import.log and https://pypi.org/project/tuna/.
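
If a quick text summary is enough, the log can also be inspected without tuna. A minimal sketch that parses `-X importtime` output and lists the slowest imports by cumulative time. It assumes the usual `import time: self [us] | cumulative | imported package` line layout; the helper names and the sample numbers are made up for illustration.

```python
def parse_importtime(log_text):
    """Return (module, self_us, cumulative_us) tuples from an importtime log."""
    rows = []
    prefix = "import time:"
    for line in log_text.splitlines():
        if not line.startswith(prefix):
            continue
        fields = [field.strip() for field in line[len(prefix):].split("|")]
        # Skip the header line and anything that does not look like a data row.
        if len(fields) != 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        rows.append((fields[2], int(fields[0]), int(fields[1])))
    return rows


def slowest(rows, top=3):
    """Sort by cumulative time, largest first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:top]


# Made-up sample in the expected layout, not a real measurement.
sample_log = """\
import time: self [us] | cumulative | imported package
import time:       100 |        100 |   numpy.core
import time:       200 |        300 | numpy
import time:       500 |        900 | pandas
"""

top_two = slowest(parse_importtime(sample_log), top=2)
```

For a real run, the text of import.log would be passed to `parse_importtime` instead of `sample_log`.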

I consider this to be an OK result and I will invest no more effort into this unless you promote the issue.

Edit: Also note that the module behind an import placed inside a function is loaded only once, AFAIK; subsequent calls reuse the cached module.
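
A small sketch of the deferred-import pattern and of the caching behaviour mentioned in the edit note. The `csv` module only stands in for a heavy dependency; `render_table` is a hypothetical function, not code from sec-certs.

```python
import sys


def render_table(rows):
    """Format rows as CSV text; `csv` stands in for a heavy dependency."""
    # Deferred import: runs when the function is first called,
    # not when the enclosing module is imported.
    import csv
    import io

    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


text = render_table([["a", 1], ["b", 2]])

# After the first call the module sits in sys.modules, so later calls only
# look it up there instead of executing the module code again.
cached_module = sys.modules["csv"]
render_table([["c", 3]])
same_module = sys.modules["csv"] is cached_module
```

The cost of the repeated import statement is therefore a dictionary lookup per call, which is usually negligible next to the work the function does.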
