Closed adamjanovsky closed 1 year ago
Patch coverage: 74.42% and project coverage change: +0.84 :tada:
Comparison is base (5893352) 76.61% compared to head (d4825d1) 77.44%.
@J08nY Could you pls expose `CPEDataset`, `CVEDataset` and the json of the CPE Match feed somewhere on seccerts.org in compressed form?
The URLs are in the settings: https://github.com/crocs-muni/sec-certs/blob/fc638a859741a6cd8096c23789fc8ddff5236272/src/sec_certs/configuration.py#L72-L80, feel free to change them as you find fitting.
The `CPEDataset` and `CVEDataset` instances can be compressed with `to_json(compress=True)`. The CPE Match feed is just a json, so it has to be handled separately.
Basically, now we just have to decide the URLs. Can you do that and change settings keys accordingly?
Where do I get the CPE match feed? What processing do I need to do to obtain it?
If you have a processed dataset available, the json should sit in the `auxiliary_datasets` directory. Otherwise, you can obtain it with `_prepare_cpe_match_dict()`: https://github.com/crocs-muni/sec-certs/blob/d3d470ed408fde3638d26b49a7f6a403fe57c7e9/src/sec_certs/dataset/dataset.py#L400
You can either copy the contents of the method, or just create a new dataset at some path and call the method right away. E.g.,
import gzip
import json

from sec_certs.dataset import CCDataset

cc_dset = CCDataset(root_dir="/whatever/path")
cpe_match_dict = cc_dset._prepare_cpe_match_dict()

# gzip.open(..., "w") is binary mode, so the JSON string is encoded first
with gzip.open("/path/to/store/cpe_match_dict.json", "w") as handle:
    json_str = json.dumps(cpe_match_dict, indent=4)
    handle.write(json_str.encode("utf-8"))
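Whoever consumes the published file will need the reverse direction as well. A minimal sketch of a gzip/JSON round trip using only the standard library; `dump_json_gz` and `load_json_gz` are hypothetical helper names, not part of sec-certs, and the on-disk layout is assumed to match the snippet above (gzip-compressed UTF-8 JSON):

```python
import gzip
import json


def dump_json_gz(obj, path):
    """Serialize obj as UTF-8 JSON and gzip it to path (hypothetical helper)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(obj, handle, indent=4)


def load_json_gz(path):
    """Inverse of dump_json_gz: decompress the file and parse the JSON."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Text mode (`"wt"` / `"rt"`) lets `json` write to and read from the handle directly, so no manual `encode` / `decode` step is needed.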
To get the datasets from NVD, you need to obtain the NVD API key and set the following two keys in your yaml settings:
nvd_api_key: <actual-api-key>
preferred_source_nvd_datasets: "api"
@J08nY
Regarding import time optimization, this post has a nice summary of different approaches that you can use to address this: https://adamj.eu/tech/2023/03/02/django-profile-and-improve-import-time/
I did some profiling. As of now:
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs.dataset'
python -c 'import sec_certs.dataset' 3.28s user 0.54s system 111% cpu 3.413 total
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs.sample'
python -c 'import sec_certs.sample' 1.79s user 0.34s system 125% cpu 1.700 total
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs.model'
python -c 'import sec_certs.model' 3.38s user 0.53s system 111% cpu 3.493 total
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs.utils'
python -c 'import sec_certs.utils' 0.03s user 0.01s system 93% cpu 0.041 total
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs'
python -c 'import sec_certs' 0.03s user 0.01s system 93% cpu 0.043 total
I deferred a few imports, see: 88f4630
Profiling after:
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs.dataset'
python -c 'import sec_certs.dataset' 1.48s user 0.28s system 131% cpu 1.343 total
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs.sample'
python -c 'import sec_certs.sample' 1.47s user 0.29s system 131% cpu 1.336 total
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs.model'
python -c 'import sec_certs.model' 1.50s user 0.29s system 131% cpu 1.365 total
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs.utils'
python -c 'import sec_certs.utils' 0.03s user 0.01s system 92% cpu 0.044 total
(venv) ~/phd/projects/certificates/sec-certs $ time python -c 'import sec_certs'
python -c 'import sec_certs' 0.03s user 0.01s system 93% cpu 0.041 total
So, from 3.3 seconds we go to 1.5. Any further reduction would require:
`__init__.py` files, they eat approx. 35% on their own, i.e., without pandas etc.
I did the profiling with `python -X importtime yourfile.py 2> import.log` and https://pypi.org/project/tuna/.
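If you want a quick look at `import.log` without tuna, the log can be parsed directly. A rough sketch, assuming the usual `-X importtime` line layout (`import time: <self us> | <cumulative us> | <module>`); `parse_importtime` and `slowest` are made-up helper names:

```python
import re

# One data line looks like: "import time:       300 |        420 | numpy"
# The header line and any unrelated stderr output do not match and are skipped.
LINE_RE = re.compile(r"^import time:\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\S+)\s*$")


def parse_importtime(log_text):
    """Return (module, self_us, cumulative_us) tuples in log order."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        self_us, cumulative_us, name = match.groups()
        rows.append((name, int(self_us), int(cumulative_us)))
    return rows


def slowest(log_text, n=10):
    """Return the n entries with the largest cumulative import time."""
    rows = parse_importtime(log_text)
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]
```

Usage would be something like `slowest(open("import.log").read())`. Sorting by cumulative time surfaces the top-level packages that dominate; sorting by the self time column instead would point at individual expensive modules.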
I consider this to be an OK result and I will invest no more effort into this unless you promote the issue.
Edit: Also note that imports placed inside functions should be executed only once AFAIK; later calls just find the already loaded module.
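To illustrate the deferral pattern and why repeated calls stay cheap, here is a small sketch. It is not the code from the commit: `colorsys` merely stands in for a heavy dependency such as pandas, and both function names are invented for the example.

```python
import sys


def rgb_to_hsv(r, g, b):
    # Deferred import: the module is loaded on the first call only.
    # Later calls find it in sys.modules, which is a dict lookup
    # rather than a fresh import.
    import colorsys  # stand-in for a heavy dependency

    return colorsys.rgb_to_hsv(r, g, b)


def is_loaded(module_name):
    """Return True if the module was already imported in this process."""
    return module_name in sys.modules
```

With this layout, importing the module that defines `rgb_to_hsv` does not pay for `colorsys`; the cost moves to the first call of the function.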
Closes #324
TODO
Start using cached CPEs again (makes no sense since `CVEDataset` no longer works with them)
Endpoints to use:
New tests
Notes: