cve-search / CveXplore

CveXplore
https://cve-search.github.io/CveXplore/
GNU General Public License v3.0

Avoid full file downloads in sources_process #270

Closed (oh2fih closed this 7 months ago)

oh2fih commented 7 months ago

The nist.nvd_nist_api part is now brilliant for regular updates with minimal steps: it downloads only the new CPEs & CVEs from the API. For the other sources, which do not provide an API, we still download the entire file every time the database is updated, even though the data does not change that often (as the Last-Modified values below show, some files have not been modified since the last update).

| source | URL | Last-Modified | Content-Length |
| --- | --- | --- | --- |
| cwe | https://cwe.mitre.org/data/xml/cwec_latest.xml.zip | Thu, 29 Feb 2024 13:58:32 GMT | 1720673 (1.6 MB) |
| capec | https://capec.mitre.org/data/xml/capec_latest.xml | Tue, 24 Jan 2023 18:32:31 GMT | 3849998 (3.7 MB) |
| via4 | https://www.cve-search.org/feeds/via4.json | Mon, 08 Apr 2024 09:10:58 GMT | 94524266 (90.1 MB) |
| epss | https://epss.cyentia.com/epss_scores-current.csv.gz (=> location: epss_scores-2024-04-08.csv.gz) | Mon, 08 Apr 2024 09:14:57 GMT | 1521405 (1.5 MB) |

I suggest saving the Last-Modified & Content-Length headers and first sending only an HTTP HEAD request for the file (to the final destination of any redirects; the Python equivalent of curl -I -L). The update for that source should start only if either of these values has changed since the previous update. That would allow shorter update intervals without increasing load & traffic on the source servers.
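To make the idea concrete, here is a minimal sketch of the check, not the code that ended up in CveXplore. It assumes the third-party requests library is available; the names FINGERPRINT_HEADERS, fetch_fingerprint and needs_update are hypothetical and only illustrate the approach.

```python
from typing import Mapping, Optional

# Headers used as a cheap fingerprint of the remote file (names are illustrative).
FINGERPRINT_HEADERS = ("Last-Modified", "Content-Length")


def fetch_fingerprint(url: str, timeout: float = 30.0) -> dict:
    """Send a HEAD request, following redirects (like `curl -I -L`)."""
    # Imported here so the comparison helper below works without the dependency.
    import requests

    response = requests.head(url, allow_redirects=True, timeout=timeout)
    response.raise_for_status()
    return {name: response.headers.get(name) for name in FINGERPRINT_HEADERS}


def needs_update(previous: Optional[Mapping], current: Mapping) -> bool:
    """Return True when the source should be downloaded again."""
    if not previous:
        return True  # nothing cached yet, so a first download is needed
    for name in FINGERPRINT_HEADERS:
        value = current.get(name)
        # A missing header gives no guarantee, so fall back to downloading.
        if value is None or value != previous.get(name):
            return True
    return False


# Offline example using the capec values from the table above.
cached = {
    "Last-Modified": "Tue, 24 Jan 2023 18:32:31 GMT",
    "Content-Length": "3849998",
}
unchanged = needs_update(cached, dict(cached))  # False: both headers match
```

In a real update run, `needs_update(cached, fetch_fingerprint(url))` would decide whether the full download starts; treating a missing header as "changed" errs on the side of updating.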

Any thoughts on this approach & how the cached headers should be saved?

oh2fih commented 7 months ago

I was thinking of a rather simple solution, if you don't find it too quick and dirty:

Note to self:

P-T-I commented 7 months ago

This would be a nice addition! Regarding saving the headers to a file: we could just as well save them to the info collection of the database, which saves the hassle of dealing with files.

oh2fih commented 7 months ago

Thanks for the hint. Because the info collection already had this information cached, we didn't even need an additional cache and could simply compare against the value stored there. That made adding the new functionality rather straightforward. Please review the pull request.
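As a rough illustration of comparing against the value already stored, the sketch below uses a plain dict as a stand-in for a document from the info collection. The document shape and the field name lastModified are assumptions made for this example, as are the helper names; the actual change is in the pull request.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_last_modified(header_value: str) -> datetime:
    """Parse an HTTP date such as 'Thu, 29 Feb 2024 13:58:32 GMT' into UTC."""
    return parsedate_to_datetime(header_value).astimezone(timezone.utc)


def source_is_unchanged(
    info_doc: Optional[Mapping], last_modified_header: Optional[str]
) -> bool:
    """Return True only if the stored timestamp equals the remote Last-Modified."""
    if info_doc is None or last_modified_header is None:
        return False  # no cached value or no header: cannot skip the update
    stored = info_doc.get("lastModified")  # hypothetical field name
    if stored is None:
        return False
    if stored.tzinfo is None:
        # pymongo returns naive datetimes in UTC by default.
        stored = stored.replace(tzinfo=timezone.utc)
    return parse_last_modified(last_modified_header) == stored


# Stand-in for a document read from the info collection (shape is assumed).
info_doc = {"db": "cwe", "lastModified": datetime(2024, 2, 29, 13, 58, 32)}
skip_download = source_is_unchanged(info_doc, "Thu, 29 Feb 2024 13:58:32 GMT")  # True
```

Comparing parsed datetimes rather than raw header strings avoids false mismatches when the database stores the timestamp as a date type instead of text.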

After this I can simply lower the update interval of CVE-Search's systemd timer from 2 hours to 1 hour, which would also match the sleep time of db_updater.py -v -l (loop mode). :+1: