fkie-cad / nvd-json-data-feeds

Community reconstruction of the legacy JSON NVD Data Feeds. This project uses and redistributes data from the NVD API but is neither endorsed nor certified by the NVD.

CVE-2023-34057 data is incomplete #13

Closed yann-morin-1998 closed 6 months ago

yann-morin-1998 commented 7 months ago

The data for CVE-2023-34057 differs between the content in the git repository and the content in the daily per-year feeds:

I noticed just this one by pure chance; I haven't investigated whether there are others, or how pervasive the discrepancies might be.

arnout commented 7 months ago

I notice that on 2023-11-07, no update was done between 0:55 and 21:02, and that in the 21:02 update (commit b47d02a3332d3f9d1e436babc7b939291ea76ece) 35K CVEs were updated.

While experimenting with NVD APIv2 I noticed that it is not at all robust against race conditions. Both the paginated queries and the date-based queries may cause an update to be missed when NVD gets updated in the middle of (a series of) queries. With 35K updates, this is quite likely to occur.
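
For illustration, here is a minimal sketch (not this project's bot code) of the kind of date-windowed, paginated pull this is about. The endpoint and the lastModStartDate / lastModEndDate / startIndex / resultsPerPage parameters come from the public API 2.0 spec; the helper name, page size, and missing rate limiting and retries are purely illustrative. If NVD modifies records while the loop is still paging, totalResults and the page boundaries shift between requests, and individual updates can be missed:

    import requests

    NVD_API_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

    def fetch_modified_window(start_iso: str, end_iso: str) -> list[dict]:
        """Fetch all CVEs whose lastModified falls inside [start_iso, end_iso]."""
        vulnerabilities: list[dict] = []
        start_index = 0
        while True:
            resp = requests.get(
                NVD_API_URL,
                params={
                    "lastModStartDate": start_iso,
                    "lastModEndDate": end_iso,
                    "startIndex": start_index,
                    "resultsPerPage": 2000,
                },
                timeout=60,
            )
            resp.raise_for_status()
            data = resp.json()
            page = data["vulnerabilities"]
            vulnerabilities.extend(page)
            start_index += len(page)
            # If NVD updates records mid-pagination, totalResults and the page
            # contents change between requests, so entries can silently move
            # out of (or into) the pages we have not fetched yet.
            if not page or start_index >= data["totalResults"]:
                return vulnerabilities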

I don't understand though how it is possible that the daily archive does have the correct information...

rhelmke commented 7 months ago

Hi everyone,

thank you @yann-morin-1998 for pointing this out, and thank you @arnout for your valuable additions. I found out what is going on, and I truly believe the cause is race conditions during synchronization. (Thanks again, @arnout, you saved me a lot of time.)

First, some facts: the data in the cache is consistent with the NVD. There are no duplicates of the kind we addressed in #10, and the version of CVE-2023-34057 that is inside the repo's file tree should have disappeared long ago. Since the release package holds the right data but the source tree in this repo does not, the fault must be in the cache query used to flush the data to the file system. I think the issue becomes apparent when taking a look at the code.

This is a snippet from the code that creates release packages:

    begin_of_time: datetime = datetime.fromisoformat("1970-01-01T00:00:00.000+00:00")

    version: str
    sha: str

    with service.FeedRelease() as release:
        # Step 1: Create and compress all feeds

        # 1.1 CVE-YYYY.json.xz
        for year in range(1999, datetime.utcnow().year + 1):
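            # note: the query window here is begin_of_time..exec_timestamp,
            # i.e. every CVE of the year is re-exported from the cache on each run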
            cve_per_year: list[dict] = [
                cve["cve"]
                for cve in opensearch_client.get_cves_by_year_within_mod_range(year, begin_of_time, exec_timestamp)
            ]
            release.create_feed_json_xz(f"CVE-{year}", cve_per_year, exec_timestamp)

This is the code that coordinates repo updates:

    logger.info(f"updating git repository by fetching the latest NVD data from OpenSearch")
    repo: GithubRepo = GithubRepo(opensearch_client, time_anchors)
    updated_cves: list[dict[str, Any]] = []

    # get the last commit starting with "Auto-Update: <ISO-TIMESTAMP>" this is the last
    # modified_since date.
    # The timestamp we use in commit messages equals to the highest `lastModified`
    # value in the updated CVE dataset. 
    modified_since: datetime = repo.last_auto_update_from_commit_history()

    logger.debug(f"last auto-update commit was at timestamp {modified_since}")

    # iterate CVE-year-XXXX, fetch all modified items, update repo
    for year in range(1999, datetime.utcnow().year + 1):
        cves_per_year: list[dict] = [
            cve
            for cve in opensearch_client.get_cves_by_year_within_mod_range(
                year, start=modified_since, stop=exec_timestamp
            )
        ]
        if cves_per_year:
            logger.info(f"updating {len(cves_per_year)} newly modified CVES from {year}:")
        for cve in cves_per_year:
            repo.update_cve_file(cve)
            updated_cves += [cve]

See the difference? begin_of_time. In the latter snippet, we get CVEs by year but try to minimize file I/O by only writing what has changed since the last commit. Apparently, this is unreliable: modified_since is derived from the highest lastModified value of the previous run, so if a CVE reaches the cache only after the commit timestamp has already moved past its lastModified value (for example, because one of the racy API pulls @arnout described missed it at first), the incremental query never selects it again, while the release build, which starts at begin_of_time, still exports it. Some updates must fall through the cracks at exactly this point.

Here's my suggestion for what is going to change:

  1. I'll check the file/cache consistency of each CVE and let you guys know if there are any more occurrences (something tells me there are).
  2. We'll decouple the file update mechanism from any timestamp. We'll calculate the hash sum for each cache/file pair and flush any differences that have occurred since the last commit (see the sketch after this list). This means the naive assumption of reliable modification timestamps disappears.
  3. We'll establish more transparency by also documenting in the changelog why a CVE was updated -> modified since the last commit / hash differs in the file system.
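
To make point 2 a bit more concrete, here is a rough sketch of the idea (illustrative only, these are not the bot's actual helpers): serialize each cached CVE deterministically, hash it, and rewrite the repo file whenever cache and file disagree, independent of any lastModified timestamp.

    import hashlib
    import json
    from pathlib import Path

    def canonical_json(cve: dict) -> bytes:
        # deterministic serialization so identical content always hashes identically
        return json.dumps(cve, sort_keys=True, separators=(",", ":")).encode()

    def flush_if_changed(cve: dict, path: Path) -> bool:
        new_blob = canonical_json(cve)
        new_hash = hashlib.sha256(new_blob).hexdigest()
        if path.exists() and hashlib.sha256(path.read_bytes()).hexdigest() == new_hash:
            return False  # cache and repo file already agree, nothing to write
        path.write_bytes(new_blob)
        return True  # file was created or updated -> record the reason in the changelog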

I'm open to any suggestions :-)

rhelmke commented 6 months ago

Alright, after testing over the past few days, I've deployed a new version of the synchronization bot :-).

It now works as described above. I just triggered a manual repo/cache synchronization and, as of commit 34e34f7, all inconsistencies should be gone.

We now also generate a _state.csv file, which gets updated with each commit and records the cache state for each CVE. The CSV header is:

cve,new,changed,sha256,lastModifiedNVD

Here, new and changed are booleans (0/1) that mark CVEs as newly added or modified in that commit. One could, of course, reconstruct this data from the git history, but I thought it would add some convenience. The added 25 megabytes of CSV data are worth it, IMHO.
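
If you want to consume it, a minimal illustrative snippet (not part of the bot, it only relies on the header documented above):

    import csv

    # list all CVEs that were flagged as new or changed in the current commit
    with open("_state.csv", newline="") as fh:
        for row in csv.DictReader(fh):
            if row["new"] == "1" or row["changed"] == "1":
                print(row["cve"], row["sha256"], row["lastModifiedNVD"])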

As for previous inconsistencies between cache and repo files, I identified 452 CVEs. Out of these, 4 were missing in the repo and the remaining 448 didn't receive an update. Starting with the last forced synchronization, these inconsistencies should disappear. Here's the raw data. The format should be self-explanatory :-).

CVE-2023-34057 should now be fixed, too :-).

We hope the recent changes properly address this issue and apologize for the bug :-)

Cheers