clearlydefined / service

The service side of clearlydefined.io
MIT License

database definitions do not match production-definition blobs #1142

Closed: elrayle closed this issue 1 month ago

elrayle commented 3 months ago

Description

The Cosmos DB definitions do not always match the production-definition blobs. This leads to missing licenses in the UI search result list and in the changes-notifications definitions.

Attempts at reducing out-of-sync problems

Approach to improve changes-notifications

The change notification script uses the database to find recent changes and gather the definition for each changed coordinate. As the production-definition blobs are the source of truth, the script will be updated to use these blob store definitions as the source for definitions recorded in the changes-notifications blob store.
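
A rough sketch of the intended flow (illustrative only, not the actual change-notification code; the collection name and blob URL layout match the exploration script later in this thread):

import os
import urllib.parse

import pymongo
import requests

BLOB_BASE = "https://clearlydefinedprod.blob.core.windows.net/production-definition"

def definition_from_blob(coordinates):
    """Fetch the authoritative definition from the production-definition blobs."""
    parts = [
        coordinates["type"],
        coordinates["provider"],
        coordinates.get("namespace") or "-",
        coordinates["name"],
        "revision",
        coordinates["revision"],
    ]
    path = urllib.parse.quote("/".join(parts).lower() + ".json")
    res = requests.get(f"{BLOB_BASE}/{path}")
    res.raise_for_status()
    return res.json()

# The DB is used only to discover which coordinates changed recently;
# the definition recorded in changes-notifications comes from the blob store.
client = pymongo.MongoClient(os.environ["MONGO_CONNECTION_STRING"])
collection = client["clearlydefined"]["definitions-trimmed"]
for doc in collection.find({"_meta.updated": {"$gte": "2024-06-01"}}):
    definition = definition_from_blob(doc["coordinates"])
    # ... write `definition` to the changes-notifications blob store ...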

Backfilling data

A script will be needed to find DB entries with missing licenses, compare them to the production-definition blob, and update the DB definition as needed. The script used for exploration can likely be modified to perform the update task.

This script can also be used to backfill the changes-notifications definitions to resync them with production-definition blobs at the same time as the resync for DB definitions. See https://github.com/clearlydefined/operations/issues/81 for more information.

NOTE: There may be other sync errors besides the missing license. It would be very expensive to touch every DB entry and compare it to the corresponding blob. Addressing just the ones missing licenses will have the biggest benefit at the lowest possible cost. The processing will still be extensive.
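
As a concrete illustration of the comparison-and-update step, a minimal sketch follows (reusing fetch_blob, BASE_AZURE_BLOB_URL, and collection from the exploration script later in this thread; the real tool needs a read-write connection string and should handle batching and retries):

# Hypothetical backfill pass: for DB documents missing licensed.declared,
# copy the licensed section from the production-definition blob when the
# blob has a declared license.
query = {"licensed.declared": {"$exists": False}}

for doc in collection.find(query):
    coords = doc["coordinates"]
    blob = fetch_blob(
        BASE_AZURE_BLOB_URL,
        coords["type"],
        coords["provider"],
        coords.get("namespace", "-"),
        coords["name"],
        coords["revision"],
    )
    declared = blob.get("licensed", {}).get("declared")
    if declared:
        collection.update_one(
            {"_id": doc["_id"]},
            {"$set": {"licensed": blob["licensed"]}},
        )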

Related Work

ajhenry commented 3 months ago

Adding some data from my deep dive investigation.

I wrote a script that queries MongoDB for all definitions that are missing the field licensed.declared. From there I took the coordinates of those documents and cross-referenced 5000 of them against the blob store to compare results.

What I found is that roughly 10% of definitions without a licensed.declared field are out of sync with the blob store. The timestamps do not match, and the blob store is updated very shortly after the database.

Here are some stats for 2024 around the data I pulled from the database.

{
  "2024-01": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 657,
      "total_documents": 84167,
      "total_estimated_invalid": 11059.5438,
      "sample_percent_of_total": 5.94
    }
  },
  "2024-02": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 227,
      "total_documents": 31828,
      "total_estimated_invalid": 1444.9912000000002,
      "sample_percent_of_total": 15.71
    }
  },
  "2024-03": {
    "stats": {
      "sample_total": 4436,
      "sample_invalid": 414,
      "total_documents": 4436,
      "total_estimated_invalid": 414,
      "sample_percent_of_total": 9.33
    }
  },
  "2024-04": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 346,
      "total_documents": 56627,
      "total_estimated_invalid": 3918.5884,
      "sample_percent_of_total": 8.83
    }
  },
  "2024-05": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 499,
      "total_documents": 116130,
      "total_estimated_invalid": 11589.774,
      "sample_percent_of_total": 4.31
    }
  },
  "2024-06": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 493,
      "total_documents": 61892,
      "total_estimated_invalid": 6102.5512,
      "sample_percent_of_total": 8.08
    }
  }
}

The script is as follows:

You will need to set MONGO_CONNECTION_STRING either in a .env file or on the command line. This connection string should be read-only and can be generated from the Azure Cosmos DB dashboard.

script.py

import json
import os
import urllib.parse

import pymongo
import requests
from azure.cosmos import cosmos_client
from dotenv import load_dotenv

load_dotenv()

MONGO_CONNECTION_STRING = str(os.environ.get("MONGO_CONNECTION_STRING"))
DB_NAME = "clearlydefined"
COLLECTION_NAME = "definitions-trimmed"
BASE_AZURE_BLOB_URL = (
    "https://clearlydefinedprod.blob.core.windows.net/production-definition"
)

# Example blob URL
# https://clearlydefinedprod.blob.core.windows.net/production-definition/composer/packagist/00f100/fcphp-cache/revision/0.1.0.json

# Mongo document with unused fields removed
# {
#   "_id": "composer/packagist/00f100/fcphp-cache/0.1.0",
#   "_meta": {
#     "schemaVersion": "1.6.1",
#     "updated": "2019-08-29T02:06:54.498Z"
#   },
#   "coordinates": {
#     "type": "composer",
#     "provider": "packagist",
#     "namespace": "00f100",
#     "name": "fcphp-cache",
#     "revision": "0.1.0"
#   },
#   "licensed": {
#     "declared": "MIT",
#     "toolScore": {
#       "total": 17,
#       "declared": 0,
#       "discovered": 2,
#       "consistency": 0,
#       "spdx": 0,
#       "texts": 15
#     },
#     "score": {
#       "total": 17,
#       "declared": 0,
#       "discovered": 2,
#       "consistency": 0,
#       "spdx": 0,
#       "texts": 15
#     }
#   }
# }

def fetch_blob(base_url, type, provider, namespace, name, revision):
    """Fetch the blob from the azure blob storage"""
    # need to encode the url for the %2f characters
    url = urllib.parse.quote(
        f"{type}/{provider}/{namespace}/{name}/revision/{revision}.json".lower()
    )
    url = f"{base_url}/{url}"
    # Fetch the data from the blob storage
    res = requests.get(url)
    if res.status_code != 200:
        return {}
    return res.json()

def dump_data(data, filename):
    with open(filename, "w") as f:
        json.dump(data, f)

client = pymongo.MongoClient(MONGO_CONNECTION_STRING)

db = client[DB_NAME]
if DB_NAME not in client.list_database_names():
    print(f"Database '{DB_NAME}' not found.")
else:
    print(f"Using database: '{DB_NAME}'.")

collection = db[COLLECTION_NAME]
if COLLECTION_NAME not in db.list_collection_names():
    print(f"Collection '{COLLECTION_NAME}' not found.")
else:
    print(f"Using collection: '{COLLECTION_NAME}'.")

months = ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"]

invalid_data = {}

for month in months:
    docs = collection.find(
        {
            "_meta.updated": {"$gte": f"{month}-01", "$lte": f"{month}-31"},
            "licensed.declared": {"$exists": False},
        },
        max_time_ms=10000000,
    ).limit(5000)

    doc_count = collection.count_documents(
        {
            "_meta.updated": {"$gte": f"{month}-01", "$lte": f"{month}-31"},
            "licensed.declared": {"$exists": False},
        },
        maxTimeMS=10000000,  # count_documents expects the maxTimeMS option
    )

    invalid_data[month] = {
        "stats": {
            "sample_total": 0,
            "sample_invalid": 0,
        }
    }
    count = 0

    for doc in docs:
        count += 1
        blob = fetch_blob(
            BASE_AZURE_BLOB_URL,
            doc["coordinates"]["type"],
            doc["coordinates"]["provider"],
            doc["coordinates"].get("namespace", "-"),
            doc["coordinates"]["name"],
            doc["coordinates"]["revision"],
        )
        db_licensed = doc.get("licensed", {})
        blob_licensed = blob.get("licensed", {})

        if db_licensed.get("declared") != blob_licensed.get("declared"):
            # only adding the licensed and meta fields to the invalid data
            invalid_data[month][doc["_id"]] = {
                "db": {
                    "licensed": (db_licensed.get("declared")),
                    "_meta": doc.get("_meta", {}),
                },
                "blob": {
                    "licensed": (blob_licensed.get("declared")),
                    "_meta": blob.get("_meta", {}),
                },
            }

        # Checkpoint in case mongo dies
        if count % 100 == 0:
            # len() - 1 excludes the "stats" key from the invalid-entry count
            sample_invalid = len(invalid_data[month]) - 1
            print(
                f"Checkpoint: total number of invalid data: {sample_invalid}, total items {count} ({sample_invalid / count * 100}%)"
            )
            invalid_data[month]["stats"]["sample_total"] = count
            invalid_data[month]["stats"]["sample_invalid"] = sample_invalid
            dump_data(invalid_data, "2024-invalid_data.json")

    sample_invalid = len(invalid_data[month]) - 1
    invalid_data[month]["stats"]["sample_total"] = count
    invalid_data[month]["stats"]["sample_invalid"] = sample_invalid
    invalid_data[month]["stats"]["total_documents"] = doc_count
    invalid_data[month]["stats"]["total_estimated_invalid"] = doc_count * (
        sample_invalid / count
    )
    # percentage of the month's documents covered by the sample
    invalid_data[month]["stats"]["sample_percent_of_total"] = count / doc_count * 100
    dump_data(invalid_data, "2024-invalid_data.json")
    print("Done")
    print(
        f"Total number of invalid data: {sample_invalid}, total items {count} ({sample_invalid / count * 100}%)"
    )

requirements.txt

azure-core==1.30.2
azure-cosmos==4.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
dnspython==2.6.1
idna==3.7
pymongo==4.7.3
python-dotenv==1.0.1
requests==2.32.3
six==1.16.0
typing_extensions==4.12.2
urllib3==2.2.1

The script can be run with the following commands:

python3 -m venv .venv  
source .venv/bin/activate
python3 -m pip install -r requirements.txt
python3 script.py

The script dumps its results to a JSON file containing the differences.

I have attached the 2024 results for deeper dives into those coordinates: 2024-invalid-data.json
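
For anyone digging in, the attached file has the same shape the script writes: a per-month "stats" object plus one entry per out-of-sync coordinate. A small loader sketch (assuming the file is in the working directory):

import json

with open("2024-invalid-data.json") as f:
    results = json.load(f)

for month, entries in results.items():
    stats = entries.pop("stats", {})
    print(month, stats)
    for coordinate, diff in entries.items():
        # Each entry records the declared license from the DB vs the blob.
        print(f"  {coordinate}: db={diff['db']['licensed']} blob={diff['blob']['licensed']}")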

qtomlinson commented 3 months ago

Thanks for providing the coordinates! After some digging, the following are some potential reasons for the missing definitions in Cosmos DB.

  1. The absence of certain definitions in the new database collection can be attributed to the EU service using the older process. These missing definitions were discovered in the older, less efficient collection (see the cross-check sketch after this list):

    • pypi/pypi/-/dataverk-airflow/1.7.1
    • git/github/globalpayments/java-sdk/93ae65461cdc96d96402fa15fa65c87ef471e9d6
    • git/github/globalpayments/java-sdk/1a347e095026ab9eb3e8cdae3954fdb1e2fd0583
    • git/github/globalpayments/java-sdk/77b5504bd25b801142825484cc2c0e1177cabfb6
  2. Some definitions were logged by the EU service but failed to be written to the older collection due to errors. Some of the cases are as follows:

    • git/github/faktorips/faktorips.base/09892388a53fa8bcd1c0ec41f8b2a0b3d8b07795
      • blob "updated" timestamp: "2024-06-08T22:35:25.538Z"
      • Failure logged on EU service on SAT, 08 JUN 2024 22:35:28 GMT.
    • git/github/ruby/net-ftp/deda220a422cf1ff560c60017312984fe34ea354
      • blob "updated" timestamp: "2024-06-10T00:40:51.850Z"
      • Failure logged on EU service on MON, 10 JUN 2024 00:40:52 GMT.
    • pypi/pypi/-/xredutils/1.0.2
      • blob "updated" timestamp: "2024-06-10T09:47:40.673Z"
      • Failure logged by EU service on MON, 10 JUN 2024 09:47:40 GMT.
    • pypi/pypi/-/rticonnextdds-connector/1.2.0
      • blob "updated" timestamp: "2024-06-10T09:59:24.031Z"
      • Failure logged by EU service on MON, 10 JUN 2024 09:59:24 GMT.
    • git/github/chromedevtools/devtools-protocol/24b69f0bd145aad590ba370541548ad17e637845
      • blob "updated" timestamp: "2024-06-10T10:33:04.068Z"
      • Failure logged by EU service on MON, 10 JUN 2024 10:33:04 GMT.
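
For reference, a cross-check like the following can confirm where a given coordinate exists (this reuses the db handle from the script above; the older collection name "definitions-paged" and its document shape are assumptions, so adjust to whatever the older process actually writes to):

# Hypothetical cross-check between the newer trimmed collection and the
# older collection; "definitions-paged" is an assumed name.
OLD_COLLECTION_NAME = "definitions-paged"

def where_is(db, coordinate_id):
    """coordinate_id, e.g. 'pypi/pypi/-/dataverk-airflow/1.7.1'"""
    type_, provider, namespace, name, revision = coordinate_id.split("/", 4)
    query = {
        "coordinates.type": type_,
        "coordinates.provider": provider,
        "coordinates.name": name,
        "coordinates.revision": revision,
    }
    if namespace != "-":
        query["coordinates.namespace"] = namespace
    return {
        "trimmed": db["definitions-trimmed"].find_one(query) is not None,
        "older": db[OLD_COLLECTION_NAME].find_one(query) is not None,
    }

print(where_is(db, "pypi/pypi/-/dataverk-airflow/1.7.1"))
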
qtomlinson commented 3 months ago

@ajhenry There is a repo for scripts like this: the tools repo. @elrayle depending on the approach taken to backfilling the data, some migration tools written by @mpcen in the tools repo might be helpful.

elrayle commented 3 months ago

@qtomlinson Thanks for looking deeper into the results.

My plan is to add the script as a tool we can use periodically to be sure the data is in sync. This is excellent work. Thanks @ajhenry

elrayle commented 3 months ago

I ran the script that counts data that is out of sync between the DB and the production-definition container. Results...

Original results:

  "2024-06-01 thru 2024-06-20": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 493,
      "percent_invalid": 9.86%,
      "total_documents": 61892,
      "total_estimated_invalid": 6103,
      "sample_percent_of_total": 8.08%
    }

Results after changing the service to use the new, more efficient approach for definition generation and updating the DB consistency setting from Session to Strong:

  "2024-06-21 thru 2024-06-28": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 1,
      "total_documents": 10690,
      "total_estimated_invalid": 2.
      "sample_percent_of_total": 0.02%
    }
elrayle commented 1 month ago

Status:

Backfill is proceeding.

elrayle commented 1 month ago

Backfilling complete for Jan 2020 to Jul 2024. Impact: > 125k coordinates were out of sync and are now repaired.