elrayle closed this issue 1 month ago.
Adding some data from my deep dive investigation.
I wrote a script that queries MongoDB for all definitions missing the licensed.declared field. From there, I took the coordinates of 5000 of those and cross-referenced them against the blob store to compare results.
What I found is that roughly 10% of definitions without a licensed.declared field are out of sync with the blob store: the timestamps do not match, and the blob store is updated very shortly after the database.
Here are some 2024 stats for the data I pulled from the database:
```json
{
  "2024-01": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 657,
      "total_documents": 84167,
      "total_estimated_invalid": 11059.5438,
      "sample_percent_of_total": 5.94
    }
  },
  "2024-02": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 227,
      "total_documents": 31828,
      "total_estimated_invalid": 1444.9912000000002,
      "sample_percent_of_total": 15.71
    }
  },
  "2024-03": {
    "stats": {
      "sample_total": 4436,
      "sample_invalid": 414,
      "total_documents": 4436,
      "total_estimated_invalid": 414,
      "sample_percent_of_total": 9.33
    }
  },
  "2024-04": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 346,
      "total_documents": 56627,
      "total_estimated_invalid": 3918.5884,
      "sample_percent_of_total": 8.83
    }
  },
  "2024-05": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 499,
      "total_documents": 116130,
      "total_estimated_invalid": 11589.774,
      "sample_percent_of_total": 4.31
    }
  },
  "2024-06": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 493,
      "total_documents": 61892,
      "total_estimated_invalid": 6102.5512,
      "sample_percent_of_total": 8.08
    }
  }
}
```
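For reference, here is a minimal sketch of how each month's estimates above are derived from the sample. The field names mirror the stats blocks; the function itself is illustrative, not part of the script below.

```python
def month_stats(sample_total, sample_invalid, total_documents):
    """Extrapolate a month's sampled out-of-sync rate to all of its documents."""
    invalid_rate = sample_invalid / sample_total
    return {
        "sample_total": sample_total,
        "sample_invalid": sample_invalid,
        "total_documents": total_documents,
        # estimated out-of-sync documents across the whole month
        "total_estimated_invalid": total_documents * invalid_rate,
        # share of the month's documents covered by the sample, in percent
        "sample_percent_of_total": round(sample_total / total_documents * 100, 2),
    }


# 2024-01: 657 of 5000 sampled documents were out of sync
print(month_stats(5000, 657, 84167))
```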
The script follows. You will need to set MONGO_CONNECTION_STRING either in a .env file or on the command line. The connection string should be read-only and can be generated from the Azure Cosmos DB dashboard.
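For example, a .env file might look like the following. All values here are placeholders, not real credentials; the exact connection-string format comes from the Cosmos DB dashboard.

```shell
# .env — placeholder values; copy the real read-only string from the Azure Cosmos DB dashboard
MONGO_CONNECTION_STRING="mongodb://<account-name>:<read-only-key>@<account-name>.mongo.cosmos.azure.com:10255/?ssl=true"
```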
script.py
```python
import json
import os
import urllib.parse

import pymongo
import requests
from dotenv import load_dotenv

load_dotenv()

MONGO_CONNECTION_STRING = str(os.environ.get("MONGO_CONNECTION_STRING"))
DB_NAME = "clearlydefined"
COLLECTION_NAME = "definitions-trimmed"
BASE_AZURE_BLOB_URL = (
    "https://clearlydefinedprod.blob.core.windows.net/production-definition"
)

# Example blob URL:
# https://clearlydefinedprod.blob.core.windows.net/production-definition/composer/packagist/00f100/fcphp-cache/revision/0.1.0.json

# Mongo document with unused fields removed:
# {
#   "_id": "composer/packagist/00f100/fcphp-cache/0.1.0",
#   "_meta": {
#     "schemaVersion": "1.6.1",
#     "updated": "2019-08-29T02:06:54.498Z"
#   },
#   "coordinates": {
#     "type": "composer",
#     "provider": "packagist",
#     "namespace": "00f100",
#     "name": "fcphp-cache",
#     "revision": "0.1.0"
#   },
#   "licensed": {
#     "declared": "MIT",
#     "toolScore": { "total": 17, "declared": 0, "discovered": 2, "consistency": 0, "spdx": 0, "texts": 15 },
#     "score": { "total": 17, "declared": 0, "discovered": 2, "consistency": 0, "spdx": 0, "texts": 15 }
#   }
# }


def fetch_blob(base_url, type, provider, namespace, name, revision):
    """Fetch the definition blob from Azure blob storage."""
    # URL-encode the path (needed for %2f characters)
    path = urllib.parse.quote(
        f"{type}/{provider}/{namespace}/{name}/revision/{revision}.json".lower()
    )
    url = f"{base_url}/{path}"
    res = requests.get(url)
    if res.status_code != 200:
        return {}
    return res.json()


def dump_data(data, filename):
    with open(filename, "w") as f:
        json.dump(data, f)


client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
db = client[DB_NAME]
if DB_NAME not in client.list_database_names():
    print(f"Database '{DB_NAME}' not found.")
else:
    print(f"Using database: '{DB_NAME}'.")

collection = db[COLLECTION_NAME]
if COLLECTION_NAME not in db.list_collection_names():
    print(f"Collection '{COLLECTION_NAME}' not found.")
else:
    print(f"Using collection: '{COLLECTION_NAME}'.")

months = ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"]
invalid_data = {}

for month in months:
    query = {
        "_meta.updated": {"$gte": f"{month}-01", "$lte": f"{month}-31"},
        "licensed.declared": {"$exists": False},
    }
    docs = collection.find(query, max_time_ms=10000000).limit(5000)
    doc_count = collection.count_documents(query, maxTimeMS=10000000)

    invalid_data[month] = {"stats": {"sample_total": 0, "sample_invalid": 0}}
    count = 0
    for doc in docs:
        count += 1
        blob = fetch_blob(
            BASE_AZURE_BLOB_URL,
            doc["coordinates"]["type"],
            doc["coordinates"]["provider"],
            doc["coordinates"].get("namespace", "-"),
            doc["coordinates"]["name"],
            doc["coordinates"]["revision"],
        )
        db_licensed = doc.get("licensed", {})
        blob_licensed = blob.get("licensed", {})
        if db_licensed.get("declared") != blob_licensed.get("declared"):
            # Only record the licensed and _meta fields for mismatches
            invalid_data[month][doc["_id"]] = {
                "db": {
                    "licensed": db_licensed.get("declared"),
                    "_meta": doc.get("_meta", {}),
                },
                "blob": {
                    "licensed": blob_licensed.get("declared"),
                    "_meta": blob.get("_meta", {}),
                },
            }
        # Checkpoint in case the Mongo cursor dies mid-run
        if count % 100 == 0:
            print(
                f"Checkpoint: total number of invalid data: {len(invalid_data[month])}, "
                f"total items {count} ({len(invalid_data[month]) / count * 100}%)"
            )

    invalid_data[month]["stats"]["sample_total"] = count
    invalid_data[month]["stats"]["sample_invalid"] = len(invalid_data[month])
    invalid_data[month]["stats"]["total_documents"] = doc_count
    invalid_data[month]["stats"]["total_estimated_invalid"] = doc_count * (
        len(invalid_data[month]) / count
    )
    invalid_data[month]["stats"]["sample_percent_of_total"] = count / doc_count * 100
    dump_data(invalid_data, "2024-invalid_data.json")
    print("Done")
    print(
        f"Total number of invalid data: {len(invalid_data[month])}, "
        f"total items {count} ({len(invalid_data[month]) / count * 100}%)"
    )

dump_data(invalid_data, "2024-invalid_data.json")
```
requirements.txt
```
azure-core==1.30.2
azure-cosmos==4.7.0
certifi==2024.6.2
charset-normalizer==3.3.2
dnspython==2.6.1
idna==3.7
pymongo==4.7.3
python-dotenv==1.0.1
requests==2.32.3
six==1.16.0
typing_extensions==4.12.2
urllib3==2.2.1
```
The script can be run with the following commands:
```shell
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
python3 script.py
```
The script dumps a JSON file with the differences. I have attached the 2024 results for deeper dives into those coordinates: 2024-invalid-data.json
Thanks for providing the coordinates! After some digging, here are some potential reasons for the missing definitions in Cosmos DB.
The absence of certain definitions in the new database collection can be attributed to the EU service using the older process. These missing definitions were found in the older, less efficient collection.
Some definitions were logged by the EU service but failed to be written to the older collection due to errors. Some of the cases are as follows:
@ajhenry There is a repo for scripts: the tools repo. @elrayle Depending on the approach taken for backfilling data, some migration tools by @mpcen in the tools repo might be helpful.
@qtomlinson Thanks for looking deeper into the results.
My plan is to add the script as a tool we can use periodically to be sure the data is in sync. This is excellent work. Thanks @ajhenry
I ran the script that counts data that is out of sync between the DB and the production-definition container. Results:
Original results:
```json
{
  "2024-06-01 thru 2024-06-20": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 493,
      "percent_invalid": "9.86%",
      "total_documents": 61892,
      "total_estimated_invalid": 6103,
      "sample_percent_of_total": "8.08%"
    }
  }
}
```
Results after changing the service to use the new, more efficient approach for definition generation and updating the DB consistency setting from Session to Strong:
```json
{
  "2024-06-21 thru 2024-06-28": {
    "stats": {
      "sample_total": 5000,
      "sample_invalid": 1,
      "total_documents": 10690,
      "total_estimated_invalid": 2,
      "percent_invalid": "0.02%"
    }
  }
}
```
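For context, the post-change numbers work out as follows, a quick check using the same extrapolation as the monthly stats:

```python
sample_total = 5000
sample_invalid = 1
total_documents = 10690

rate = sample_invalid / sample_total        # 0.0002, i.e. 0.02% of the sample
estimated = round(total_documents * rate)   # about 2 documents month-wide
print(f"{rate * 100:.2f}% -> about {estimated} of {total_documents}")
# prints: 0.02% -> about 2 of 10690
```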
Status:
- Backfill is proceeding.
- Backfilling complete for Jan 2020 to Jul 2024. Impact: > 125k coordinates were out of sync and are now repaired.
Description
The Cosmos DB definitions do not always match the production-definition blobs. This leads to missing licenses in the UI search result list and in the changes-notifications definitions.
Attempts at reducing out of sync problems
Approach to improve changes-notifications
The change notification script uses the database to find recent changes and gather the definition for each changed coordinate. As the production-definition blobs are the source of truth, the script will be updated to use these blob store definitions as the source for definitions recorded in the changes-notifications blob store.
Backfilling data
A script will be needed to find DB entries with missing licenses, compare them to the production-definition blob, and update the DB definition as needed. The script used for exploration can likely be modified to perform the update task. This script can also be used to backfill the changes-notifications definitions, resyncing them with the production-definition blobs at the same time as the resync of the DB definitions. See https://github.com/clearlydefined/operations/issues/81 for more information.
NOTE: There may be other sync errors besides the missing license. It would be very expensive to touch every DB entry and compare it to the corresponding blob. Addressing just the entries missing licenses will have the biggest benefit at the lowest possible cost. The processing will still be extensive.
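A rough sketch of what the backfill update could look like. The helper name and the stand-in collection are hypothetical, for illustration only; a real tool would reuse the exploration script's Mongo client and fetch_blob.

```python
class FakeCollection:
    """Stand-in for a pymongo collection, for illustration only."""

    def __init__(self, docs):
        self.docs = {d["_id"]: d for d in docs}

    def update_one(self, query, update):
        self.docs[query["_id"]].update(update["$set"])


def backfill_document(collection, doc, blob):
    """Copy the blob's licensed section into the DB document when the declared
    license is out of sync. The production-definition blob is the source of truth."""
    blob_declared = blob.get("licensed", {}).get("declared")
    db_declared = doc.get("licensed", {}).get("declared")
    if blob_declared and blob_declared != db_declared:
        collection.update_one(
            {"_id": doc["_id"]},
            {"$set": {"licensed": blob["licensed"]}},
        )
        return True
    return False


# A DB entry missing licensed.declared, and its blob counterpart
db_doc = {"_id": "composer/packagist/00f100/fcphp-cache/0.1.0"}
blob = {"licensed": {"declared": "MIT"}}
coll = FakeCollection([db_doc])
print(backfill_document(coll, db_doc, blob))  # prints True; the entry is repaired
```

With a real pymongo collection the same `update_one` call applies, restricted to documents already identified as out of sync so the expensive full comparison is avoided.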
Related Work