clearlydefined / service

The service side of clearlydefined.io

Changes delivery misses some packages #1005

Open RomanIakovlev opened 9 months ago

RomanIakovlev commented 9 months ago

I have one example of a package which is available from the ClearlyDefined API, but wasn't delivered via the changes notification mechanism into Azure Blob Storage.

Here's the package I'm talking about: https://clearlydefined.io/definitions/maven/mavencentral/org.eclipse.jetty/jetty-servlets/11.0.15

It was harvested in May 2023, but it's not contained in any of the changeset files under https://clearlydefinedprod.blob.core.windows.net/production-snapshots/changes/year-month-date-hour.
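
A spot check per hour could look roughly like this (a minimal sketch, assuming Node 18+ with global fetch and that each hourly changeset blob is a newline-separated list of changed coordinates; the real format may differ):

```javascript
// Rough spot check, not the actual tooling. Assumes each hourly changeset blob is a
// newline-separated list of changed coordinates -- the real format may differ.
const BASE = 'https://clearlydefinedprod.blob.core.windows.net/production-snapshots/changes';

async function changesetContains(hour, coordinates) {
  const response = await fetch(`${BASE}/${hour}`);
  if (!response.ok) throw new Error(`failed to fetch changeset ${hour}: ${response.status}`);
  const body = await response.text();
  return body.split('\n').some(line => line.includes(coordinates));
}

// Hour chosen arbitrarily within the May 2023 harvest window.
changesetContains('2023-05-15-12', 'maven/mavencentral/org.eclipse.jetty/jetty-servlets/11.0.15')
  .then(found => console.log(found ? 'found' : 'not in this changeset'))
  .catch(console.error);
```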

DragosDanielBoia commented 8 months ago

@RomanIakovlev will look into this. Do you know the scale of this issue? I'm trying to figure out whether it affects only a few packages or happens more frequently.

RomanIakovlev commented 8 months ago

I think the scale of the problem is ~5M packages. Counting all the records in CosmosDB gives me ~37M entries, while the data published in the storage container has ~32M records. I don't yet have a list of differences between these two datasets, but I will try to build one; maybe that will shed some light on the issue.

DragosDanielBoia commented 8 months ago

@RomanIakovlev will look by the end of next week

RomanIakovlev commented 8 months ago

I've calculated the diff between what's in the CD database and what was exported. Here's the csv file with those ids: https://clearlydefineddevbackup.blob.core.windows.net/missingids/part-00000-6049046a-0d57-4e78-b5d2-ad514664a53d-c000.csv.
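
For reference, a minimal sketch of how such a diff can be built (the actual list was produced offline, and the part-00000-* file name hints at a Spark job, which is more practical at this scale; the file names below are hypothetical):

```javascript
// Sketch only: compute the set difference between ids in CosmosDB and ids already published.
// Assumes two newline-separated coordinate dumps; file names are hypothetical.
const fs = require('fs');

function loadIds(path) {
  return new Set(fs.readFileSync(path, 'utf8').split('\n').filter(Boolean));
}

const dbIds = loadIds('cosmosdb-ids.txt');        // coordinates exported from CosmosDB (~37M)
const publishedIds = loadIds('published-ids.txt'); // coordinates seen in the /changes container (~32M)

const missing = [...dbIds].filter(id => !publishedIds.has(id));
fs.writeFileSync('missing-ids.csv', missing.join('\n'));
console.log(`missing ids: ${missing.length}`);
```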

At a cursory glance, the missing ids include all package types in roughly their usual proportions, so the problem is not specific to a certain package type (a sketch of how this breakdown can be produced follows the table):

| package count | package type |
| ---: | --- |
| 1257 | composer |
| 6615 | crate |
| 119 | deb |
| 52 | debsrc |
| 3834 | gem |
| 220256 | git |
| 424208 | go |
| 378957 | maven |
| 2946643 | npm |
| 565641 | nuget |
| 122 | pod |
| 316395 | pypi |
| 135685 | sourcearchive |
| 4999784 | TOTAL |
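
The breakdown above can be produced roughly like this (a sketch assuming one coordinate per line in the csv; the file name is hypothetical):

```javascript
// Group missing coordinates by their first path segment, i.e. the package type.
const fs = require('fs');

const counts = {};
for (const line of fs.readFileSync('missing-ids.csv', 'utf8').split('\n')) {
  if (!line.trim()) continue;
  const type = line.split('/')[0];
  counts[type] = (counts[type] || 0) + 1;
}
console.table(counts);
```
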
RomanIakovlev commented 8 months ago

I've done some more investigation and here's what I think is going on. First, after gathering all the missing package ids, I made a small modification to the publish changes job to run the publish process only for those missing ids. It all went through just fine; however, I observed CosmosDB read timeouts along the way.

Those timeouts, as well as other I/O errors, are not handled in the code (unless one considers an empty catch block to be error handling), and when they happen, the process just exits. There is also no retry mechanism and no logging.

My current guess is that the publish process ran into problems like those described above and then simply continued from the next changeset (hour), so all the unprocessed changes from the previous changeset (hour) were ignored; that's how we ended up with 5M missing packages. I'm still not 100% sure this is the case, but I don't see indications of any other problem. The small modification mentioned above was to add a CosmosDB read retry, and with it I was able to process all the missing packages. I presume this would be a sufficient fix to get rid of the problem.
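
Roughly, the retry looks like the sketch below (a minimal sketch, not the exact change; the usage line is hypothetical):

```javascript
// Retry transient CosmosDB read failures with exponential backoff instead of letting the process die.
async function withRetry(operation, { attempts = 5, baseDelayMs = 1000 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= attempts) throw error;
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      console.error(`CosmosDB read failed (attempt ${attempt}/${attempts}), retrying in ${delayMs}ms:`, error.message);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Hypothetical usage inside the publish job, e.g. around a query page fetch:
// const page = await withRetry(() => iterator.fetchNext());
```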

However, I think it's a good time to make this job more robust. I propose we do several further things to that end:

  1. Make sure it runs in Azure
  2. Implement retries
  3. Implement error logging and monitoring (see the sketch after this list)
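
For number 3, per-changeset error handling and logging could look roughly like this (a sketch with hypothetical names, not the current code):

```javascript
// Log and surface failed hours instead of silently skipping their changes.
async function publishAllChanges(hours) {
  const failedHours = [];
  for (const hour of hours) {
    try {
      await publishChangesForHour(hour);
    } catch (error) {
      console.error(`Publishing changes for ${hour} failed:`, error);
      failedHours.push(hour);
    }
  }
  if (failedHours.length > 0) {
    // Surface the failures to monitoring so these hours can be retried later.
    throw new Error(`Publishing failed for hours: ${failedHours.join(', ')}`);
  }
}

async function publishChangesForHour(hour) {
  // Placeholder for the existing per-hour publish logic.
}
```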

Number 1 would be quite a lot of work (we'll probably have to build a Docker container, implement some CI, e.g. GitHub Actions, to build and publish it, implement secret handling for connection strings, etc.). I will start working on it; however, @DragosDanielBoia please let me know if you think this should be done differently, or if you have a preference for how it should run in Azure.

And after those changes are made, we will probably just nuke the /changes directory and restart the publishing process from scratch, in the hope that it will produce those missing packages.

DragosDanielBoia commented 8 months ago

@RomanIakovlev I think all of this sounds good, thanks for doing this.

qtomlinson commented 8 months ago

@RomanIakovlev The ClearlyDefined search API can be used to confirm that the component is in CosmosDB. maven/mavencentral/org.eclipse.jetty/jetty-servlets/11.0.15 is indeed in the DB. Could the DB timeouts be related to the DB throughput provisioning? Just a thought.
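
For anyone wanting to double-check a coordinate, a quick call against the public definitions endpoint could look roughly like this (Node 18+ assumed; the search endpoint may be more convenient for pattern queries, and the fields printed are assumptions about the definition shape):

```javascript
// Fetch a single definition by coordinates from the public ClearlyDefined API.
const coordinates = 'maven/mavencentral/org.eclipse.jetty/jetty-servlets/11.0.15';

fetch(`https://api.clearlydefined.io/definitions/${coordinates}`)
  .then(response => response.json())
  .then(definition => console.log(definition.coordinates, definition.described?.releaseDate))
  .catch(console.error);
```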