Inconsistent number of references if update job is run for multiple tags at once (#292)

The main reason I can see is this timeline:

Start indexing of defs in version N;
Done indexing of defs in version N;
Start indexing of defs in version N+1;
Start indexing of refs in version N.

For refs in version N, we take the tokenizer output, and any token that matches a def in the database is considered a ref. Since that lookup can also see defs already indexed from version N+1, we might count some of those as refs, depending on timing.
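To make the timing dependency concrete, here is a minimal sketch. It is not the actual update.py code: `find_refs`, `defs_db` and the token names are hypothetical, and a plain set stands in for the definitions database.

```python
# Hypothetical stand-in for the definitions database: a set of identifiers.
def find_refs(tokens, defs_db):
    """Return the tokens that match any known def, regardless of version."""
    return [tok for tok in tokens if tok in defs_db]

# Tokenizer output for some blob in version N (made-up identifiers).
tokens_n = ["foo", "bar", "baz"]

# Defs of version N are fully indexed; version N+1 has not added anything yet.
defs_db = {"foo"}
refs_early = find_refs(tokens_n, defs_db)

# Def indexing of version N+1 has progressed and added "bar".
defs_db.add("bar")
refs_late = find_refs(tokens_n, defs_db)

print(refs_early)  # ['foo']
print(refs_late)   # ['foo', 'bar']
```

The same blob yields a different number of refs depending on how far the N+1 def indexing got, which would explain the inconsistent counts.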

Fixing this would require having a list of defs in version N available when parsing refs of version N, that is, all defs that come from blobs in version N. This can be expensive to compute. Note that this is not just the list of new blobs in version N (those that we just parsed): it can include blobs that were parsed long ago and are still part of version N.
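A rough sketch of what that could look like, under the assumption that we can map a version to its blobs and a blob to its defs. The names `blobs_by_version`, `defs_by_blob` and `defs_in_version` are hypothetical and do not reflect the real database layout.

```python
# Hypothetical data: blobs per version, and defs per blob.
blobs_by_version = {
    "N": {"blob1", "blob2"},      # blob1 was parsed long ago, still part of N
    "N+1": {"blob1", "blob3"},    # blob3 is new in N+1
}
defs_by_blob = {
    "blob1": {"foo"},
    "blob2": {"baz"},
    "blob3": {"bar"},
}

def defs_in_version(version):
    """Union of the defs from every blob in the version, old or new."""
    defs = set()
    for blob in blobs_by_version[version]:
        defs |= defs_by_blob.get(blob, set())
    return defs

def find_refs(tokens, version):
    """Only tokens matching a def that belongs to this version count as refs."""
    allowed = defs_in_version(version)
    return [tok for tok in tokens if tok in allowed]

print(find_refs(["foo", "bar", "baz"], "N"))  # ['foo', 'baz']
```

Here "bar" is never reported for version N, no matter when version N+1 gets indexed. The cost is the union over all blobs of the version, which is the expensive part mentioned above.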

I plan on addressing this as part of #289. This means, however, that the outputs of the old update.py and the new one won't be exactly identical, which was a property I was trying to keep for easy testing.
