bootlin / elixir

The Elixir Cross Referencer

Inconsistent number of references if update job is run for multiple tags at once #292

Open fstachura opened 1 week ago

fstachura commented 1 week ago

To reproduce:

  1. Pull a git repository with more than one tag
  2. Run an update job twice, on an empty data directory each time
  3. Run this script on references.db from both data directories and compare the results between the two databases. The number of references for some identifiers should differ.
  4. Run this script on both databases with one of the identifiers that has a count difference and compare the results (sort before diffing). Entries for some files should be missing from one of the databases.

tleb commented 1 week ago

The main reason I can see is this timeline:

For refs in version N, we take the tokenizer output, and tokens that match a def in the database are considered to be refs. We might be catching some defs from version N+1; this depends on timing.
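
To illustrate the timing dependency, here is a minimal sketch. It is not the actual update code: `collect_refs`, the plain `set` standing in for the definitions database, and the identifier names are all hypothetical. It only shows that the same tokens can yield different refs depending on whether defs from version N+1 were stored before or after the refs of version N were computed.

```python
def collect_refs(tokens, defs_db):
    """Treat every token that matches a known definition as a reference."""
    return {tok for tok in tokens if tok in defs_db}


# Hypothetical data: version N defines `foo`; version N+1 adds `new_helper`.
defs_v_n = {"foo"}
defs_v_n1 = {"new_helper"}

# Tokens from a blob in version N that happens to contain the word `new_helper`.
tokens_v_n = ["foo", "new_helper", "bar"]

# Ordering A: refs of version N are computed before defs of N+1 are stored.
defs_db = set(defs_v_n)
refs_early = collect_refs(tokens_v_n, defs_db)

# Ordering B: defs of N+1 reach the shared database first.
defs_db = defs_v_n | defs_v_n1
refs_late = collect_refs(tokens_v_n, defs_db)

# The two orderings disagree about `new_helper`.
print(sorted(refs_early), sorted(refs_late))
```

With several tags processed at once, which ordering happens is not fixed, which would explain why two runs on empty data directories can disagree.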

Fixing this would require having a list of defs in version N available when parsing refs of version N. Those are all the defs that come from blobs in version N. This can be expensive to compute. Note that it is not a list of the new blobs in version N (those that we just parsed); it can include blobs that were parsed a long time ago and are still part of version N.
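
As a rough sketch of that idea, assuming a hypothetical in-memory mapping from blob id to its defs (the real data lives in the database, and none of these names come from update.py): the defs of version N are the union over every blob in version N, old or new, and refs are then filtered against that set only.

```python
def defs_for_version(version_blobs, defs_by_blob):
    """Union of the defs of every blob that is part of the version."""
    defs = set()
    for blob_id in version_blobs:
        defs |= defs_by_blob.get(blob_id, set())
    return defs


def refs_for_blob(tokens, version_defs):
    """Keep only tokens that match a def belonging to this version."""
    return {tok for tok in tokens if tok in version_defs}


# Hypothetical data: blob "a" was parsed long ago but is still in version N,
# blob "b" is new in version N, blob "c" only exists in version N+1.
defs_by_blob = {"a": {"foo"}, "b": {"bar"}, "c": {"new_helper"}}
blobs_in_v_n = {"a", "b"}

version_defs = defs_for_version(blobs_in_v_n, defs_by_blob)

# Tokens of the new blob "b"; `new_helper` is only defined in version N+1.
refs_b = refs_for_blob(["foo", "new_helper"], version_defs)
print(sorted(version_defs), sorted(refs_b))
```

The result no longer depends on what other versions have written to the database, at the cost of walking every blob of the version.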

I plan on addressing this as part of #289. This, however, means that the outputs of the old update.py and the new one won't be exactly identical. That was a property I attempted to keep for easy testing.