Improve indexing performance (`update.py`)

tleb commented 1 week ago

Current index feels much slower than it ought to be. I am creating this issue to track work on this topic.

See this message by Franek for his observations (I/O wait bottleneck, database caching, OOM issues).

I've got proof-of-concepts for the following topics:

Rework of update.py. It is suboptimal. Having to take locks to write into databases makes everything really slow and (I believe) explains most of the performance issues (caused by IO wait). This can be confirmed by running the same commands without doing any processing on the output: it is much faster. I have a PoC for solving this, it does all database accesses in the main thread (and uses multiprocessing.Pool to spawn sub-processes).
Improve individual commands of script.sh. Some commands are more wasteful than needed.
- The sed(1) call in list-blobs is a big bottleneck for no specific reason. This won't be a massive time saver as we are talking about a second per tag.
- find-file-doc-comments.pl in parse-docs is really expensive. We could avoid calling it on files for which we know they cannot have any doc comment.

Those combined, for the first 5 Linux tags: I get wallclock/usr/sys 126s/1017s/395s versus 1009s/1341s/490s. For the old update.py, I passed my CPU count as argument ie 20.

Those changes will require a way to compare databases, see this message for reasoning behind. Solutions to this are either a custom Python script or a shell script that uses db_dump -p and diff, as recommended here.

There could however be other topics to improve performance. Are those worth it, that is the question. Probably not.

We might want to change the overall structure: calling into a shell script for each blob, spawning multiple processes, is not the fastest way to solve the problem. We could have script.sh commands take multiple blobs.
Or we could avoid script.sh and calls ctags or tokenize by ourselves.
We could change the database structure. Current database compresses well (14G becomes 5.2G after zstd -1), which means there is superfluous information. The value format could be optimized, possibly made binary.

tleb commented 1 week ago

Answering here to this comment:

Another thing about the update script, it often does lookups in the definitions and references database. I think having our own cache for definitions would improve speed.

I do not believe this to be an issue. Database is in cache, lookup is really fast. Small benchmarks agree on this.

And postponing key updates until a file is processed (in update_definitions and references) could probably help too.

This, yes! I've avoided using RefList and DefList that do loads of string concat which is harmful for performance. We need to remember we have a single process+thread that must handle all operations on the database. We cannot have it spend its CPU cycles to do string concat (in Python). This saturates memory management as well.

Apparently Berkeley DB supports batch inserts, but I'm not sure if it's possible to utilize that from the python wrapper unfortunately

I did not look into this, but I agree it might be useful. Batch insert could be useful. If you can reference your claim of it not being available from the Python wrapper it would be nice.

fstachura commented 1 week ago

If you can reference your claim of it not being available from the Python wrapper it would be nice.

I don't have a complete proof that bsddb does not support bulk operations.

Here are some examples of bulk operations in C with claims that bulk operations improve performance: https://docs.oracle.com/cd/E17276_01/html/programmer_reference/am_misc_bulk.html

I didn't read too much into it, but it seems that you have to use special macros to construct a bulk buffer. Which suggests, that this would need special handing in the wrapper.

DB_MULTIPLE is mentioned in docs of the put method https://docs.oracle.com/cd/E17276_01/html/api_reference/C/dbput.html

However, I don't see any mentions of bulk operations or related flags in bsddb docs https://pybsddb.sourceforge.net/bsddb3.html

The put method only works on strings or bytes too. I couldn't really find anything related to bulk operations in the source code, but again, I didn't look into it much. https://hg.jcea.es/pybsddb/file/6.2.9/Modules/_bsddb.c

bsddb was replaced by berkeleydb (same author, it seems). Maybe something changed, although I still don't see any mention of bulk operations in the docs https://pypi.org/project/berkeleydb/ https://docs.jcea.es/berkeleydb/latest/

tleb commented 1 week ago

I don't have a complete proof that bsddb does not support bulk operations.

Well the code link you provided says it all. As you said, it only accepts bytes objects. And grepping for DB_MULTIPLE gives no result (except in the TODO file). Thanks for the investigation!

bootlin / elixir

Improve indexing performance (`update.py`) #289