**hagenw** opened 1 year ago
In principle, we could also speed up loading of media files by storing, in the flavor of the corresponding version in the cache, only the version in which the media file was created or changed, instead of copying the data to every version. We would only have to add the correct paths when we update the index with the full paths.
It's not so nice, but for databases with lots of different versions our current approach is also not that nice.
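As a minimal sketch of the idea, assuming a hypothetical `deps` mapping from each media file to the version in which it was created or changed (names and layout are assumptions, not audb's actual API):

```python
import os

# Hypothetical sketch: instead of copying media into every version folder,
# store only the version in which each file was last created/changed and
# resolve the full path when updating the index.
CACHE_ROOT = os.path.join("cache", "mydb")

def media_path(file: str, deps: dict) -> str:
    """Return the shared cache path of a media file."""
    version = deps[file]  # version in which the file was created/changed
    return os.path.join(CACHE_ROOT, version, "media", file)

# a file referenced from version 2.0.0 may still live in the 1.0.0 folder
deps = {"audio/001.wav": "1.0.0", "audio/002.wav": "2.0.0"}
path = media_path("audio/001.wav", deps)
```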
I thought about this at some point, but it will add dependencies that are hard to resolve. E.g. when version `2.0.0` references files from `1.0.0` and then someone deletes `1.0.0`, some media will be missing.
When media is missing it will be downloaded again.
When using the data, you will also run into a problem the moment somebody else deletes the cache.
> When media is missing it will be downloaded again.
But that means we have to check every time we load a database which files are missing. Very costly.
> When media is missing it will be downloaded again.
>
> But that means we have to check every time we load a database which files are missing. Very costly.
I would only check if the version folder is present in cache. If yes, we assume that nobody deleted data.
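The cheap check could be sketched like this (folder layout and function name are assumptions, not audb's actual implementation):

```python
import os

def referenced_versions_present(cache_root: str, versions: list) -> bool:
    """Assume the cache is intact if every referenced version folder exists,
    instead of verifying every single media file."""
    return all(os.path.isdir(os.path.join(cache_root, v)) for v in versions)
```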
At least I don't see why it should be different to now, you can also delete a single file in a flavor. If we check for all files in this case it should take the same time.
> At least I don't see why it should be different to now, you can also delete a single file in a flavor. If we check for all files in this case it should take the same time.
A user is not supposed to delete a single file from a flavor. But deleting whole versions is ok.
Another disadvantage is that the file paths might be different between caches on different machines.
> But deleting whole versions is ok.
Yes, that's why I said I would just check if the version folder exists, which should be fine as long as we have <100,000 versions of the database.
> Another disadvantage is that the file paths might be different between caches on different machines.
This sounds like a problem only if you use different caching folders together, which you might be able to do with the shared cache. So maybe we have to check how to handle this.
But I would anyway first address the table issue described here, as this slows down data publication and is more pressing in my opinion.
I would not care so much about data publication as you only do it once.
Once per version I mean :)
Yes and no. If you have a growing database with lots of versions, you will experience it. The other case is when testing the script, e.g. you will need to start commenting out `audb.load_to()` as it takes ages even when more or less no data was changed.
> The other case is when testing the script, e.g. you will need to start commenting out `audb.load_to()` as it takes ages even when more or less no data was changed.
Yes, that's indeed a pain.
The problem for tables was mitigated before because we used a shared build folder. Now that we have switched to using `only_metadata=True` and dedicated build folders, it means you have to wait up to 15 minutes even when just fixing a typo in the description of the database.
Not saying that it was better before; there you had to wait 30 minutes until all MD5SUMs for the media files were calculated ;)
> The problem for tables was mitigated before because we used a shared build folder, now that we have switched to use `only_metadata=True` and dedicated build folders it means you have to wait up to 15 minutes even when just fixing a typo in the description of the database.
Could it help to work with two folders, e.g. `cache/` and `build/`?
It would not help, as `load_to()` would still not share the cache with `load()`, which means if you have loaded a big database with `load()` it can still not copy the tables from the cache and has to store them again.
Also not if you share `cache/` across versions as we did with the `build/` folder before?
It would not help when you load it for the first time, but when you publish a new version where you only change the description it should still save time.
Loading the second time etc. would help, but why add a second cache folder instead of using the existing one?
In my opinion `load_to()`, `load_table()` and `load()` could all use the same cache for the tables, and for this I would store it outside of the flavor folder under the version the table was created/changed in. `load()` can then copy the table from there to the flavor folder, `load_table()` can directly load from there, and `load_to()` can also copy from there.
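A rough sketch of that shared table cache, with hypothetical helper names and a hypothetical `db.<table>.csv` file naming, not audb's actual implementation:

```python
import os
import shutil

def table_cache_path(cache_root: str, name: str, version: str, table: str) -> str:
    """Shared, flavor-free location of a cached table (assumed layout)."""
    return os.path.join(cache_root, name, version, f"db.{table}.csv")

def copy_table_to_flavor(cache_root, name, version, flavor, table):
    """What load() would do; load_table() could read the source directly,
    and load_to() could copy to its build folder the same way."""
    src = table_cache_path(cache_root, name, version, table)
    dst_dir = os.path.join(cache_root, name, version, flavor)
    os.makedirs(dst_dir, exist_ok=True)
    dst = os.path.join(dst_dir, os.path.basename(src))
    shutil.copy2(src, dst)
    return dst
```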
Sure, just wanted to mention a possible workaround until this feature is available.
An easy workaround is to start using a shared build folder again (`../build`) and just delete media files inside it if some were added.
> `load()` can then copy the table from there to the flavor folder, `load_table()` can directly load from there, and `load_to()` can also copy from there.
Instead of copying we could also create a symbolic link. But probably for tables copying is fast enough. For media, however, symbolic links might offer a solution to share media files across versions without the need to change the file path in the tables.
The question is, of course, whether Windows supports symbolic links.
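For illustration, a link-or-copy helper (a sketch, not audb code): on Windows, `os.symlink` can raise `OSError` unless the process has the symlink privilege or Developer Mode is enabled, so a copy fallback is needed:

```python
import os
import shutil

def link_or_copy(src: str, dst: str) -> None:
    """Share a media file across versions via a symlink; fall back to copying."""
    if os.path.dirname(dst):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        # link to the absolute path so the link survives cwd changes
        os.symlink(os.path.abspath(src), dst)
    except (OSError, NotImplementedError):
        # e.g. Windows without symlink privilege
        shutil.copy2(src, dst)
```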
Currently, the following steps are taken when loading tables:

`load_to()`

- `_find_tables()`: returns a list of tables that don't have a CSV file in `db_root`
- `_get_tables()`: removes PKL files for requested tables; loads requested tables from the backend and stores them as CSV files in `db_root`
- `_save_database()`: loads each table by reading the CSV file; stores the table as a CSV file by overwriting the existing CSV file; stores the table as a PKL file

`load_table()`, using `cache/name/version`

- `_cached_versions()`: loads `deps` for every version of the table it finds in cache
- `_get_tables_from_cache()`: (I do not understand completely what's going on, but MD5SUMs are calculated to check if a table from cache can be used). Copies the PKL and CSV files if the table can be found in cache
- `_get_tables_from_backend()`: gets the CSV file from the backend for the table, and loads it from the CSV file. Stores the table as a PKL file.

`load()`, using `cache/name/version/flavor`

- The same as `load_table()`, but in the end we update the indices of the tables.

I would propose the following changes to speed up loading the tables (besides fixing some obvious bugs in `load_to()`, like storing the table twice as CSV and loading the table twice, or calculating MD5SUMs too often as described in https://github.com/audeering/audb/issues/226):

- store tables in the cache under `cache/name/version`, whereas the version is extracted from `deps`, so basically only storing a table under the version it was added/modified to the database
- `load_to()` can then copy tables from the cache as well
- in `load()` we could copy from the cache `cache/name/version` to the required flavor folder `cache/name/version/flavor`; maybe even only copy the PKL file
- `load_table()` can directly use the new cache as no flavor is needed
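The proposed lookup could be sketched like this, assuming a hypothetical `deps` mapping from table name to the version it was last added/modified in (the real audb dependency table differs):

```python
import os

def cached_table_path(cache_root: str, name: str, table: str, deps: dict) -> str:
    """Resolve the shared (flavor-free) cache path of a table, so that
    load(), load_table() and load_to() all point at the same file."""
    version = deps[table]  # version the table was added/modified in
    return os.path.join(cache_root, name, version, f"db.{table}.pkl")

deps = {"files": "1.0.0", "speaker": "1.1.0"}
path = cached_table_path("cache", "mydb", "speaker", deps)
```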