**hagenw** opened 1 year ago
In principle, we could also speed up loading of media files by storing, in the flavor of the corresponding version in the cache, only the version in which the media file was created or changed, instead of copying the data to every version. We would only have to add the correct paths when we update the index with the full paths.
It's not so nice, but for databases with lots of different versions our current approach is also not that nice.
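As a minimal sketch of the idea, assuming a hypothetical `deps` mapping from each media file to the version in which it was created or changed (names and layout are assumptions, not audb's actual API):

```python
import os

# Hypothetical sketch: instead of copying media into every version folder,
# store only the version in which each file was last created/changed and
# resolve the full path when updating the index.
CACHE_ROOT = os.path.join("cache", "mydb")

def media_path(file: str, deps: dict) -> str:
    """Return the shared cache path of a media file."""
    version = deps[file]  # version in which the file was created/changed
    return os.path.join(CACHE_ROOT, version, "media", file)

# a file referenced from version 2.0.0 may still live in the 1.0.0 folder
deps = {"audio/001.wav": "1.0.0", "audio/002.wav": "2.0.0"}
path = media_path("audio/001.wav", deps)
```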
I thought about this at some point, but it will add dependencies that are hard to resolve. E.g. when version `2.0.0` references files from `1.0.0` and then someone deletes `1.0.0`, some media will be missing.
When media is missing it will be downloaded again.
When using the data, you will also run into a problem the moment somebody else deletes the cache.
> When media is missing it will be downloaded again.
But that means we have to check every time we load a database which files are missing. Very costly.
> When media is missing it will be downloaded again.
>
> But that means we have to check every time we load a database which files are missing. Very costly.
I would only check if the version folder is present in cache. If yes, we assume that nobody deleted data.
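The cheap check could be sketched like this (folder layout and function name are assumptions, not audb's actual implementation):

```python
import os

def referenced_versions_present(cache_root: str, versions: list) -> bool:
    """Assume the cache is intact if every referenced version folder exists,
    instead of verifying every single media file."""
    return all(os.path.isdir(os.path.join(cache_root, v)) for v in versions)
```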
At least I don't see why it should be different to now, you can also delete a single file in a flavor. If we check for all files in this case it should take the same time.
> At least I don't see why it should be different to now, you can also delete a single file in a flavor. If we check for all files in this case it should take the same time.
A user is not supposed to delete a single file from a flavor. But deleting whole versions is ok.
Another disadvantage is that the file paths might be different between caches on different machines.
> But deleting whole versions is ok.
Yes, that's why I said I would just check if the version folder exists, which should be fine as long as we have <100,000 versions of the database.
> Another disadvantage is that the file paths might be different between caches on different machines.
This sounds like a problem only if you use different caching folders together, which you might be able to do with the shared cache. So maybe we have to check how to handle this.
But I would anyway first address the table issue described here, as this slows down data publication and is more pressing in my opinion.
I would not care so much about data publication as you only do it once.
Once per version I mean :)
Yes and no. If you have a growing database with lots of versions, you will experience it. The other case is when testing the script, e.g. you will need to start commenting out `audb.load_to()` as it takes ages even when more or less no data was changed.
> The other case is when testing the script, e.g. you will need to start commenting out `audb.load_to()` as it takes ages even when more or less no data was changed.
Yes, that's indeed a pain.
The problem for tables was mitigated before because we used a shared build folder. Now that we have switched to using `only_metadata=True` and dedicated build folders, it means you have to wait up to 15 minutes even when just fixing a typo in the description of the database.
Not saying that it was better before; there you had to wait 30 minutes until all MD5SUMs for the media files were calculated ;)
> The problem for tables was mitigated before because we used a shared build folder, now that we have switched to use `only_metadata=True` and dedicated build folders it means you have to wait up to 15 minutes even when just fixing a typo in the description of the database.
Could it help to work with two folders, e.g. `cache/` and `build/`?
It would not help, as `load_to()` would still not share the cache with `load()`, which means if you have loaded a big database with `load()` it can still not copy the tables from the cache and has to store them again.
Also not if you share `cache/` across versions as we did with the `build/` folder before?
It would not help when you load it for the first time, but when you publish a new version where you only change the description it should still save time.
Loading the second time etc. would help, but why add a second cache folder instead of using the existing one?
In my opinion `load_to()`, `load_table()` and `load()` could all use the same cache for the tables, and for this I would store it outside of the flavor folder under the version the table was created/changed in. `load()` can then copy the table from there to the flavor folder, `load_table()` can directly load from there, and `load_to()` can also copy from there.
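A rough sketch of that shared table cache, with hypothetical helper names and a hypothetical `db.<table>.csv` file naming, not audb's actual implementation:

```python
import os
import shutil

def table_cache_path(cache_root: str, name: str, version: str, table: str) -> str:
    """Shared, flavor-free location of a cached table (assumed layout)."""
    return os.path.join(cache_root, name, version, f"db.{table}.csv")

def copy_table_to_flavor(cache_root, name, version, flavor, table):
    """What load() would do; load_table() could read the source directly,
    and load_to() could copy to its build folder the same way."""
    src = table_cache_path(cache_root, name, version, table)
    dst_dir = os.path.join(cache_root, name, version, flavor)
    os.makedirs(dst_dir, exist_ok=True)
    dst = os.path.join(dst_dir, os.path.basename(src))
    shutil.copy2(src, dst)
    return dst
```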
Sure, just wanted to mention a possible workaround until this feature is available.
An easy workaround is to start using a shared build folder again (`../build`) and just delete media files inside it if some were added.
> `load()` can then copy the table from there to the flavor folder, `load_table()` can directly load from there, and `load_to()` can also copy from there.
Instead of copying we could also create a symbolic link. But probably for tables copying is fast enough. For media, however, symbolic links might offer a solution to share media files across versions without the need to change the file path in the tables.
The question is, of course, whether Windows supports symbolic links.
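For illustration, a link-or-copy helper (a sketch, not audb code): on Windows, `os.symlink` can raise `OSError` unless the process has the symlink privilege or Developer Mode is enabled, so a copy fallback is needed:

```python
import os
import shutil

def link_or_copy(src: str, dst: str) -> None:
    """Share a media file across versions via a symlink; fall back to copying."""
    if os.path.dirname(dst):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        # link to the absolute path so the link survives cwd changes
        os.symlink(os.path.abspath(src), dst)
    except (OSError, NotImplementedError):
        # e.g. Windows without symlink privilege
        shutil.copy2(src, dst)
```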
Currently, the following steps are taken when loading tables:

`load_to()`

- `_find_tables()`: returns a list of tables that don't have a CSV file in `db_root`
- `_get_tables()`: removes PKL files for requested tables; loads requested tables from the backend and stores them as CSV files in `db_root`
- `_save_database()`: loads each table by reading the CSV file; stores the table as a CSV file by overwriting the existing CSV file; stores the table as a PKL file

`load_table()`, using `cache/name/version`

- `_cached_versions()`: loads `deps` for every version of the table it finds in cache
- `_get_tables_from_cache()`: (I do not understand completely what's going on, but MD5SUMs are calculated to check if a table from cache can be used). Copies the PKL and CSV files if the table can be found in cache
- `_get_tables_from_backend()`: gets the CSV file from the backend for the table, and loads it from the CSV file. Stores the table as a PKL file.

`load()`, using `cache/name/version/flavor`

- The same as `load_table()`, but in the end we update the indices of the tables.

I would propose the following changes to speed up loading the tables (besides fixing some obvious bugs in `load_to()`, like storing the table twice as CSV and loading the table twice, or calculating MD5SUMs too often as described in https://github.com/audeering/audb/issues/226):

- store tables in the cache under `cache/name/version`, whereas the version is extracted from `deps`, so basically only storing a table under the version it was added/modified to the database
- `load_to()` can then copy tables from the cache as well
- in `load()` we could copy from the cache `cache/name/version` to the required flavor folder `cache/name/version/flavor`; maybe even only copy the PKL file
- `load_table()` can directly use the new cache as no flavor is needed
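The proposed lookup could be sketched like this, assuming a hypothetical `deps` mapping from table name to the version it was last added/modified in (the real audb dependency table differs):

```python
import os

def cached_table_path(cache_root: str, name: str, table: str, deps: dict) -> str:
    """Resolve the shared (flavor-free) cache path of a table, so that
    load(), load_table() and load_to() all point at the same file."""
    version = deps[table]  # version the table was added/modified in
    return os.path.join(cache_root, name, version, f"db.{table}.pkl")

deps = {"files": "1.0.0", "speaker": "1.1.0"}
path = cached_table_path("cache", "mydb", "speaker", deps)
```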