Open ethho opened 4 months ago
The first step is to design a consistent file folder structure that's prescribed by DataJoint, based on the schema design and primary key values. For the final solution, we need to consider the following:
To this end, we have considered several classes of algorithms for generating file paths from primary keys:
- `uuid = hash(schema + table + md5(contents))`
  - `Asdfkjb1234`
  - `Asdf/kjb/1234.mp4`
  - […] (`generate_uuid_from_pkey()`) to generate this unique key reproducibly, determine the file path, and fetch the file, without needing to query the database.
  - […] `ls`.
  - […] `generate_uuid_from_pkey()`. Assuming that this function is packaged in `datajoint-python`, the user would need to have `datajoint-python` installed to determine the file path.
- `/<schema-name>/<table-name>/<primary key attr 1>/<primary key attr 2>/< ... >/<last primary key attr>.mp4`
- `/<schema-name>/<table-name>/<primary key attr 1>/<primary key attr 2>/< ... >/<last primary key attr>-<md5(file contents)>.mp4`
  - […] `md5` vs other hashing algorithms is necessary.
  - `rsync` uses `md5` by default.
  - `md5` is not cryptographically secure, but it is fast and has a low collision rate.

cc: @dimitri-yatsenko
Feature Request
Problem Statement
While the database provides data structure, efficient queries, and transaction support, files are still preferred for storing large objects such as images, numerical arrays, movies, etc. Users like to have direct read-only access to the files without mediation by the database. Storing large objects in MySQL tables has adverse performance effects on data queries. DataJoint has previously implemented several approaches to address some aspects of this problem:
- `attach` and `attach@store` datatypes to store files, preserving the filename but not the folder structure
- `blob@store` datatype for storing serialized data structures in external files
- `filepath@store` datatype to allow organizing files and folders under users' control
- `AdaptedType` datatype that allows defining custom logic to apply for reading and writing.

In particular, the SpyGlass pipeline from Loren Frank's lab relied on the `filepath` and `AdaptedType` features to implement NWB file management. None of these methods simultaneously addresses the following desiderata:

- […] `datajoint-python` or DB access, and files should maintain their native file extensions and MIME types (as opposed to serializing into another format).

We need a solution for file management that simultaneously addresses all of these desiderata.