Closed stevevanhooser closed 2 years ago
Action items as discussed:
Add database.getPreference(name, value), database.setPreference(name) methods
Add sqlitedb preferences for:
In sqlitedb constructor, check the preferences and delete stale cached files accordingly
In sqlitedb.do_add_doc()
Create subclass did.file.readonly_fileobj < did.file.fileobj that:
Implement sqlitedb.openbinarydoc():
Implement sqlitedb.closebinarydoc():
Steve will upload sample did doc object(s) with the isLocal and urlLocation fields
Some revision to that section, as I write an add_file
method to did.document
:
"files": {
"file_list": [
"filename1.ext",
"filename2.ext"
],
"file_info": [
{
"name": "filename1.ext",
"locations": [
{
"delete_original": 1,
"uid": "41268bfa11faa5ff_c0ddad560858299e",
"location": "C:\\Users\\username\\123456789.ext1",
"parameters": "",
"location_type": "file",
"ingest": 1
},
{
"delete_original": 0,
"uid": "41268bfa12287f64_40dbbdf6582271b2",
"location": "https://myserver.com/123456789.ext1",
"parameters": "",
"location_type": "url",
"ingest": 0
}
]
},
{
"name": "filename2.ext2",
"locations": [
{
"delete_original": 1,
"uid": "41268bfa13b7992b_40d079ab70659331",
"location": "C:\\Users\\username\\987654321.ext2",
"parameters": "",
"location_type": "file",
"ingest": 1
},
{
"delete_original": 0,
"uid": "41268bfa13d2bc22_c0bb8d41c21210cf",
"location": "https://myserver.com/987654321.ext2",
"parameters": "",
"location_type": "url",
"ingest": 0
}
]
}
]
}
Comments: it makes more sense to put the uid with the file's representation on the disk or media rather than at the top level with the name (because the uid is to make it possible to store the files without hierarchy and allowing different documents to use the same "name").
ingest
is a 0/1 value that indicates whether the database should ingest the document (copy it internally). This is done at the time of did.database.add_doc()
. delete_original
indicates whether the database should delete the original. Again, this is done at the time of did.database.add_doc()
.
file_list
allows the document type to specify which files it is allowed to have (there should be a set list by type).
Here's an example document, but you'll need to create the files yourself (or modify the document_properties.file.file_info(j).locations(1).location field for j=1 and 2)
I'll work on making a demo/test that actually creates example files but this can get you started.
new test code on the branch:
doc = did.test.test_did_file_document()
creates an example document with 2 files
Your sample document includes 2 files that are cached locally. Why 2 files? are they exact replicas of one another? if not, then which of these should the database open when you call openbinarydoc()?
It’s not something that someone would do often (or maybe ever), but it is just for testing. The behavior of the program should be to store both files. It helps to test whether each entry has a unique ID (if they didn’t, and shared an ID, then the two file names would collide and that would be an error). Not a typical use case.
On Oct 31, 2022, at 1:53 PM, Yair Altman @.***> wrote:
Your sample document includes 2 files that are cached locally. Why 2 files? are they exact replicas of one another? if not, then which of these should the database open when you call openbinarydoc()?
— Reply to this email directly, view it on GitHub https://github.com/VH-Lab/DID-matlab/issues/31#issuecomment-1297458319, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOPDOONVFF5P5FTFUA6ITWGABSBANCNFSM6AAAAAARHSUL64. You are receiving this because you authored the thread.
The latest commit should enable you to test the open_doc() functionality.
Sample test code:
doc = did.test.test_did_file_document();
db = did.implementations.sqlitedb('test.sqlite');
%winopen test.sqlite %for debugging
db.add_branch('root');
db.add_doc(doc);
file_obj = db.open_doc('41268c13967e26b4_c0d504560ca6769c') %or: db.open_doc(doc)
The returned file_obj
is a wrapper object of class did.file.readonly_fileobj
Notes:
did_globals.path.filecachepath
by defaultHi Yair -
Thanks for this. I have a few comments
In the new work we are creating, each document should be able to have more than one binary portion. filename1.ext
and filename2.ext
are two files associated with the example document.
Therefore, open_doc
needs to have both the document and the name of the binary file to open. For example:
file_obj = db.open_doc(doc, 'filename1.ext');
Then open_doc
looks up the reference number of filename1.ext, and returns the file_obj. The following would be possible:
file_obj1 = db.open_doc(doc, 'filename1.ext');
file_obj2 = db.open_doc(doc, 'filename2.ext');
It should look through the various definitions of 'filename1.ext', and return a file object to a local file if possible, or, if not possible, to download the remote version.
I see that the way you implemented it so far is that it thinks there are two versions of the same file. Instead, there is a single local file for each, and a redundant web link (that doesn't work, just for example). I see now that I misunderstood your earlier question. I thought I had given two local files for one of these (like filename2.ext
), which could happen. I'm not sure that should be an error (kind of an odd use case, would only be for redundancy I guess). I had toyed with this in creating the example but I didn't include it in the end I see.
The test code works for me, but needs to be modified to support the multiple binary files per document as above please.
Thanks! Steve
Fixed. open_doc()
now accepts an optional 2nd input argument of filename. Note that in case there are multiple files, only the first matching cached file is returned (e.g if you don't specify the optional filename
parameter and several files are cached, only one of them will be returned).
Nice job, I modified did.test.test_did_file_document.m to include checking the contents of the files and it seems to work.
I think the only thing that should be changed is that I think the user should provide the filename they are trying to read from (the name in the file_list, not necessarily the name on disk). Right now it is optional, but if there is more than one file in the file_list then it is vague which one would be opened, right? I think it should be an error not to provide it. Even if there is only one item in file_list at a certain time, if an application does not request a specific file and the definition of the class changes later to have more than one, errors would occur.
Thanks! Steve
I think that it would be more user-friendly to keep the filename optional, and in the case of 2+ files either return an array of file objects, or generate an error to force the user to specify the desired filename. But for the common case of just a single file, it would be more user-friendly not to have to specify the filename. Let me know what you decide.
I think we should have the user specify; I agree it is less convenient for an expert, but it makes it more reliable that the person is really reading the file they think they are. The end coders (scientists) will have a variety of experience levels, and I can see a person reading the wrong file by accident (and not figuring it out for awhile).
If we returned an array, we'd have to tell them which file was which somehow (it could follow the file_list order).
So, I think it should be specified, for reliability at the expense of convenience.
Done
Wonderful, tests pass, closing issue
We need to make two changes to the database architecture to support binary files for documents.
In the previous treatment, one could open the single allowed binary file for a database document that could be opened method
openbinarydoc
(which gave read/write access) and then closed withclosebinarydoc
. We'd like to replace this formulation for a few reasons.spike_times.bin
andspike_waveforms.bin
; right now, these have to be in separate documents, which might lead to unnecessary document proliferation.)To allow all these features, we will add a new reserved JSON field that may be included (but is not required) called
files
.It will be organized as follows:
Some comments.
For the local sql database (and most local databases), if a document to be added to the database has a file URL, then, upon input, it must have an absolute path. When the database adds the file, the database should give the file the unique identifier as its filename and relabel the URL to refer to the new filename. Changing the name to a unique identifier is necessary because the database generally stores its files in a single local directory without hierarchies, and because different documents could store files of the same name. It should also refer to the file by a relative url, and in the context of the database, the relative location is the location where the local database will store its files.
The methods
openbinarydoc
andclosebinarydoc
will need to be changed.open_binary_doc
andclose_binary_doc
.close_binary_doc
also takes the file object itself as an input so it knows which file to close. (Maybe this method isn't necessary now that they are only read only? The user could just call fileobj.close() )? The binary file name here is the "name" field. The unique identifier and actual file location that the database uses can be hidden from the user; the user just calls the ith file bydocument_properties.files(i).name
.did.globals()
) to see if the file is stored locally (as unique_identifer.bin). If it is not, then the file is downloaded and saved to the cache directory as unique_identifier.bin. The local fileobj to the downloaded file is returned fromopen_binary_doc
.