VH-Lab / DID-matlab

Data Interface Database
Other
1 stars 1 forks source link

need to modify file methods #31

Closed stevevanhooser closed 2 years ago

stevevanhooser commented 2 years ago

We need to make two changes to the database architecture to support binary files for documents.

In the previous treatment, one could open the single allowed binary file for a database document that could be opened method openbinarydoc (which gave read/write access) and then closed with closebinarydoc. We'd like to replace this formulation for a few reasons.

  1. Opening the binary doc for read/write access after the accompanying document is added to the database violates the more recent concept of immutability of the documents. The dependency and branching concepts both depend on documents being immutable once created. We should only allow read access to the binary items once the document is added to the database.
  2. Having a single binary file per document is limiting. If one wanted to store two pieces of data, one would have to interdigitate them into the binary file. We want to add the ability to add arbitrary binary files. (For example, one document might have spike_times.bin and spike_waveforms.bin; right now, these have to be in separate documents, which might lead to unnecessary document proliferation.)
  3. We want to be ready to add cloud-based databases, so we want to add the ability to store files locally, to grab files via https, potentially to grab from other services, and to offer redundant files (in case one server is down).

To allow all these features, we will add a new reserved JSON field that may be included (but is not required) called files.

It will be organized as follows:

"files": [
   {
       "name":   "filename1.ext",
       "type":     "a comment about the file type",
       "comment": "any comment",
       "uid": "a unique identifier here"
       "locations": [
            {
                  "location_type":  "url",
                  "location":     "url, such as file://123456789.ext1",
                  "parameters": "a string to indicate any parameters to access the resource"
            },
            {
                  "location_type":  "url":,
                  "location":     "redundant url, such as https://myserver.com/123456789.ext1",
                  "parameters": "a string to indicate any parameters to access the resource"
            }
      ]
  },
   {
       "name":   "filename2.ext",
       "type":     "a comment about the file type",
       "comment": "any comment"
       "locations": [
            {
                  "location_type":  "url",
                  "location":     "url, such as file://987654321.ext1",
                  "parameters": "a string to indicate any parameters to access the resource"
            },
            {
                  "location_type":  "url",
                  "location":     "redundant url, such as https://myserver.com/987654321.ext1",
                  "parameters": "a string to indicate any parameters to access the resource"
            }
      ]
  }
]

Some comments.

For the local sql database (and most local databases), if a document to be added to the database has a file URL, then, upon input, it must have an absolute path. When the database adds the file, the database should give the file the unique identifier as its filename and relabel the URL to refer to the new filename. Changing the name to a unique identifier is necessary because the database generally stores its files in a single local directory without hierarchies, and because different documents could store files of the same name. It should also refer to the file by a relative url, and in the context of the database, the relative location is the location where the local database will store its files.

The methods openbinarydoc and closebinarydoc will need to be changed.

  1. We should change their names so they match the underscore style: open_binary_doc and close_binary_doc.
  2. We need to change the input arguments so that they are a) the database object and b) the binary file name to be requested. close_binary_doc also takes the file object itself as an input so it knows which file to close. (Maybe this method isn't necessary now that they are only read only? The user could just call fileobj.close() )? The binary file name here is the "name" field. The unique identifier and actual file location that the database uses can be hidden from the user; the user just calls the ith file by document_properties.files(i).name.
  3. For now, the only other location we will implement is a url. If there is a url, then when open_binary_doc is called, the database should check a cache directory (maybe we can define this in did.globals()) to see if the file is stored locally (as unique_identifer.bin). If it is not, then the file is downloaded and saved to the cache directory as unique_identifier.bin. The local fileobj to the downloaded file is returned from open_binary_doc.
altmany commented 2 years ago

Action items as discussed:

  1. Add database.getPreference(name, value), database.setPreference(name) methods

  2. Add sqlitedb preferences for:

    • remote folder (default: same as sqlite file's folder)
    • cache folder (default: tempdir)
    • cache duration in [days] (default: 1.0); 0 means upon next sqlitedb open, inf means never
    • cache max number of files (default: inf); cached files deleted based on LRU
  3. In sqlitedb constructor, check the preferences and delete stale cached files accordingly

  4. In sqlitedb.do_add_doc()

    • if the new doc contains a local file, copy it to prefs.remote_folder with a filename of doc_id
    • store the remote file location in DB
  5. Create subclass did.file.readonly_fileobj < did.file.fileobj that:

    • sets fopen permission to 'r'
    • croaks on fwrite()
  6. Implement sqlitedb.openbinarydoc():

    • if doc is a remote file, download local copy to the cache folder
    • return a did.file.readonly_fileobj wrapper object to the local file
  7. Implement sqlitedb.closebinarydoc():

    • close the file handle, if open (error otherwise)
    • delete cached file (if any)
  8. Steve will upload sample did doc object(s) with the isLocal and urlLocation fields

stevevanhooser commented 2 years ago

Some revision to that section, as I write an add_file method to did.document:

"files": {
  "file_list": [
    "filename1.ext",
    "filename2.ext"
  ],
  "file_info": [
    {
      "name": "filename1.ext",
      "locations": [
        {
          "delete_original": 1,
          "uid": "41268bfa11faa5ff_c0ddad560858299e",
          "location": "C:\\Users\\username\\123456789.ext1",
          "parameters": "",
          "location_type": "file",
          "ingest": 1
        },
        {
          "delete_original": 0,
          "uid": "41268bfa12287f64_40dbbdf6582271b2",
          "location": "https://myserver.com/123456789.ext1",
          "parameters": "",
          "location_type": "url",
          "ingest": 0
        }
      ]
    },
    {
      "name": "filename2.ext2",
      "locations": [
        {
          "delete_original": 1,
          "uid": "41268bfa13b7992b_40d079ab70659331",
          "location": "C:\\Users\\username\\987654321.ext2",
          "parameters": "",
          "location_type": "file",
          "ingest": 1
        },
        {
          "delete_original": 0,
          "uid": "41268bfa13d2bc22_c0bb8d41c21210cf",
          "location": "https://myserver.com/987654321.ext2",
          "parameters": "",
          "location_type": "url",
          "ingest": 0
        }
      ]
    }
  ]
}

Comments: it makes more sense to put the uid with the file's representation on the disk or media rather than at the top level with the name (because the uid is to make it possible to store the files without hierarchy and allowing different documents to use the same "name").

ingest is a 0/1 value that indicates whether the database should ingest the document (copy it internally). This is done at the time of did.database.add_doc(). delete_original indicates whether the database should delete the original. Again, this is done at the time of did.database.add_doc().

file_list allows the document type to specify which files it is allowed to have (there should be a set list by type).

stevevanhooser commented 2 years ago

file_document.mat.zip

Here's an example document, but you'll need to create the files yourself (or modify the document_properties.file.file_info(j).locations(1).location field for j=1 and 2)

I'll work on making a demo/test that actually creates example files but this can get you started.

stevevanhooser commented 2 years ago

new test code on the branch:

doc = did.test.test_did_file_document()

creates an example document with 2 files

altmany commented 2 years ago

Your sample document includes 2 files that are cached locally. Why 2 files? are they exact replicas of one another? if not, then which of these should the database open when you call openbinarydoc()?

stevevanhooser commented 2 years ago

It’s not something that someone would do often (or maybe ever), but it is just for testing. The behavior of the program should be to store both files. It helps to test whether each entry has a unique ID (if they didn’t, and shared an ID, then the two file names would collide and that would be an error). Not a typical use case.

On Oct 31, 2022, at 1:53 PM, Yair Altman @.***> wrote:

Your sample document includes 2 files that are cached locally. Why 2 files? are they exact replicas of one another? if not, then which of these should the database open when you call openbinarydoc()?

— Reply to this email directly, view it on GitHub https://github.com/VH-Lab/DID-matlab/issues/31#issuecomment-1297458319, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAOOPDOONVFF5P5FTFUA6ITWGABSBANCNFSM6AAAAAARHSUL64. You are receiving this because you authored the thread.

altmany commented 2 years ago

The latest commit should enable you to test the open_doc() functionality.

Sample test code:

doc = did.test.test_did_file_document();
db = did.implementations.sqlitedb('test.sqlite');
%winopen test.sqlite   %for debugging
db.add_branch('root');
db.add_doc(doc);
file_obj = db.open_doc('41268c13967e26b4_c0d504560ca6769c')   %or: db.open_doc(doc)

The returned file_obj is a wrapper object of class did.file.readonly_fileobj

Notes:

  1. the local files are cached in the folder specified by did_globals.path.filecachepath by default
  2. cached files are not currently deleted - we need to discuss if/how to implement this cleanup
stevevanhooser commented 2 years ago

Hi Yair -

Thanks for this. I have a few comments

In the new work we are creating, each document should be able to have more than one binary portion. filename1.ext and filename2.ext are two files associated with the example document.

Therefore, open_doc needs to have both the document and the name of the binary file to open. For example:

file_obj = db.open_doc(doc, 'filename1.ext');

Then open_doc looks up the reference number of filename1.ext, and returns the file_obj. The following would be possible:

file_obj1 = db.open_doc(doc, 'filename1.ext');
file_obj2 = db.open_doc(doc, 'filename2.ext');

It should look through the various definitions of 'filename1.ext', and return a file object to a local file if possible, or, if not possible, to download the remote version.

I see that the way you implemented it so far is that it thinks there are two versions of the same file. Instead, there is a single local file for each, and a redundant web link (that doesn't work, just for example). I see now that I misunderstood your earlier question. I thought I had given two local files for one of these (like filename2.ext), which could happen. I'm not sure that should be an error (kind of an odd use case, would only be for redundancy I guess). I had toyed with this in creating the example but I didn't include it in the end I see.

The test code works for me, but needs to be modified to support the multiple binary files per document as above please.

Thanks! Steve

altmany commented 2 years ago

Fixed. open_doc() now accepts an optional 2nd input argument of filename. Note that in case there are multiple files, only the first matching cached file is returned (e.g if you don't specify the optional filename parameter and several files are cached, only one of them will be returned).

stevevanhooser commented 2 years ago

Nice job, I modified did.test.test_did_file_document.m to include checking the contents of the files and it seems to work.

I think the only thing that should be changed is that I think the user should provide the filename they are trying to read from (the name in the file_list, not necessarily the name on disk). Right now it is optional, but if there is more than one file in the file_list then it is vague which one would be opened, right? I think it should be an error not to provide it. Even if there is only one item in file_list at a certain time, if an application does not request a specific file and the definition of the class changes later to have more than one, errors would occur.

Thanks! Steve

altmany commented 2 years ago

I think that it would be more user-friendly to keep the filename optional, and in the case of 2+ files either return an array of file objects, or generate an error to force the user to specify the desired filename. But for the common case of just a single file, it would be more user-friendly not to have to specify the filename. Let me know what you decide.

stevevanhooser commented 2 years ago

I think we should have the user specify; I agree it is less convenient for an expert, but it makes it more reliable that the person is really reading the file they think they are. The end coders (scientists) will have a variety of experience levels, and I can see a person reading the wrong file by accident (and not figuring it out for awhile).

If we returned an array, we'd have to tell them which file was which somehow (it could follow the file_list order).

So, I think it should be specified, for reliability at the expense of convenience.

altmany commented 2 years ago

Done

stevevanhooser commented 2 years ago

Wonderful, tests pass, closing issue