datajoint / datajoint-python

Relational data pipelines for the science lab
https://datajoint.com/docs
GNU Lesser General Public License v2.1

Support meta information for `filepath` attributes. #792

Open ttngu207 opened 4 years ago

ttngu207 commented 4 years ago

An example use case is working with NWB files more elegantly. For a particular NWB object, we need to store two things: object_id (varchar(36)) and nwb_file (filepath@store). Currently dj.AttributeAdapter does not support this, so a workaround is to use longblob and store a tuple:

class NWBObjectAdapter(dj.AttributeAdapter):
    attribute_type = 'longblob'
    # attribute_type = "('varchar(36)', 'filepath@store')"

    def put(self, nwbobj):
        # take any arbitrary NWB object and extract a tuple of: (object_id, nwb_filepath)
        nwb_fp = nwbobj.container_source
        obj_id = nwbobj.obj_id_to_store  # new addon field to the nwbobj to indicate which object to store
        return obj_id, nwb_fp  # this is the tuple that is stored in DB

    def get(self, stored_tuple):
        obj_id, nwb_fp = stored_tuple
        io = pynwb.NWBHDF5IO(nwb_fp, mode='r')
        nwbf = io.read()
        return nwbf.objects[obj_id]

However, this workaround does not support the filepath@store type, which is crucial for working with NWB objects.

ttngu207 commented 4 years ago

Proposed solution 1: a special feature for filepath@store (and potentially attach@store) that allows meta information to be attached to it.

Example of how that might look:

@schema
class NWBRaw(dj.Manual):
    definition = """
    -> Session
    ---
    nwbfile: filepath@store
    """
NWBRaw.insert1({**session_key, 'nwbfile': (nwb_filepath, {'object_id': obj_uuid})})
fp, meta = (NWBRaw & session_key).fetch1('nwbfile', fetch_meta=True)

Example dj.AttributeAdapter for NWB object with this feature:

class NWBObjectAdapter(dj.AttributeAdapter):
    attribute_type = 'filepath@store'
    def put(self, nwbobj):
        nwb_fp = pathlib.Path(nwbobj.container_source)
        obj_id = nwbobj.obj_id_to_store  
        return nwb_fp, dict(object_id=obj_id)
    def get(self, filepath):  # returned as a tuple: (filepath, meta)
        nwb_fp, meta_dict = filepath
        io = pynwb.NWBHDF5IO(nwb_fp.as_posix(), mode='r')  # use the unpacked path, not the (path, meta) tuple
        nwbf = io.read()
        return nwbf.objects[meta_dict['object_id']]

dimitri-yatsenko commented 4 years ago

The fetch_meta argument in fetch may be unnecessary. If the user inserts the filepath with metadata, it will come back with metadata as a tuple. That would be consistent and intuitive: you always fetch what you insert.

eywalker commented 4 years ago

Nah, that just won't be a reasonable interface as just having one entry with meta can disrupt it. We do need to have a clean separation for when meta is returned vs not.

dimitri-yatsenko commented 4 years ago

The separation is clean. If you insert a tuple, you fetch it back. It's simple, does not need to be explained. Users get back what they insert. If they choose to insert some records with metadata and some without, that's what they will get back too — straightforward and transparent.

eywalker commented 4 years ago

It would be much nicer to know in advance whether you are going to get tuples or a list of strings corresponding precisely to the filepaths. Providing meta should really be optional, with no chance of disrupting the main usage of getting the filepath back.
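A minimal sketch of the concern, in plain Python with no DataJoint calls: if fetch returned whatever was inserted, a table where only some rows carry metadata would yield a mixed list, and every consumer would have to branch on the entry type. The `fetched` list and the `split_entry` helper are hypothetical and only illustrate the shape of the data.

```python
from pathlib import Path

# Hypothetical result of a "fetch what you insert" call on a table where
# only the second row was inserted with metadata.
fetched = [
    Path("/data/session1.nwb"),
    (Path("/data/session2.nwb"), {"object_id": "abc-123"}),
]


def split_entry(entry):
    """Normalize an entry to (path, meta); meta is None when absent."""
    if isinstance(entry, tuple):
        path, meta = entry
        return path, meta
    return entry, None


# Every caller that only wants the paths needs this extra normalization step.
paths = [split_entry(entry)[0] for entry in fetched]
```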

eywalker commented 4 years ago

There is a clear separation between actual data and metadata, and I find it completely consistent that we treat them separately. Let's proceed with fetch_meta based behavior and discuss further as we see the examples.

dimitri-yatsenko commented 4 years ago

Agreed. Yes, the option of skipping the metadata will be helpful.

dimitri-yatsenko commented 4 years ago

Perhaps by default, fetch_meta=None, which means fetch whatever you inserted. fetch_meta=True returns tuples always. fetch_meta=False returns the paths only.
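The three-way behavior could be sketched roughly as follows. This is plain Python, not DataJoint code: `apply_fetch_meta` is a hypothetical helper, and it assumes the stored value is a path plus an optional metadata dict (None when the row was inserted without metadata).

```python
def apply_fetch_meta(path, meta=None, fetch_meta=None):
    """Decide what a fetch would return for one stored filepath entry.

    fetch_meta=None  -> return whatever was inserted
    fetch_meta=True  -> always return a (path, meta) tuple
    fetch_meta=False -> always return the path only
    """
    if fetch_meta is None:
        # Mirror the insert: a bare path if no metadata was stored.
        return path if meta is None else (path, meta)
    if fetch_meta:
        # meta may be None here if the row was inserted without metadata.
        return (path, meta)
    return path


# Example: a row inserted with metadata, fetched in each mode.
meta = {"object_id": "abc-123"}
as_inserted = apply_fetch_meta("session.nwb", meta)
always_tuple = apply_fetch_meta("session.nwb", None, fetch_meta=True)
path_only = apply_fetch_meta("session.nwb", meta, fetch_meta=False)
```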

eywalker commented 4 years ago

Hm, potentially. Although I'd really think it's enough to offer True/False behavior defaulting to False.

dimitri-yatsenko commented 4 years ago

Then this would introduce the inconsistency that you insert one thing and fetch another. The default behavior needs to be most consistent.

eywalker commented 4 years ago

You are inserting metadata along with the data, and for that to be treated differently sounds just fine to me. It's not quite the same situation as inserting a tuple and expecting tuple back for a blob.

dimitri-yatsenko commented 4 years ago

Is there a good reason to treat metadata differently? It's all just data. Special behaviors require extra documentation and explanations. Fetching what is inserted is consistent behavior through all other cases. If the user does not like it, they will look for the feature to skip the metadata.

dimitri-yatsenko commented 4 years ago

Here is a more complete example using the custom data type for NWB objects.

class NWBTrace(dj.AttributeAdapter):
    """
    custom datajoint attribute type for NWB objects in NWB files
    """

    attribute_type = 'filepath@store'

    def put(self, nwbobj):
        nwb_path = nwbobj.container_source
        return nwb_path, nwbobj.trace_id_to_store

    def get(self, filepath):  # returned as a tuple: (filepath, meta)
        nwb_path, object_id = filepath
        # look the object up by its id, as in the earlier adapter example
        return pynwb.NWBHDF5IO(nwb_path, mode='r').read().objects[object_id]

nwb_trace = NWBTrace()

@schema
class Ephys(dj.Manual):
    definition = """
    -> Session
    ---
    trace: <nwb_trace>    
    """

...

Ephys.insert1({**session_key, 'trace': (nwb_filepath, obj_uuid)})
trace = (Ephys & session_key).fetch1('trace')