datajoint / datajoint-python

Relational data pipelines for the science lab
https://datajoint.com/docs
GNU Lesser General Public License v2.1

Convenience functions to insert / fetch when an attach field is in table definition #1156

Open MaxFBurg opened 8 months ago

MaxFBurg commented 8 months ago

Feature Request

Problem

When inserting into a table that has a field result : attach@minio, the table's insert method expects a file path. Similarly, fetch stores a file and returns a file path. This is often inconvenient, because (i) the data saved in the file is needed as an object in the Python script being executed, and (ii) the saved / downloaded files remain on local storage even after the script has terminated.

Requirements

Possible solution: Introduce a parameter to insert that automatically saves the data to be inserted to a temporary file, inserts that file into the table, and then removes it. Similarly, fetch could download the file, load its contents, and return the loaded data to the Python script instead of a file path.
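The save / insert / remove cycle proposed above can be sketched with a small context manager. This is only an illustration of the idea, not part of DataJoint: the names staged_pickle and load_and_discard are hypothetical, and pickle is assumed as the serialization format.

```python
import os
import pickle
import tempfile
from contextlib import contextmanager


@contextmanager
def staged_pickle(obj, filename="payload.pkl"):
    """Yield the path of a temporary pickle of `obj`; delete it on exit.

    Hypothetical helper, not a DataJoint API.
    """
    with tempfile.TemporaryDirectory() as temp_dir:
        path = os.path.join(temp_dir, filename)
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        # The file only exists while the caller is inside the `with` block.
        yield path


def load_and_discard(path):
    """Load a pickled object from `path`, then remove the file.

    Hypothetical helper for the fetch side of the proposal.
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)
    os.remove(path)
    return obj
```

On the insert side this would be used roughly as `with staged_pickle(data) as path: MyTable.insert1({**key, "result": path})`, where key and data are placeholders; the insert has to happen inside the block, because the temporary file is gone once the block exits.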

Justification

See the Problem section above.

Alternative Considerations

Currently I am using an AttachMixin as a workaround, i.e. my table would be defined as class MyTable(AttachMixin, dj.Computed). The mixin could be the code basis for the feature I suggested, although it would need a little bit of improvement.

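import os
import pickle
import tempfile
import uuid
from typing import Any, Dict, Iterable, List, Optional, Union

import datajoint as dj
import numpy as np


class AttachMixin:

    def attach_insert(self, keys: Iterable[Dict[str, Any]], attach_keys: List[str]) -> None:
        if not isinstance(attach_keys, list):
            raise ValueError("attach_keys must be a list")

        # Work on shallow copies so the caller's dicts are not overwritten with paths.
        rows = [dict(key) for key in keys]

        with tempfile.TemporaryDirectory(dir=os.environ.get("TMP", ".")) as temp_dir:
            for row in rows:
                for ak in attach_keys:
                    # uuid4 stands in for the create_random_str helper, which is not shown here.
                    path = os.path.join(temp_dir, uuid.uuid4().hex + ".pkl")
                    with open(path, "wb") as f:
                        pickle.dump(row[ak], f)
                    row[ak] = path

            # Insert while the temporary files still exist; they are removed on exit.
            self.insert(rows)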

    def attach_insert1(self, key: Dict[str, Any], attach_keys: Iterable[str]) -> None:
        self.attach_insert([key], attach_keys)

    def attach_fetch(
        self,
        *attrs: str,
        key: Optional[Dict[str, Any]] = None,
        **kwargs,
    ) -> Union[Dict[str, Any], List]:
        key = key or {}

        with tempfile.TemporaryDirectory(dir=os.environ.get("TMP", ".")) as temp_dir:
            ret = (self & key).fetch(*attrs, download_path=temp_dir, **kwargs)  # array, list[dict]

            if isinstance(ret, dict):
                ret = self._load_from_dict(ret)

            # Check for a single path before the Iterable branch: str is itself Iterable.
            elif self._is_pkl_path(ret):
                with open(ret, "rb") as f:
                    ret = pickle.load(f)

            elif isinstance(ret, Iterable) and not isinstance(ret, str):
                # Object dtype so that loaded Python objects can replace the path strings.
                ret = np.array(ret, dtype=object)

                for i, value in enumerate(ret):
                    if isinstance(value, dict):
                        ret[i] = self._load_from_dict(value)

                    elif self._is_pkl_path(value):
                        with open(value, "rb") as f:
                            ret[i] = pickle.load(f)

                    else:
                        raise NotImplementedError(f"Value {value} is not a dict or a pkl path")

            else:
                raise NotImplementedError(f"Return value {ret} is not a dict, Iterable, or a pkl path")

        return ret

    def attach_fetch1(
        self,
        *attrs: str,
        key: Optional[Dict[str, Any]] = None,
        **kwargs,
    ) -> Union[Dict[str, Any], List]:
        ret = self.attach_fetch(*attrs, key=key, **kwargs)
        # Also reject an empty result, which would otherwise fail with an IndexError below.
        if len(ret) != 1:
            raise dj.DataJointError(f"fetch1 should return exactly one tuple. {len(ret)} tuples were found")
        return ret[0]

    def _load_from_dict(self, d: dict[str, str]) -> dict[str, Any]:
        for key, value in d.items():
            if self._is_pkl_path(value):
                with open(value, "rb") as f:
                    d[key] = pickle.load(f)
        return d

    def _is_pkl_path(self, value):
        return (
            isinstance(value, str) and value.endswith(".pkl") and os.path.isfile(value)
        )

Related

These issues might be (loosely) related: https://github.com/datajoint/datajoint-python/issues/1109 https://github.com/datajoint/datajoint-python/issues/1099

If you think such a feature would be worth including in DataJoint, I would be happy to help implement it.

ttngu207 commented 7 months ago

I think you're suggesting some sort of user-provided functions that run on insert and on fetch for the attach type. This is very much the idea of DataJoint's AttributeAdapter feature - see here

With that feature, you can define a new DataJoint datatype (e.g. attach_pkl or something like that).
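As a rough sketch of what such an adapter might look like: an AttributeAdapter subclass declares the underlying attribute_type and implements put (called on insert) and get (called on fetch). The class name PickleAttachAdapter, the staging_dir parameter, and the choice of pickle are assumptions for illustration, and the attach@minio store name is taken from the issue text. The fallback to object is only there so the put / get logic can be tried without DataJoint installed.

```python
import os
import pickle
import tempfile
import uuid

try:
    import datajoint as dj
    _AdapterBase = dj.AttributeAdapter
except ImportError:
    # Fallback so the sketch can be exercised without DataJoint installed.
    _AdapterBase = object


class PickleAttachAdapter(_AdapterBase):
    """Hypothetical adapter: pickles objects into attachment files and back."""

    # Underlying DataJoint type; the store name is the one used in the issue.
    attribute_type = "attach@minio"

    def __init__(self, staging_dir=None):
        # Directory for files awaiting upload (assumed parameter, not a DataJoint option).
        self.staging_dir = staging_dir or tempfile.mkdtemp(prefix="dj_attach_")

    def put(self, obj):
        """Called on insert: write `obj` to a file and return its path."""
        path = os.path.join(self.staging_dir, uuid.uuid4().hex + ".pkl")
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return path

    def get(self, path):
        """Called on fetch: load the object from the downloaded file."""
        with open(path, "rb") as f:
            return pickle.load(f)


# The instance name is what a table definition would reference, e.g. `result : <attach_pkl>`.
attach_pkl = PickleAttachAdapter()
```

Note that this sketch does not clean up the staged files: put has to leave the file in place until the insert has uploaded it, so removing the staging directory afterwards is left to the caller. Depending on the DataJoint version, adapted types may also need to be enabled explicitly before use.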

See some examples here: