datajoint / datajoint-python

Relational data pipelines for the science lab
https://datajoint.com/docs
GNU Lesser General Public License v2.1
169 stars 85 forks source link

Construct a Diagram from multiple schemas #1190

Closed ethho closed 1 month ago

ethho commented 1 month ago

Feature Request

Problem

Currently, Diagram supports construction from a single source: Table, Schema, or module with a schema attribute. The use case is: when there are multiple connected (with foreign key references between their tables) schemas, and the user wishes to resolve (e.g. to view) these references. However, a user cannot construct a single Diagram from multiple schemas. The workaround

diagrams = [dj.Diagram(schema) for schema in schema_list]
full_diagram: dj.Diagram = sum(diagrams[1::], diagrams[0])

Is not efficient, since it scales linearly with respect the number of schemas (let this be $n$). This is because each Diagram constructor call loads Dependencies:

https://github.com/datajoint/datajoint-python/blob/77b75e9b11feaf35ce98e9f0afbf9289f33e08a2/datajoint/diagram.py#L103

which makes 2 queries to INFORMATION_SCHEMA tables to fetch primary and foreign keys:

https://github.com/datajoint/datajoint-python/blob/77b75e9b11feaf35ce98e9f0afbf9289f33e08a2/datajoint/dependencies.py#L99-L135

So $2n$ queries to INFORMATION_SCHEMA are made.

Requirements

Allow construction of a Diagram from multiple schema names. Construction from multiple schemas should be efficient, making $2 = O(1)$ queries (primary key and foreign key query) to INFORMATION_SCHEMA.

Justification

Adding support for this feature would enable users to efficiently view inter-schema references, even for large numbers of schemas.

Alternative Considerations

In my application, I workaround this by creating a mock class that we can pass to the Diagram constructor to get this behavior:

class AbstractDependencies(object):
    """
    Lightweight subclass of dj.connection.Dependencies that is meant
    to be passed to dj.Diagram constructor.
    """

    def __init__(self, schema_names: list[str], deps: dj.connection.Dependencies):
        super().__init__()

        if not schema_names:
            raise NotImplementedError("No schemas passed")
        # Set attributes for dj.Diagram.__init__
        self.schema_names = schema_names
        self.connection = deps._conn
        self.context = dict()

        # Edit Dependencies object in-place
        deps.clear()
        schema_dict: dict[str, dj.VirtualModule] = {
            schema: dj.VirtualModule(schema, schema, connection=self.connection)
            for schema in self.schema_names
        }
        deps._conn.schemas = schema_dict
        deps.load(force=True)
        assert deps._loaded, "Dependencies not loaded"
        self.nodes = deps.nodes
        self.database = self.schema_names[0]

    def to_diagram(self) -> dj.Diagram:
        diagram = dj.Diagram(self, self.context)
        for node in self.nodes:
            if any(
                node.startswith("`%s`" % database) for database in self.schema_names
            ):
                diagram.nodes_to_show.add(node)
        return diagram

def get_diagram_from_schemas(schemas: list[str]) -> dj.Diagram:
    deps: dj.connection.Dependencies = connection.dependencies
    abs_deps = AbstractDependencies(schemas, deps)
    return abs_deps.to_diagram()

This is $O(1)$ but hacky.

Environment

dimitri-yatsenko commented 1 month ago

dependencies.load() is a lazy function. If dependencies are already loaded, it returns immediately. It loads all the dependencies for all active schemas.

dimitri-yatsenko commented 1 month ago

You can join diagrams as diag1 + diag2. If you need to join a sequence of diagrams, you can use reduce(lambda a, b: a + b, diagram_list).

dimitri-yatsenko commented 1 month ago

After reviewing with @ethho, we cleared the disconnect. This feature is unnecessary. Closing.