dremio-professional-services / dremio-cloner


Dependency resolving causes an infinite loop on a valid VDS definition #35

Open jeff-99 opened 1 year ago

jeff-99 commented 1 year ago

In one of our systems, the following VDS causes an infinite loop during dependency resolution.

The VDS is named Staging.TOS.Container.Container and its query looks roughly like this:

WITH CONTAINER AS ( SELECT ... )
SELECT *
FROM CONTAINER
WHERE X = 1 

This raised a Python recursion depth exception while processing the VDS. Renaming the CTE as follows solved the issue:

WITH CONTAINER_BASE AS ( SELECT ... )
SELECT *
FROM CONTAINER_BASE
WHERE X = 1 

The initial query is perfectly valid, so IMO it should not cause a problem when syncing the script to source control.
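To illustrate one way such a loop could arise, here is a minimal sketch, not Cloner's actual code. It assumes a resolver that tries each referenced name against the VDS's sqlContext, so the CTE alias CONTAINER resolves back to Staging.TOS.Container.Container itself. The catalog layout and the names `resolve` and `collect_dependencies` are hypothetical. Skipping CTE aliases and tracking already-visited datasets both stop the recursion.

```python
def resolve(ref, sql_context, catalog):
    """Resolve a possibly relative name to a catalog key (lower-cased dotted path).

    Assumption: context-relative lookup is tried before the absolute path.
    """
    parts = [p.lower() for p in ref.split(".")]
    ctx = [p.lower() for p in sql_context]
    for candidate in (ctx + parts, parts):
        key = ".".join(candidate)
        if key in catalog:
            return key
    return None


def collect_dependencies(key, catalog, seen=None):
    """Return the dependencies of `key`, parents first, without looping."""
    seen = set() if seen is None else seen
    if key in seen:
        return []  # already visited: breaks self-references and cycles
    seen.add(key)

    entry = catalog[key]
    ctes = {name.lower() for name in entry.get("ctes", [])}
    deps = []
    for ref in entry.get("refs", []):
        if ref.lower() in ctes:
            continue  # a CTE alias is local to the query, not a catalog object
        target = resolve(ref, entry.get("context", []), catalog)
        if target is not None and target not in seen:
            deps.extend(collect_dependencies(target, catalog, seen))
            deps.append(target)
    return deps


# Hypothetical catalog mirroring the VDS from this issue.
catalog = {
    "staging.tos.container.container": {
        "context": ["Staging", "TOS", "Container"],
        "refs": ["CONTAINER", "sys.nodes"],
        "ctes": ["CONTAINER"],
    },
    "sys.nodes": {"context": [], "refs": [], "ctes": []},
}

deps = collect_dependencies("staging.tos.container.container", catalog)
```

Even if the CTE list is unavailable, the `seen` set alone keeps the self-reference from recursing indefinitely.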

mxmarg commented 1 year ago

Hi Jeff, I tried to reproduce the behaviour, but in my case Cloner was able to read the VDS successfully, which included getting the parent table as a dependency:

    "vds": [
        {
            "accessControlList": {},
            "entityType": "dataset",
            "fields": [...],
            "id": "3c070243-5e52-43f4-b448-b656b314872d",
            "owner": {...},
            "path": [
                "Staging",
                "TOS",
                "Container",
                "Container"
            ],
            "sql": "WITH CONTAINER AS ( SELECT * FROM sys.nodes )\nSELECT *\nFROM CONTAINER\nWHERE name = 'node'",
            "sqlContext": [
                "Staging",
                "TOS",
                "Container"
            ],
            "type": "VIRTUAL_DATASET"
        }
    ]

Can you be more specific about the circumstances in which this error happened (e.g. during read or write) and provide a stack trace?

datocrats-org commented 9 months ago

I would suggest verifying the sqlContext under which the query was written, and always using the full dataset path in the SQL to avoid ambiguity. If the table name (container) appears twice in the dataset's path, it can resolve to different objects at different locations, depending on what the sqlContext was when the SQL was written.

For example, anything.container.container and anything.container would both match the name container.

It would be a good feature for Cloner to automatically rewrite each dataset name in the SQL as its fully qualified path and to nullify the sqlContext in all cloned VDSs. This would help avoid name conflicts during a clone. I think Cloner would need to detect the relative location of each dataset to do this.
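A rough sketch of what that rewrite could look like, under several assumptions: this is not an existing Cloner feature, `fully_qualify_vds` and `qualify` are hypothetical names, the list of catalog paths is supplied by the caller, context-relative names take precedence over absolute ones, and identifier quoting is ignored. It uses regular expressions rather than a real SQL parser, so it only handles simple FROM/JOIN references.

```python
import re

# Simple patterns, not a SQL parser: table references after FROM/JOIN,
# and CTE aliases of the form "<name> AS (".
REF = re.compile(r"\b(FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
CTE = re.compile(r"\b(\w+)\s+AS\s*\(", re.IGNORECASE)


def qualify(ref, sql_context, known):
    """Return the full dotted path for `ref`, or `ref` unchanged if unknown."""
    parts = ref.split(".")
    for candidate in (list(sql_context) + parts, parts):
        hit = known.get(tuple(p.lower() for p in candidate))
        if hit is not None:
            return ".".join(hit)
    return ref


def fully_qualify_vds(vds, catalog_paths):
    """Return a copy of `vds` with qualified references and no sqlContext."""
    known = {tuple(p.lower() for p in path): path for path in catalog_paths}
    ctes = {name.lower() for name in CTE.findall(vds["sql"])}
    context = vds.get("sqlContext") or []

    def replace(match):
        ref = match.group(2)
        if ref.lower() in ctes:
            return match.group(0)  # leave CTE aliases untouched
        prefix = match.group(0)[: match.start(2) - match.start(0)]
        return prefix + qualify(ref, context, known)

    # Assumption: "nullify" means setting sqlContext to None.
    return {**vds, "sql": REF.sub(replace, vds["sql"]), "sqlContext": None}


# Hypothetical example: "Orders" is relative to the sqlContext.
vds = {
    "sql": "WITH CONTAINER AS ( SELECT * FROM Orders )\nSELECT * FROM CONTAINER",
    "sqlContext": ["Staging", "TOS", "Container"],
}
paths = [["Staging", "TOS", "Container", "Orders"], ["sys", "nodes"]]
rewritten = fully_qualify_vds(vds, paths)
```

In this example the relative reference Orders becomes Staging.TOS.Container.Orders, while the CTE alias CONTAINER is left as written.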