apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
61.96k stars 13.59k forks source link

Dataset get_or_create API fail to create a dataset with the same table_name but different schema #30377

Open luizcapu opened 2 days ago

luizcapu commented 2 days ago

Bug description

get_or_create Dataset API endpoint breaks with 500 - Internal Server error if there is any existing dataset with the same table_name, regardless of the dataset schema.

This is not the behaviour we observe on other endpoints, such as a POST /dataset from where users are able to create datasets with the same table_name, as long as they belong to either a different database_id or a different schema.

Furthermore, if there is already an existing dataset for a given table_name and a user is trying to create a new dataset with the same table_name but different schema, the API will return 200, which is a false positive.

How to reproduce the bug

Case 1 - False Positive

  1. Go to the datasets page
  2. Pick any existing dataset name and prepare a payload as follows (example using the users datasets)
{
  'table_name': 'users',
  'schema': 'other',
  'database_id': 1,
}
  1. Submit this payload via a POST request to /api/v1/dataset/get_or_create
  2. Note how the API will return with a 200 pointing to the existing public.users dataset ID. The new dataset is not created.

Case 2 - Internal Server Error

  1. Create 2 or more datasets with the same table_name and different schemas (either via UI or create dataset API)
  2. Try to create a new dataset. Again, with same table_name but a different schema. Payload example:
{
  'table_name': 'users',
  'schema': 'any_new_schema_name',
  'database_id': 1,
}
  1. Submit this payload via a POST request to /api/v1/dataset/get_or_create
  2. Note how the API will return with a 500 - Internal Server Error. The new dataset is not created.

Screenshots/recordings

No response

Superset version

master / latest-dev

Python version

3.9

Node version

16

Browser

Chrome

Additional context

No response

Checklist

dosubot[bot] commented 2 days ago

Hey @luizcapu! đź‘‹ I'm here to help with bugs, answer questions, and assist you while you wait for a human maintainer. Let's squash those bugs together!

The get_or_create Dataset API endpoint returns a 500 - Internal Server Error when trying to create a dataset with the same table_name but a different schema because the current implementation only checks for the existence of a table by its table_name and database_id. It does not account for differences in the schema. If a table with the same table_name already exists, it returns the existing table's ID without considering the schema, leading to potential conflicts or errors [1].

The relevant code snippet is:

@expose("/get_or_create/", methods=("POST",))
@protect()
@safe
@statsd_metrics
@event_logger.log_this_with_context(
    action=lambda self, *args, **kwargs: f"{self.__class__.__name__}"
    f".get_or_create_dataset",
    log_to_statsd=False,
)
def get_or_create_dataset(self) -> Response:
    """Retrieve a dataset by name, or create it if it does not exist.
    ---
    post:
      summary: Retrieve a table by name, or create it if it does not exist
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/GetOrCreateDatasetSchema'
      responses:
        200:
          description: The ID of the table
          content:
            application/json:
              schema:
                type: object
                properties:
                  result:
                    type: object
                    properties:
                      table_id:
                        type: integer
        400:
          $ref: '#/components/responses/400'
        401:
          $ref: '#/components/responses/401'
        422:
          $ref: '#/components/responses/422'
        500:
          $ref: '#/components/responses/500'
    """
    try:
        body = GetOrCreateDatasetSchema().load(request.json)
    except ValidationError as ex:
        return self.response(400, message=ex.messages)
    table_name = body["table_name"]
    database_id = body["database_id"]
    if table := DatasetDAO.get_table_by_name(database_id, table_name):
        return self.response(200, result={"table_id": table.id})

The code only checks for the existence of a table by table_name and database_id:

if table := DatasetDAO.get_table_by_name(database_id, table_name):
    return self.response(200, result={"table_id": table.id})

This logic does not consider the schema, which can lead to a 500 - Internal Server Error if there is a conflict or inconsistency when trying to create a dataset with the same table_name but a different schema [1].

Additionally, the test case test_create_dataset_same_name_different_schema in the integration tests demonstrates that creating datasets with the same table_name but different schema is a valid scenario, which the current implementation does not handle correctly [2].

To continue talking to Dosu, mention @dosu.