Azure / data-api-builder

Data API builder provides modern REST and GraphQL endpoints to your Azure Databases and on-prem stores.
https://aka.ms/dab/docs
MIT License

[Known Issue] Detect JSON columns in Azure SQL #444

Open yorek opened 2 years ago

yorek commented 2 years ago

We should be able to automatically detect if a column contains JSON data and avoid escaping it in the output. Right now if I have the following table:

create table s003b.todos
(
    id int not null constraint pk__s003b_todos primary key default (next value for s003b.globalId),
    category_id varchar(3) collate Latin1_General_BIN2 not null constraint fk__s003b_todos__s003b_categories references s003b.categories(id),
    title nvarchar(100) not null,
    [description] nvarchar(1000) null,
    tags nvarchar(max) null check ( isjson(tags)= 1 ),
    completed bit not null default(0)
)

and I store JSON data in the "tags" column, I get the following output:

{
    "data": {
        "todos": {
            "items": [
                {
                    "id": 10000,
                    "title": "item-001",
                    "completed": false,
                    "tags": "[{\"tag\":\"red\"}]",
                    "category": {
                        "id": "f",
                        "category": "Family"
                    }
                }
            ]
        }
    }
}

where tags contains encoded JSON, even though the content is valid JSON itself. Right now Azure SQL DB doesn't have a native JSON data type, but we can detect whether a column contains JSON data by looking for the check constraint that should have been created to allow only JSON data to be inserted (see the table definition above). This query returns which columns should be treated as JSON:

select 
    s.[name] as [schema_name],
    t.[name] as [table_name],
    c.[name] as [column_name],
    ck.[definition],
    case when (ck.[definition] like '%isjson(/[' + trim(c.[name]) + '/])=(1)%' escape '/') then 1 else 0 end as [isjson]
from 
    sys.check_constraints ck
inner join
    sys.tables t on ck.[parent_object_id] = t.[object_id]
inner join
    sys.columns c on t.[object_id] = c.[object_id] and ck.parent_column_id = c.[column_id]
inner join
    sys.schemas s on t.[schema_id] = s.[schema_id]
jarupatj commented 2 years ago

@gledis69 can you please take a look at this issue?

gledis69 commented 2 years ago

@yorek what would be the desired output for "tags" in this scenario? The " characters need to be escaped; otherwise they would terminate the string that represents the JSON. Another question that comes to mind: what is the use case here? Most libraries that work with strings should be able to handle escaped quotes.

yorek commented 2 years ago

@gledis69 given that in the tags columns there can only be valid JSON content, the expected output would be:

{
  "data": {
    "todos": {
      "items": [
        {
          "id": 10000,
          "title": "item-001",
          "completed": false,
          "tags": [
            {
              "tag": "red"
            }
          ],
          "category": {
            "id": "f",
            "category": "Family"
          }
        }
      ]
    }
  }
}

basically: just pass the JSON as is, without encoding it.
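The difference between the two outputs can be sketched in Python (a minimal illustration, not DAB's actual serialization code): serializing the stored column value as a plain string escapes the quotes, while parsing it first embeds it as real JSON.

```python
import json

# The "tags" column value as stored in SQL: a string containing JSON.
tags_value = '[{"tag":"red"}]'

# Treating the column as a plain string escapes the quotes on output:
escaped = json.dumps({"tags": tags_value})

# Parsing the column first embeds it as actual JSON instead:
raw = json.dumps({"tags": json.loads(tags_value)})

print(escaped)  # {"tags": "[{\"tag\":\"red\"}]"}
print(raw)      # {"tags": [{"tag": "red"}]}
```

Both outputs are valid JSON documents; the difference is whether the client receives `tags` as a string it must parse again, or as a nested array it can use directly.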

Aniruddh25 commented 1 year ago

No need to support Filter and Order for the JSON type, but Mutation support is needed.

Aniruddh25 commented 1 year ago

From @gledis69: This turned out to be more involved than I imagined. The core problem is that the DbDataAdapter approach we use to detect column types simply doesn't detect JSON, so we'd have to rewrite column type detection to query the db and analyze the response, similarly to what we do to detect foreign keys.

Aniruddh25 commented 1 year ago

Moving to Jan 2023 in light of the new finding.

Aniruddh25 commented 1 year ago

From @gledis69: JSON columns need to be handled in one of two ways:

1. Change the way we determine column types by querying the db directly, instead of using the C# interface we use right now, since that interface doesn't detect JSON. This is a fair amount of work, since the queries need to be written for each db and the types then need to be mapped, because type names differ across the db-s.
2. "Hack it": write db queries only to find the JSON columns in the tables, and override the results we get from the data interface for those columns.

Option 2 is probably more feasible. Part of me doesn't like it, since it is very particular and not a generally scalable solution; but realistically we already cover most of the commonly used types, so hopefully it is unlikely we would need to do this for other types in the future.

yorek commented 1 year ago

I think that option 1 will be needed anyway at some point. AFAIK we're using the ADO.NET native support for discovering metadata, but that will not work, for example, on Synapse. Maybe we can start with option 2, but let's plan for option 1.

Aniruddh25 commented 1 year ago

Not enough time to do this in Jan 2023 with competing deadlines. I agree, let's do option 1, since we need to support Synapse anyway; it's unfortunate that ADO.NET doesn't have that support yet.

aaronpowell commented 1 year ago

@Aniruddh25 do you know if adding this support is on the roadmap for the ADO.NET team? I can try and find a contact if needed.

aaronpowell commented 1 year ago

I was reflecting on this while working on some Blazor samples and how we'd best go about representing the JSON type in GraphQL.

Presently, we end up with a generated GraphQL type like so:

type Todos {
  id: Int!
  title: String!
  description: String!
  completed: Boolean!
  tags: String!
}

Looking at the original structure of tags we want it to map to a type like this:

type Tag {
  tag: String!
}

And that'd see us with a field tags: [Tag]! on the Todos type.

This isn't practical though: we can't deserialize every JSON field to work out what its "shape" is and, from that, what the generated GraphQL type should look like. And what happens when the JSON has optional fields? Would we have to deserialize a subset of the rows to work out what the possibilities look like?

Proposal - Custom Scalar types

Back to my idea: we could use custom scalar types for these JSON fields, generating a "unique type" for each JSON field, meaning our Todos type would now look like this:

scalar Todo_Tags

type Todos {
  id: Int!
  title: String!
  description: String!
  completed: Boolean!
  tags: Todo_Tags!
}

Admittedly, this won't really do anything different at the server level: it will still return a string field to the client. What it does mean is that the client can intercept the field and handle it.

If you're building a .NET client with Strawberry Shake, you can define custom serialization for a particular scalar type. urql also supports this, and it's possible with Apollo (although not well supported). And it won't have any real breaking downstream impact, as the change only masks the fact that the field is a String.

Alternatively it could be set as a JSON scalar type, but that would make it harder to have different serialization settings for different fields, in the case where you have multiple JSON columns in a SQL database.
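Client-side, that interception could look something like the following Python sketch: a hypothetical post-processing helper that walks a GraphQL response and parses the fields registered as JSON scalars. (Real clients like Strawberry Shake or urql would instead hook into their own scalar serializer registries; the helper name and approach here are illustrative only.)

```python
import json

def decode_json_scalars(obj, json_fields):
    """Recursively walk a decoded GraphQL response and json.loads any
    string field whose name is registered as a JSON scalar.
    Hypothetical helper, not part of any real client library."""
    if isinstance(obj, dict):
        return {
            key: json.loads(value)
            if key in json_fields and isinstance(value, str)
            else decode_json_scalars(value, json_fields)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [decode_json_scalars(item, json_fields) for item in obj]
    return obj

# The escaped response from the issue becomes usable nested JSON:
response = {"data": {"todos": {"items": [
    {"id": 10000, "title": "item-001", "tags": '[{"tag":"red"}]'}
]}}}
decoded = decode_json_scalars(response, {"tags"})
```

After decoding, `decoded["data"]["todos"]["items"][0]["tags"]` is a real list of objects rather than an escaped string.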

aaronpowell commented 1 year ago

Because curiosity meant that I needed to dig into this more I decided to look at what we need to do to identify JSON columns.

PostgreSQL (and I expect MySQL, although I didn't test it) should be reasonably easy to detect, as they have a json column type that is returned when using the conn.GetSchema("Columns") query:

(screenshot: the json column data type returned by GetSchema)

Azure SQL is a little harder, as JSON isn't a data type there; instead, it's a constraint on the table that we can query for, and from that we can find the column the constraint applies to. Here's a SQL query we can execute:

select schema_name(t.schema_id) + '.' + t.[name]  as [table],
    col.[name] as column_name,
    con.[definition]
from sys.check_constraints as con
    left outer join sys.objects as t
        on con.parent_object_id = t.object_id
    left outer join sys.all_columns as col
        on con.parent_column_id = col.column_id
        and con.parent_object_id = col.object_id
where lower(con.[definition]) like '%isjson%'
and con.is_disabled = 0
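As a sketch of how those query results could drive column classification, the rows returned by the query above could be matched against the normalized ISJSON definition that SQL Server stores. (The function name below is hypothetical; DAB's actual metadata code is C# and structured differently.)

```python
import re

def find_json_columns(constraint_rows):
    """Given (table, column, definition) rows from the sys.check_constraints
    query, return the set of (table, column) pairs whose CHECK constraint
    enforces ISJSON. Hypothetical helper for illustration."""
    json_cols = set()
    for table, column, definition in constraint_rows:
        # SQL Server stores normalized definitions like: (isjson([tags])=(1))
        pattern = rf"isjson\(\[{re.escape(column)}\]\)\s*=\s*\(1\)"
        if re.search(pattern, definition, re.IGNORECASE):
            json_cols.add((table, column))
    return json_cols
```

For the table in this issue, a row like `("s003b.todos", "tags", "(isjson([tags])=(1))")` would be flagged, while an ordinary check constraint such as `([completed]>=(0))` would not.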

With this we could add another property to ColumnDefinition of IsJson (or similar) and use that as part of the codegen for the GraphQL engine.