Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License

[BUG]azure-cosmos-spark is not able to read array type field as string #40837

Open ian-liao-databricks opened 4 days ago

ian-liao-databricks commented 4 days ago

Describe the bug In a Cosmos container, a field can have a different structure from item to item. One of the best approaches is to read the field as a string, as schema inference would do. However, the connector is not able to read a JSON array as a string.

Exception or Stack Trace NA

To Reproduce Create a Cosmos container with two items:

{
    "id": "test_item",
    "Data": [
        {
            "a": "x",
            "b": "y",
            "c": "y"
        }
    ]
}

and

{
    "id": "test_item_2",
    "Data": {
        "a": "b"
    }
}

Query this container from Spark using schema inference. Null is returned for the first item's Data column.
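
To make the conflict concrete, here is a minimal plain-Python sketch that reports which JSON types a field takes across the two items above. The helper names `json_type` and `field_types` are hypothetical and exist only for illustration; this is not how the connector's schema inference is implemented.

```python
import json

def json_type(value):
    """Map a decoded JSON value to its JSON type name."""
    # bool must be checked before int because bool is a subclass of int.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def field_types(raw_items, field):
    """Collect the set of JSON types a top-level field takes across items."""
    return {json_type(json.loads(raw).get(field)) for raw in raw_items}

# The two items from the reproduction steps, as raw JSON strings.
raw_items = [
    '{"id": "test_item", "Data": [{"a": "x", "b": "y", "c": "y"}]}',
    '{"id": "test_item_2", "Data": {"a": "b"}}',
]

data_types = field_types(raw_items, "Data")  # more than one type: conflict
id_types = field_types(raw_items, "id")      # a single type: no conflict
```

A field with more than one entry in its type set is the kind of field this issue is about: no single structured type fits every item, so a string representation is the natural fallback.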

Code Snippet

query = 'SELECT * FROM test_container'
read_config = {
    "spark.cosmos.accountEndpoint": 'https://*****.documents.azure.com:443/',
    "spark.cosmos.accountKey": '********',
    "spark.cosmos.database": 'test_database',
    "spark.cosmos.container": 'test_container',
    "spark.cosmos.read.customQuery": query,
    "spark.cosmos.read.inferSchema.enabled": "true"
}
df = spark.read.format("cosmos.oltp").options(**read_config).load()
df.display()

Expected behavior Data column should return the JSON array below as a string

[
        {
            "a": "x",
            "b": "y",
            "c": "y"
        }
]



Additional context NA


FabianMeiswinkel commented 4 days ago

The current behavior is as intended when schemas conflict. Customers who want to customize this can always define their own schema, or disable schema inference, which returns the exact JSON payload in the _rawBody column so that any custom schema inference can be done on top of it.

ian-liao-databricks commented 4 days ago

Disabling schema inference and providing a custom-defined schema doesn't help. I tried defining the schema as StructField('Data', StringType(), True) and still got nulls for arrays. It feels more like a bug because the user specifically asks for a string.

Disabling schema inference and NOT providing a custom schema works as a workaround. _rawBody returns the whole item as a string, and the user can parse it further in Spark.
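
The per-item parsing step of that workaround can be sketched in plain Python. This is a minimal illustration, assuming each _rawBody value is the full item as a JSON string; the helper name `data_as_string` is hypothetical. Inside Spark, the same logic could be wrapped in a UDF, or a built-in such as `get_json_object` from `pyspark.sql.functions` could likely be applied to the _rawBody column instead.

```python
import json
from typing import Optional

def data_as_string(raw_body: str, field: str = "Data") -> Optional[str]:
    """Return one top-level field of a raw JSON item as a string.

    Arrays and objects are re-serialized as compact JSON, strings are
    returned unchanged, and a missing or null field yields None.
    """
    item = json.loads(raw_body)
    value = item.get(field)
    if value is None:
        return None
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"))

# Stand-ins for the _rawBody values of the two items in this issue.
raw_bodies = [
    '{"id": "test_item", "Data": [{"a": "x", "b": "y", "c": "y"}]}',
    '{"id": "test_item_2", "Data": {"a": "b"}}',
]

data_strings = [data_as_string(raw) for raw in raw_bodies]
```

With this approach both the array-shaped and the object-shaped Data values come back as strings, which is the behavior the issue expects from a StringType column.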


FabianMeiswinkel commented 4 days ago

Thanks - the custom schema with StringType should work even for an array. I will reactivate this GitHub issue to track investigating/fixing that part.

ian-liao-databricks commented 4 days ago

Sounds good, thanks!
