awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.
Other
642 stars 304 forks source link

Parse wrong first record when using create_dynamic_frame_from_options for json file. #68

Open jimmymaise opened 4 years ago

jimmymaise commented 4 years ago

Parse wrong first record when using create_dynamic_frame_from_options for json file.

            s3_json_file = "s3a://bucket_name/folder_name/sample_glue_test.json"

            data_source_frame = glueContext.create_dynamic_frame_from_options(
                connection_type='s3',
                connection_options={'paths': [s3_json_file]},
                format='json')

            data_source_frame.show()

Input file (sample_glue_test.json):

[
  {
    "event_date": "20200629",
    "event_params": [
      {
        "key": "firebase_screen_id",
        "value": {
          "string_value": null,
          "int_value": "888",
          "float_value": null,
          "double_value": null
        }
      },
      {
        "key": "firebase_event_origin",
        "value": {
          "string_value": "auto",
          "int_value": null,
          "float_value": null,
          "double_value": null
        }
      }
    ],
    "user_id": "123",
    "platform": "IOS"
  },
    {
    "event_date": "20200729",
    "event_params": [
      {
        "key": "firebase_screen_id",
        "value": {
          "string_value": null,
          "int_value": "999",
          "float_value": null,
          "double_value": null
        }
      },
      {
        "key": "firebase_event_origin",
        "value": {
          "string_value": "manual",
          "int_value": null,
          "float_value": null,
          "double_value": null
        }
      }
    ],
    "user_id": "456",
    "platform": "Android"
  }
]

Expected Output

{"event_date": "20200629", "event_params": [{"key": "firebase_screen_id", "value": {"string_value": null, "int_value": "888", "float_value": null, "double_value": null}}, {"key": "firebase_event_origin", "value": {"string_value": "auto", "int_value": null, "float_value": null, "double_value": null}}], "user_id": "123", "platform": "IOS"}

{"event_date": "20200729", "event_params": [{"key": "firebase_screen_id", "value": {"string_value": null, "int_value": "999", "float_value": null, "double_value": null}}, {"key": "firebase_event_origin", "value": {"string_value": "manual", "int_value": null, "float_value": null, "double_value": null}}], "user_id": "456", "platform": "Android"}

Actual Output

{"key": "firebase_screen_id", "value": {"string_value": null, "int_value": "888", "float_value": null, "double_value": null}}
{"key": "firebase_event_origin", "value": {"string_value": "auto", "int_value": null, "float_value": null, "double_value": null}}

{"event_date": "20200729", "event_params": [{"key": "firebase_screen_id", "value": {"string_value": null, "int_value": "999", "float_value": null, "double_value": null}}, {"key": "firebase_event_origin", "value": {"string_value": "manual", "int_value": null, "float_value": null, "double_value": null}}], "user_id": "456", "platform": "Android"}
khalidjaz commented 3 years ago

+1

mikepro123 commented 3 years ago

@khalidjaz AWS supported me this case several month ago. Now, it works well. Here it is

Dear AWS customer,

Thank you for contacting AWS Support, my name is Safari, I will assist with your case.

I am sorry to see that you've had trouble while reading json data from S3 using create_dynamic_frame_from_options() function.

I was able to reproduce the issue using the sample data that you provided. I don't believe the issue to be a bug, but rather, it is the result of the JSON classifier internally used to infer the schema from your data when you invoke the function. As noted in our public documentation, even if you were to crawl your data prior to querying it, classifiers built into AWS Glue may not always infer the exact data schema, in which case, you'd have to supply a custom classifier [1].
Nonetheless, if you still want to use the same function, you can try getting around the limitation by supplying format_options={'jsonPath': '$.*'} as one of the function parameters. As explained in the documentation link below [1], "$.*" JSON path instructs the classifier to use all objects in your json data. 
Your data_source_frame object should become:
data_source_frame = glueContext.create_dynamic_frame_from_options(
                connection_type='s3',
                connection_options={'paths': [s3_json_file]},
                format='json', format_options={'jsonPath': '$.*'})

Additionally, you may also get around the issue by reading the data with a DataFrameReader as shown in the code snippet below:

data_source_frame = glueContext.read.json(s3_json_file, multiLine=True)

I hope this helps. Please let me know if I can provide you with any other assistance.

References
[1]: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json
khalidjaz commented 3 years ago

@mikepro123 That works - thank you!