aws / aws-cli

Universal Command Line Interface for Amazon Web Services

Feature Request: Validate DagNode Type enum members in Glue CreateScript #4692

Closed MrGossett closed 2 years ago

MrGossett commented 4 years ago

aws glue create-script will generate a script to use for a Glue Job given a description of DAG nodes and the edges between them. Nodes can be sources, sinks, or transforms.

API Docs for the CodeGenNode structure show that its NodeType attribute is required, and that it's a UTF-8 string. The description says "The type of node that this is."

aws glue create-script help reinforces this:

       --dag-nodes (list)
          A list of the nodes in the DAG.

       Shorthand Syntax:

          Id=string,NodeType=string,Args=[{Name=string,Value=string,Param=boolean},{Name=string,Value=string,Param=boolean}],LineNumber=integer ...

       JSON Syntax:

          [
            {
              "Id": "string",
              "NodeType": "string",
              "Args": [
                {
                  "Name": "string",
                  "Value": "string",
                  "Param": true|false
                }
                ...
              ],
              "LineNumber": integer
            }
            ...
          ]

However, I can't find a list of supported NodeType values anywhere in the docs or in the CLI help.

Here is a JSON file describing the input to aws glue create-script:

{
  "DagNodes": [
    {
      "Id": "source",
      "NodeType": "DataSource",
      "Args": [
        { "Name": "database", "Value": "\"MyTestDatabase\"" },
        { "Name": "table_name", "Value": "\"MyTestTableSource\"" }
      ]
    },
    {
      "Id": "transform",
      "NodeType": "ResolveChoice",
      "Args": [{ "Name": "specs", "Value": "[('amount_due', 'cast:double')]" }]
    },
    {
      "Id": "sink",
      "NodeType": "DataSink",
      "Args": [
        { "Name": "database", "Value": "\"MyTestDatabase\"" },
        { "Name": "table_name", "Value": "\"MyTestTableSink\"" }
      ]
    }
  ],
  "DagEdges": [
    { "Source": "source", "Target": "transform" },
    { "Source": "transform", "Target": "sink" }
  ],
  "Language": "PYTHON"
}
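The same input can also be built programmatically. Below is a minimal Python sketch; the helper names (`py_literal`, `node`, `build_input`) are hypothetical and not part of the AWS CLI or boto3. It assumes that each Args Value is pasted into the generated script verbatim, which is why string arguments carry their own embedded quotes in the JSON above.

```python
import json


def py_literal(text):
    """Wrap text in double quotes so it lands in the script as a string literal."""
    return json.dumps(text)


def node(node_id, node_type, **args):
    """Build one DagNodes entry; each keyword argument becomes a Name/Value pair."""
    return {
        "Id": node_id,
        "NodeType": node_type,
        "Args": [{"Name": name, "Value": value} for name, value in args.items()],
    }


def build_input():
    """Recreate the source -> transform -> sink input shown above."""
    nodes = [
        node("source", "DataSource",
             database=py_literal("MyTestDatabase"),
             table_name=py_literal("MyTestTableSource")),
        # specs is already Python source text, so it is passed through unquoted.
        node("transform", "ResolveChoice",
             specs="[('amount_due', 'cast:double')]"),
        node("sink", "DataSink",
             database=py_literal("MyTestDatabase"),
             table_name=py_literal("MyTestTableSink")),
    ]
    # Chain the nodes in order: each node feeds the next one.
    ids = [n["Id"] for n in nodes]
    edges = [{"Source": s, "Target": t} for s, t in zip(ids, ids[1:])]
    return {"DagNodes": nodes, "DagEdges": edges, "Language": "PYTHON"}


if __name__ == "__main__":
    print(json.dumps(build_input(), indent=2))
```

Redirecting the printed JSON to input.json gives a file that can be passed with --cli-input-json file://input.json.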

Generating a script using that JSON input is successful:

$ aws glue create-script --cli-input-json file://input.json
{
    "PythonScript": "import sys\nfrom awsglue.transforms import *\nfrom awsglue.utils import getResolvedOptions\nfrom pyspark.context import SparkContext\nfrom awsglue.context import GlueContext\nfrom awsglue.job import Job\n\n## @params: [JOB_NAME]\nargs = getResolvedOptions(sys.argv, ['JOB_NAME'])\n\nsc = SparkContext()\nglueContext = GlueContext(sc)\nspark = glueContext.spark_session\njob = Job(glueContext)\njob.init(args['JOB_NAME'], args)\n## @type: DataSource\n## @args: [database = \"MyTestDatabase\", table_name = \"MyTestTableSource\", transformation_ctx = \"source\"]\n## @return: source\n## @inputs: []\nsource = glueContext.create_dynamic_frame.from_catalog(database = \"MyTestDatabase\", table_name = \"MyTestTableSource\", transformation_ctx = \"source\")\n## @type: ResolveChoice\n## @args: [specs = [('amount_due', 'cast:double')], transformation_ctx = \"transform\"]\n## @return: transform\n## @inputs: [frame = source]\ntransform = ResolveChoice.apply(frame = source, specs = [('amount_due', 'cast:double')], transformation_ctx = \"transform\")\n## @type: DataSink\n## @args: [database = \"MyTestDatabase\", table_name = \"MyTestTableSink\", transformation_ctx = \"sink\"]\n## @return: sink\n## @inputs: [frame = transform]\nsink = glueContext.write_dynamic_frame.from_catalog(frame = transform, database = \"MyTestDatabase\", table_name = \"MyTestTableSink\", transformation_ctx = \"sink\")\njob.commit()"
}
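The response wraps the whole script in a single JSON string, which is hard to read. Here is a minimal Python sketch for unpacking it, assuming the CLI output has been captured as text. `extract_script` is a hypothetical helper, and the `ScalaCode` fallback is my assumption about what a Scala request would return; only `PythonScript` appears in the output above.

```python
import json


def extract_script(response_text):
    """Return the generated script from the JSON printed by `aws glue create-script`."""
    response = json.loads(response_text)
    # PythonScript is the key shown in the output above; ScalaCode is assumed
    # to be the equivalent key when Language is SCALA.
    script = response.get("PythonScript") or response.get("ScalaCode")
    if script is None:
        raise KeyError("no PythonScript or ScalaCode key in response")
    return script


if __name__ == "__main__":
    # Shortened stand-in for the real response.
    sample = '{"PythonScript": "import sys\\nsc = SparkContext()\\njob.commit()"}'
    print(extract_script(sample))
```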

However, if I change the transformation from ResolveChoice to Map, I get an error.

Here is the updated input.json:

{
  "DagNodes": [
    {
      "Id": "source",
      "NodeType": "DataSource",
      "Args": [
        { "Name": "database", "Value": "\"MyTestDatabase\"" },
        { "Name": "table_name", "Value": "\"MyTestTableSource\"" }
      ]
    },
    {
      "Id": "transform",
      "NodeType": "Map",
      "Args": [{ "Name": "f", "Value": "my_custom_function" }]
    },
    {
      "Id": "sink",
      "NodeType": "DataSink",
      "Args": [
        { "Name": "database", "Value": "\"MyTestDatabase\"" },
        { "Name": "table_name", "Value": "\"MyTestTableSink\"" }
      ]
    }
  ],
  "DagEdges": [
    { "Source": "source", "Target": "transform" },
    { "Source": "transform", "Target": "sink" }
  ],
  "Language": "PYTHON"
}

Notice the only thing that has changed is the definition of the transform node.

The create-script action now returns an error:

$ aws glue create-script --cli-input-json file://input.json

An error occurred (InvalidInputException) when calling the CreateScript operation: Unknown NodeType Map in GenerateCode

Apparently Map is not supported, but ResolveChoice is supported.
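When scripting around this, the rejected type can be pulled out of the error text. This is only a sketch: it assumes the message always has the exact form shown above ("Unknown NodeType ... in GenerateCode"), which is not documented anywhere I can find, and `unknown_node_type` is a hypothetical name.

```python
import re

# Pattern copied from the error message in this issue; the wording is an
# observed format, not a documented contract.
_UNKNOWN_NODE_TYPE = re.compile(r"Unknown NodeType (\S+) in GenerateCode")


def unknown_node_type(message):
    """Return the rejected NodeType named in the error message, or None."""
    match = _UNKNOWN_NODE_TYPE.search(message)
    return match.group(1) if match else None


if __name__ == "__main__":
    error = (
        "An error occurred (InvalidInputException) when calling the "
        "CreateScript operation: Unknown NodeType Map in GenerateCode"
    )
    print(unknown_node_type(error))
```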

It would be very helpful if the documentation listed which transforms the aws glue create-script action supports.

MrGossett commented 4 years ago

I ran a brute-force search, substituting each of the transforms listed in the PySpark Transforms section of the Glue docs into my example input above. Here are my results:

Transform               Result
ApplyMapping            supported ✅
DropFields              supported ✅
DropNullFields          supported ✅
ErrorsAsDynamicFrame    unsupported ❌
Filter                  unsupported ❌
FlatMap                 unsupported ❌
Join                    supported ✅
Map                     unsupported ❌
MapToCollection         unsupported ❌
Relationalize           supported ✅
RenameField             supported ✅
ResolveChoice           supported ✅
SelectFields            supported ✅
SelectFromCollection    unsupported ❌
Spigot                  supported ✅
SplitFields             supported ✅
SplitRows               supported ✅
Unbox                   supported ✅
UnnestFrame             unsupported ❌
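Until the docs list the valid values, these results can double as a local pre-flight check. A minimal Python sketch follows; the set only reflects the results in this table (plus DataSource and DataSink from the working example), not an official list, and `unsupported_node_types` is a hypothetical helper.

```python
# NodeType values observed to work in this thread. This is an empirical
# snapshot, not an official enum, and it may be incomplete or out of date.
OBSERVED_SUPPORTED = frozenset({
    "DataSource", "DataSink",
    "ApplyMapping", "DropFields", "DropNullFields", "Join",
    "Relationalize", "RenameField", "ResolveChoice", "SelectFields",
    "Spigot", "SplitFields", "SplitRows", "Unbox",
})


def unsupported_node_types(dag_nodes):
    """Return NodeType values not in the observed set, in first-seen order."""
    flagged = []
    for dag_node in dag_nodes:
        node_type = dag_node["NodeType"]
        if node_type not in OBSERVED_SUPPORTED and node_type not in flagged:
            flagged.append(node_type)
    return flagged


if __name__ == "__main__":
    nodes = [
        {"Id": "source", "NodeType": "DataSource"},
        {"Id": "transform", "NodeType": "Map"},
        {"Id": "sink", "NodeType": "DataSink"},
    ]
    print(unsupported_node_types(nodes))
```

Running this against the DagNodes list before calling create-script flags a likely InvalidInputException early, though a type missing from the set may still be accepted by the service.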

bisdavid commented 4 years ago

@MrGossett, I've filed an internal ticket to pass this request on to the Glue doc writing team. They own the content that ends up in this particular CLI description. I'll ask them to flesh out the meaning of the NodeType element. Thanks for the feedback!

(V156194273)

MrGossett commented 4 years ago

@bisdavid any idea if an update to the Glue docs is planned?

kdaily commented 4 years ago

Hi @MrGossett, I confirmed that the Glue team is aware of the issue, but there is no ETA for when it will be changed.

vivshri commented 3 years ago

I ran into the same issue. This input works for me:

{
  "DagNodes": [
    {
      "Id": "DataSource0",
      "NodeType": "DataSource",
      "Args": [
        { "Name": "database", "Value": "mydatabase_source" },
        { "Name": "table_name", "Value": "mytable_source" },
        { "Name": "transformation_ctx", "Value": "DataSource0" }
      ]
    },
    {
      "Id": "Transform1",
      "NodeType": "CustomCode",
      "Args": [
        { "Name": "code", "Value": "pass" },
        { "Name": "className", "Value": "MyTransform" },
        { "Name": "dynamicFrameConstruction", "Value": "DynamicFrameCollection{\"DataSource0\":DataSource0}" },
        { "Name": "classification", "Value": "Transform" },
        { "Name": "dfc", "Value": "Transform1" },
        { "Name": "transformation_ctx", "Value": "Transform1" }
      ]
    },
    {
      "Id": "Transform0",
      "NodeType": "SelectFromCollection",
      "Args": [
        { "Name": "key", "Value": "list(Transform1.keys())[0]" },
        { "Name": "transformation_ctx", "Value": "Transform0" }
      ]
    },
    {
      "Id": "DataSink0",
      "NodeType": "DataSink",
      "Args": [
        { "Name": "database", "Value": "mydatabase_sink" },
        { "Name": "table_name", "Value": "mytable_sink" },
        { "Name": "transformation_ctx", "Value": "DataSink0" }
      ]
    }
  ],
  "DagEdges": [
    { "Source": "DataSource0", "Target": "Transform1" },
    { "Source": "Transform1", "Target": "Transform0" },
    { "Source": "Transform0", "Target": "DataSink0" }
  ],
  "Language": "PYTHON"
}
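With more nodes in the graph it gets easier to mistype an Id in DagEdges. Here is a small Python sketch that checks every edge against the declared node Ids before the input is sent; `dangling_edges` is a hypothetical helper, and how the service itself reacts to such an edge is not something this thread establishes.

```python
def dangling_edges(payload):
    """Return the DagEdges entries whose Source or Target is not a declared node Id."""
    node_ids = {dag_node["Id"] for dag_node in payload["DagNodes"]}
    return [
        edge
        for edge in payload["DagEdges"]
        if edge["Source"] not in node_ids or edge["Target"] not in node_ids
    ]


if __name__ == "__main__":
    payload = {
        "DagNodes": [
            {"Id": "DataSource0", "NodeType": "DataSource"},
            {"Id": "Transform1", "NodeType": "CustomCode"},
            {"Id": "Transform0", "NodeType": "SelectFromCollection"},
            {"Id": "DataSink0", "NodeType": "DataSink"},
        ],
        "DagEdges": [
            {"Source": "DataSource0", "Target": "Transform1"},
            {"Source": "Transform1", "Target": "Transform0"},
            {"Source": "Transform0", "Target": "DataSink0"},
        ],
    }
    print(dangling_edges(payload))
```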

github-actions[bot] commented 2 years ago

Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.