[Proposal]AI assisted data mapper for Ballerina

sahanHe commented 4 months ago

Summary

Introducing an AI-powered Data Mapper designed to automate the task of correlating input and output record fields, thereby enhancing the developer experience.

Goals

Directly map fields in the input and output records.
Map fields in the input and output with intermediate operations.
Keeping the precision of the model high.

Non-Goals

Providing support for all the transformations available in the Datamapper.
Automate nested mappings where operands of the operator are output of another operation.

Motivation

Incorporating AI-assisted features into programming significantly boosts developer productivity, making it a compelling feature for any programming language. In integration tasks, mapping fields between input and output records manually can be cumbersome, especially for large datasets. Integration platforms have addressed this by introducing automated suggestions for mappings, enhancing the developer experience. Given the surge in AI advancements, a fully operational automatic data mapper not only serves as an attractive marketing tool but also enhances customer retention by streamlining the development process and saving time.

Success Metrics

The rate of correct mappings to the expected number of mappings. (>70%)
The rate of incorrect mappings to the expected number of mappings after filtration. (<10%)
Percentage of unmapped fields(ones that should have been mapped) to the expected number of mappings.(<40%)
Precision(>80%)
Execution time. (estimated time : 60s or below for 20 fields).

Description

The approach will employ prompt engineering methods to direct GPT 3.5/4.0 towards producing accurate mappings. Considering the GPT models' limited understanding of Ballerina syntaxes and the need to offer a universal data mapping platform for use across different products, we will adopt the following architecture.

Initially, input and output records will be converted into an intermediate JSON format to improve the GPT model's understanding. Where needed, comments will be added to fields through an API call to a Large Language Model (LLM). Subsequently, data mapping will be executed using natural language operations, like "DIRECT" and "SPLIT," easily interpreted by the LLM. To conclude, we'll validate the mappings by comparing them with the input and output types before methodically translating them into Ballerina syntax.

The possible mappings, consistent with our objectives and constraints, can be represented in a tree diagram. At the top level, these mappings are divided into two main categories: those that require an intermediate operation and those that do not. From there, we can further break down the categories into subproblems based on the specific characteristics of the mapping and the structural adjustments involved.

From the outlined mappings, direct mappings (those not requiring any operations) will be implemented in the first phase. To evaluate the model's effectiveness in handling direct mappings, following scenarios were tested using GPT-4.

Sample Id	Scenario	passed / failed

1.Mapping Without Operations 1.1Direct (one to one) mappings where type and field name are the same. 1 | Two non nested records | passed 2 | Two non nested records where the input contains five more fields compared to the output. | passed 3 | Two non nested records where output contains five more fields compared to the input. | passed 4 | Two non nested records where the output and input both contain five additional fields. | passed 5 | Two records where the output is nested. | passed 6 | Two records where the input is nested. | passed 7 | Two nested records. | passed 8 | Two records in the input mapping to a single record in the output. | passed 1.2. Direct(one to one) mappings where types are the same, but field names are syntactically different. 9 | Two non nested records. | passed 10 | Two non nested records where the input contains five more fields compared to the output. | passed 11 | Two non nested records where the output contains five more fields compared to the input. | passed 12 | Two non nested records where both the input and output contain five more fields that are not mapped. | passed 13 | Two records where the input contains nested records | passed 14 | Two records where the output contains nested records | passed 15 | Two records where both contain nested records. | passed 16 | Two records in the input mapping to a single record in the output. | passed 2.Mapping with Operations 17 | Mapping between input and output records where many operations are required to map the fields in input and output. | passed

Examples 1 through 16 assess the model's performance in data mapping scenarios that don't involve any intermediate operations. Example 17, on the other hand, explores the model's capability to generate mappings that include operations. These more complex mappings are planned for implementation in future iterations.

The cost estimations were not carried out for this proposal.

Alternatives

Generate the Ballerina code directly from the LLM
Pretrain LLM with Ballerina codes
Finetune LLM for data mapping

Testing

A testing pipeline is developed for evaluating the generation of intermediate mappings. This pipeline will run a series of test cases that cover the essential branches of the conceptual mapping tree, aiming to scrutinize the data mapper's effectiveness against predefined success metrics. The process will initiate by adding comments to relevant sections within the JSON files. Subsequently, the generated mappings will be cross-referenced with the anticipated outcomes to measure accuracy and performance.

To address the observed decline in output accuracy with extended input prompts, the number of fields in the records was capped at 20. This limitation is intended to be temporary, with plans to expand the field count in future iterations. Research indicates, as seen in reference [1], that when the input context exceeds 2,000 tokens, the model's attention towards the central part of the context significantly diminishes. Therefore, to maintain the total number of tokens within a manageable limit of approximately 4,000, it was necessary to restrict the number of fields to 20.

Risks Assumptions

Risks

New updates to the github copilot can make it more knowledgeable in Ballerina, making the need of an automated datamapper obsolete.
Increasing number of hallucinations for the intermediate representation.
Assumptions

In practical data mapping scenarios, only a subset of the available operations in the Ballerina data mapper will be used. OpenAI keeps the versioning of their modules without change.

References

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.

xlight05 commented 3 months ago

Please find the overall architecture below:

When the AutoMap button is pressed, if the user hasn't logged in already, user will be redirected to Asgardeo to complete the oauth process.
Once the user is logged in, the request will be sent to the backend along with datamapper-related details and access code.
Choreo gateway validates the access token with asgardeo
if the access token is valid, the request will be forwarded to the AI Generator backend.
Prompt which was augmented by the backend will be sent to OpenAI API and then it'll be sent back to the VSCode to render the data mapping.

xlight05 commented 3 months ago

As for the AI Generator Backend, As of now, it takes IR as the input payload and outputs ballerina code. The goal is to move away from this architecture and implement a backend API independent from Ballerina.

The backend should accept and return a language-independent json format. The Input model should contain details on Data groups, fields and their attributes. The output model should contain details on which fields map to which fields with required operations required for mappings.

When onboarding a language, we need to implement an input adapter and output adapter. Proposed Ballerina design is shown below. We can implement something similar for other languages too from the same backend if needed. (Ex - Synapse)

[1] - Ballerina code is parsed to Input Model with the help of compiler APIs. [2] - Input Model is sent to the backend. [3] - Output Model is sent back to the vscode. [4] - Output model Adapter converts the data mappings and generates the ballerina code.

As this proposed change is a breaking change, to facilitate this we need to think about how we handle such scenarios in future too.

As a solution, We introduce backend versions when we introduce something that is not backward compatible. This is supported by choreo deployment tracks.

sahanHe commented 1 month ago

To make the datamapper backend separate from Ballerina, the API will have the following JSON structure.

Request Payload

{
 JSON inputs;
 JSON inputMetadata;
 JSON output;
 JSON outputMetadata;
}

each field would have the following structures,

inputs

{
    <record_name>: {
          <field_name>:{
            "type":<field_type>,
            "comment":<comment>
          },
          .....
    },
}
.....

inputMetadata

{
    <record_name>: {
          <field_name>:{
            "parameterType":<parameter type>,
            "parameterName":<parameter name>,
            "isArrayType": <true if i's an array>,
            "type": <general type of the parameter(record, typereferece, etc..)>
            "fields" : <subfields if the field is a record>?
          },
          .....
    },
}

output

{
     <field_name>:{
        "type":<field_type>,
        "comment":<comment>
     },
    .....
}
.....

outputMetadata

{
      <field_name>:{
         "parameterType":<parameter type>,
         "parameterName":<parameter name>,
         "isArrayType": <true if i's an array>,
         "type": <general type of the parameter(record, typereferece, etc..)>
         "fields" : <subfields if the field is a record>?
      },
     .....
}

The output payload would take the following form.

{
    JSON mapping;
}

mapping JSON will have the following structure,

{ 
  <field Id>":{
    "operation":<operation name>,
    "target":"<target field name of the output>",
    "parameters": <array of parameters in order of the appearence>
   }

Following is a example mapping,

{
   "reservationId":{
      "operation":"DIRECT",
      "target":"reservationId",
      "parameters":[
         "bookingRequest.bookingId"
      ]
   },
   "guest":{
      "firstName":{
         "operation":"DIRECT",
         "target":"firstName",
         "parameters":[
            "bookingRequest.guest.firstName"
         ]
      },
      "lastName":{
         "operation":"DIRECT",
         "target":"lastName",
         "parameters":[
            "bookingRequest.guest.lastName"
         ]
      },
      "email":{
         "operation":"DIRECT",
         "target":"email",
         "parameters":[
            "bookingRequest.guest.email"
         ]
      },
      "phoneNumber":{
         "operation":"DIRECT",
         "target":"phoneNumber",
         "parameters":[
            "bookingRequest.guest.phoneNumber"
         ]
      }
   },
   "bookingDetails":{
      "hotelId":{
         "operation":"DIRECT",
         "target":"hotelId",
         "parameters":[
            "bookingRequest.bookingDetails.hotelId"
         ]
      },
      "roomType":{
         "operation":"DIRECT",
         "target":"roomType",
         "parameters":[
            "bookingRequest.bookingDetails.roomType"
         ]
      },
      "checkInDate":{
         "operation":"DIRECT",
         "target":"checkInDate",
         "parameters":[
            "bookingRequest.bookingDetails.checkInDate"
         ]
      },
      "checkOutDate":{
         "operation":"DIRECT",
         "target":"checkOutDate",
         "parameters":[
            "bookingRequest.bookingDetails.checkOutDate"
         ]
      },
      "numberOfGuests":{
         "operation":"DIRECT",
         "target":"numberOfGuests",
         "parameters":[
            "bookingRequest.bookingDetails.numberOfGuests"
         ]
      },
      "kidsAges":{
         "operation":"DIRECT",
         "target":"kidsAges",
         "parameters":[
            "bookingRequest.bookingDetails.kidsAges"
         ]
      },
      "numberOfKids":{
         "operation":"LENGTH",
         "target":"numberOfKids",
         "parameters":[
            "bookingRequest.bookingDetails.kidsAges"
         ]
      }
   }
}

ballerina-platform / ballerina-library