amosproj / amos2023ws05-pipeline-config-chat-ai

MIT License

Q[2, a - g] Query testing, optimization and verification #63

Closed AviKatziuk closed 9 months ago

AviKatziuk commented 11 months ago

User story

  1. As SDs.
  2. We want to test and improve the bot's capabilities.
  3. So that it can successfully handle a wide range of requests.

Testing week 2 queries

For every query, prefix it with "I would like to use RTDIP components..." followed by the query itself.

Q[2,a]: I would like to use RTDIP components to read from PythonDeltaSource, transform using BaseRawToMDMTransformer then write to EVMContractDestination

Q[2,b]: I would like to use RTDIP components to read from PythonDeltaSharingSource, transform using SEMJsonToPCDMTransformer then write to SparkDeltaDestination

Q[2,c]: I would like to use RTDIP components to read from SparkDeltaSource, transform using BaseRawToMDMTransformer then write to SparkKafkaDestination

Q[2,d]: I would like to use RTDIP components to read from PythonDeltaSource, transform using BaseRawToMDMTransformer then write to SparkKafkaEventhubDestination

Q[2,e]: I would like to use RTDIP components to read from PythonDeltaSource, transform using PandasToPySparkTransformer then write to SparkDeltaMergeDestination

Q[2,f]: I would like to use RTDIP components to read from SparkDeltaSource, transform using SEMJsonToPCDMTransformer then write to SparkKafkaEventhubDestination

Acceptance criteria

  1. Run the queries and document the given output and response speed as a comment here (a timing sketch follows this list).
  2. Run each query in a fresh session and ensure caching is off.
  3. If you think the queries are correct, send them to Shell.
  4. If the query is incorrect, edit the RAG to improve the results.
  5. If the changes to the RAG are large enough to constitute a defined task, talk to Avi to set up one on the board.
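
A minimal timing sketch for criterion 1, in case it helps with documenting response speed; ask_bot here is a hypothetical placeholder for however the chat bot is actually invoked, so only the timing pattern is meant to carry over.

import time

def ask_bot(query: str) -> str:
    # Hypothetical placeholder: replace with the actual call to the chat bot.
    raise NotImplementedError

query = (
    "I would like to use RTDIP components to read from PythonDeltaSource, "
    "transform using BaseRawToMDMTransformer then write to EVMContractDestination"
)

start = time.perf_counter()
response = ask_bot(query)
elapsed = time.perf_counter() - start

print(f"Response time: {elapsed:.2f} s")
print(response)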

Definition of done (DoD)

Feature DoD:

  1. Code review has been completed and code has been merged.
  2. User interaction tests pass on all major browsers.

Sprint Release DoD:

  1. Project builds, deploys, and tests successfully.

Project Release Definition of Done

  1. User interaction tests pass on all major browsers.
  2. Design documentation has been updated.

DoD general criteria

AviKatziuk commented 10 months ago

Q [2,a] I would like to use RTDIP components to read from PythonDeltaSource, transform using BaseRawToMDMTransformer then write to EVMContractDestination by @lyndanajjar

from rtdip.pipeline import Pipeline
from rtdip.components.sources.python.delta import PythonDeltaSource
from rtdip.components.destinations.spark.evm_contract import EVMContractDestination

# Create a pipeline

pipeline = Pipeline()

# Add a Delta source component

delta_source = PythonDeltaSource(table_name='your_delta_table')
pipeline.add_component(delta_source)

# Add a destination component

evm_contract_destination = EVMContractDestination(contract_id='your_contract_id', connection_string='your_connection_string')
pipeline.add_component(evm_contract_destination)

# Connect the components

pipeline.connect(delta_source, evm_contract_destination)

# Run the pipeline

pipeline.run()
AviKatziuk commented 10 months ago

Q [2,b] I would like to use RTDIP components to read from PythonDeltaSharingSource, transform using SEMJsonToPCDMTransformer then write to SparkDeltaDestination by @lyndanajjar

from rtdip.pipeline import Pipeline
from rtdip.components.sources.python.delta_sharing import PythonDeltaSharingSource
from rtdip.components.transformers.spark.sem_json_to_pcdm import SEMJsonToPCDMTransformer
from rtdip.components.destinations.spark.delta import SparkDeltaDestination

# Create the pipeline

pipeline = Pipeline()

# Add the components to the pipeline

source = PythonDeltaSharingSource("path_to_delta_table")
transformer = SEMJsonToPCDMTransformer()
destination = SparkDeltaDestination("path_to_delta_table")

pipeline.add_component(source)
pipeline.add_component(transformer)
pipeline.add_component(destination)

# Connect the components in the pipeline

source.connect(transformer)
transformer.connect(destination)

# Execute the pipeline

pipeline.execute()
AviKatziuk commented 10 months ago

Q [2,c] I would like to use RTDIP components to read from SparkDeltaSource, transform using BaseRawToMDMTransformer then write to SparkKafkaDestination by @lyndanajjar

from rtdip_sdk.pipelines.sources import SparkDeltaSource
from rtdip_sdk.pipelines.transformers import BaseRawToMDMTransformer
from rtdip_sdk.pipelines.destinations import SparkKafkaDestination

def pipeline():
    source_df = SparkDeltaSource(
        spark=spark,
        table_name="delta_table_name"
    ).read_batch()

    transform_df = BaseRawToMDMTransformer(
        spark=spark,
        data=source_df
    ).transform()

    SparkKafkaDestination(
        data=transform_df,
        options={
            "kafka.bootstrap.servers": "localhost:9092",
            "topic": "output_topic"
        }
    ).write_batch()

if __name__ == "__main__":
    pipeline()
AviKatziuk commented 10 months ago

Q [2,d] I would like to use RTDIP components to read from PythonDeltaSource, transform using BaseRawToMDMTransformer then write to SparkKafkaEventhubDestination by @Nahrain1

from rtdip.pipeline import Pipeline
from rtdip.components.sources.python.delta import PythonDeltaSource
from rtdip.components.transformers.python.base_raw_to_mdm import BaseRawToMDMTransformer
from rtdip.components.destinations.spark.kafka_eventhub import SparkKafkaEventhubDestination

# Create the pipeline
pipeline = Pipeline()

# Add the source component
source = PythonDeltaSource(
    table_name="your_delta_table_name"
)
pipeline.add_component(source)

# Add the transformer component
transformer = BaseRawToMDMTransformer()
pipeline.add_component(transformer)

# Add the destination component
destination = SparkKafkaEventhubDestination(
    broker_list="your_kafka_broker_list",
    topic="your_kafka_topic"
)
pipeline.add_component(destination)

# Connect the components
pipeline.connect(source, transformer)
pipeline.connect(transformer, destination)

# Run the pipeline
pipeline.run()
AviKatziuk commented 10 months ago

Q [2,e] I would like to use RTDIP components to read from PythonDeltaSource, transform using PandasToPySparkTransformer then write to SparkDeltaMergeDestination by @Nahrain1

from rtdip_sdk.pipelines.sources import PythonDeltaSource
from rtdip_sdk.pipelines.transformers import PandasToPySparkTransformer
from rtdip_sdk.pipelines.destinations import SparkDeltaMergeDestination

def pipeline():
    source_df = PythonDeltaSource(
        delta_table_path="path/to/delta/table"
    ).read_batch()

    transformed_df = PandasToPySparkTransformer(
        data=source_df
    ).transform()

    SparkDeltaMergeDestination(
        data=transformed_df,
        options={
            "mergeCondition": "merge_condition",
            "mergeKey": "merge_key",
            "updateColumns": ["col1", "col2"],
            "deleteColumns": ["col3"]
        },
        destination="path/to/merge/destination"
    ).write_batch()

if __name__ == "__main__":
    pipeline()
AviKatziuk commented 10 months ago

Q [2,f] I would like to use RTDIP components to read from SparkDeltaSource, transform using SEMJsonToPCDMTransformer then write to SparkKafkaEventhubDestination by @Nahrain1

from rtdip_sdk.pipelines.sources.spark.delta import SparkDeltaSource
from rtdip_sdk.pipelines.transformers.spark.sem_json_to_pcdm import SEMJsonToPCDMTransformer
from rtdip_sdk.pipelines.destinations.spark.kafka_eventhub import SparkKafkaEventhubDestination
from rtdip_sdk.pipelines.utilities import SparkSessionUtility
import json

def pipeline():
    spark = SparkSessionUtility(config={}).execute()

    delta_source_configuration = {
        "delta.path": "/path/to/delta_table"
    }

    source_df = SparkDeltaSource(spark, delta_source_configuration).read_batch()
    pcdm_df = SEMJsonToPCDMTransformer(source_df, "body").transform()

    kafka_eventhub_destination_configuration = {
        "kafka.bootstrap.servers": "your.kafka.bootstrap.servers",
        "eventhubs.connectionString": "{EventhubConnectionString}"
    }

    SparkKafkaEventhubDestination(
        spark, data=pcdm_df, options=kafka_eventhub_destination_configuration
    ).write_batch()

if __name__ == "__main__":
    pipeline()
cching95 commented 10 months ago

Q[2,a]: I would like to use RTDIP components to read from PythonDeltaSource, transform using BaseRawToMDMTransformer then write to EVMContractDestination

Unless you are trying to use the pipeline execute components, in which case the import is from rtdip_sdk.pipelines.execute import PipelineJob, PipelineStep, PipelineTask, which creates a job that executes a list of pipeline steps. However, for the purpose of this challenge and given the time constraint, I would suggest sticking to the outcome of 2c (see the note above).
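
For reference, a rough sketch of how such a job could be assembled from the execute components, reusing the Q[2,c] component classes from above. The step/task/job parameter names (component_parameters, provide_output_to_step, batch_task) and the PipelineJobExecute runner are assumptions based on my reading of the RTDIP docs, so check the SDK before relying on them.

from rtdip_sdk.pipelines.execute import PipelineJob, PipelineStep, PipelineTask, PipelineJobExecute
from rtdip_sdk.pipelines.sources import SparkDeltaSource
from rtdip_sdk.pipelines.transformers import BaseRawToMDMTransformer
from rtdip_sdk.pipelines.destinations import SparkKafkaDestination

# Each step wraps one component class plus its parameters, and names the
# step(s) that should receive its output.
step_list = [
    PipelineStep(
        name="read",
        description="Read from a Delta table",
        component=SparkDeltaSource,
        component_parameters={"options": {}, "table_name": "delta_table_name"},
        provide_output_to_step=["transform"]
    ),
    PipelineStep(
        name="transform",
        description="Raw to MDM transformation",
        component=BaseRawToMDMTransformer,
        component_parameters={},  # transformer parameters omitted in this sketch
        provide_output_to_step=["write"]
    ),
    PipelineStep(
        name="write",
        description="Write to Kafka",
        component=SparkKafkaDestination,
        component_parameters={
            "options": {
                "kafka.bootstrap.servers": "localhost:9092",
                "topic": "output_topic"
            }
        }
    )
]

task = PipelineTask(
    name="task",
    description="Batch task running the three steps",
    step_list=step_list,
    batch_task=True
)

job = PipelineJob(
    name="job",
    description="Q[2,c]-style pipeline expressed as a PipelineJob",
    version="0.0.1",
    task_list=[task]
)

PipelineJobExecute(job).run()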

Q[2,b]: I would like to use RTDIP components to read from PythonDeltaSharingSource, transform using SEMJsonToPCDMTransformer then write to SparkDeltaDestination

from rtdip_sdk.pipelines.sources.python.delta_sharing import PythonDeltaSharingSource
from rtdip_sdk.pipelines.transformers.spark.sem_json_to_pcdm import SEMJsonToPCDMTransformer
from rtdip_sdk.pipelines.destinations.spark.delta import SparkDeltaDestination

def pipeline():

    source = PythonDeltaSharingSource(
        profile_path="{CREDENTIAL-FILE-LOCATION}",
        share_name="{SHARE-NAME}",
        schema_name="{SCHEMA-NAME}",
        table_name="{TABLE-NAME}"
    ).read_batch()

    transform = SEMJsonToPCDMTransformer(
        data=source,
        source_column_name="{SOURCE-COLUMN-NAME}",
        version={VERSION-NUMBER},
    ).transform()

    SparkDeltaDestination(
        data=transform,
        options={},
        destination="{DELTA-TABLE-PATH}",
    ).write_batch()

if __name__ == "__main__":
    pipeline()

Q[2,c]: I would like to use RTDIP components to read from SparkDeltaSource, transform using BaseRawToMDMTransformer then write to SparkKafkaDestination

# Not required if using Databricks
spark = SparkSessionUtility(config={}).execute()
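
Putting that together with the Q[2,c] output above (which never creates a Spark session), a corrected version would look roughly like the following. The parameters are taken from the snippets already in this thread rather than re-verified against the SDK, and the empty options dict on SparkDeltaSource is an assumption (the Q[2,f] answer below also passes one).

from rtdip_sdk.pipelines.sources import SparkDeltaSource
from rtdip_sdk.pipelines.transformers import BaseRawToMDMTransformer
from rtdip_sdk.pipelines.destinations import SparkKafkaDestination
from rtdip_sdk.pipelines.utilities import SparkSessionUtility

def pipeline():
    # Not required if using Databricks
    spark = SparkSessionUtility(config={}).execute()

    # Read a batch from the Delta table
    source_df = SparkDeltaSource(
        spark=spark,
        options={},
        table_name="delta_table_name"
    ).read_batch()

    # Raw to MDM transformation
    transform_df = BaseRawToMDMTransformer(
        spark=spark,
        data=source_df
    ).transform()

    # Write the result to Kafka
    SparkKafkaDestination(
        data=transform_df,
        options={
            "kafka.bootstrap.servers": "localhost:9092",
            "topic": "output_topic"
        }
    ).write_batch()

if __name__ == "__main__":
    pipeline()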
Amber-Rigg commented 10 months ago

Q [2,d] I would like to use RTDIP components to read from PythonDeltaSource, transform using BaseRawToMDMTransformer then write to SparkKafkaEventhubDestination

from rtdip_sdk.pipelines.sources import PythonDeltaSource
from rtdip_sdk.pipelines.transformers import BaseRawToMDMTransformer
from rtdip_sdk.pipelines.destinations import SparkKafkaEventhubDestination
from rtdip_sdk.pipelines.utilities import SparkSessionUtility

def pipeline():

    # Not required if using Databricks
    spark = SparkSessionUtility(config={}).execute()

    path = "abfss://{FILE-SYSTEM}@{ACCOUNT-NAME}.dfs.core.windows.net/{PATH}/{FILE-NAME}"

    source_df = PythonDeltaSource(
        path=path,
        version=None,
        storage_options={
            "azure_storage_account_name": "{AZURE-STORAGE-ACCOUNT-NAME}",
            "azure_storage_account_key": "{AZURE-STORAGE-ACCOUNT-KEY}"
        },
        pyarrow_options=None,
        without_files=False
    ).read_batch()

    transform_df = BaseRawToMDMTransformer(
        spark=spark,
        data=source_df
    ).transform()

    connectionString = "Endpoint=sb://{NAMESPACE}.servicebus.windows.net/;SharedAccessKeyName={ACCESS_KEY_NAME};SharedAccessKey={ACCESS_KEY}=;EntityPath={EVENT_HUB_NAME}"

    eventhub_destination = SparkKafkaEventhubDestination(
        spark=spark,
        data=transform_df,
        options={
            "kafka.bootstrap.servers": "host1:port1,host2:port2",
            "eventhubs.connectionString": connectionString
        },
        consumer_group="{YOUR-EVENTHUB-CONSUMER-GROUP}",
        trigger="10 seconds",
        query_name="KafkaEventhubDestination",
        query_wait_interval=None
    )
    eventhub_destination.write_batch()

if __name__ == "__main__":
    pipeline()
Amber-Rigg commented 10 months ago

Q [2,e] I would like to use RTDIP components to read from PythonDeltaSource, transform using PandasToPySparkTransformer then write to SparkDeltaMergeDestination

from rtdip_sdk.pipelines.sources import PythonDeltaSource
from rtdip_sdk.pipelines.transformers import PandasToPySparkTransformer
from rtdip_sdk.pipelines.destinations import SparkDeltaMergeDestination
from rtdip_sdk.pipelines.utilities import SparkSessionUtility

def pipeline():
    source_df = PythonDeltaSource(
        path="path/to/delta/table"
    ).read_batch()

    # Not required if using Databricks
    spark = SparkSessionUtility(config={}).execute()

    transformed_df = PandasToPySparkTransformer(
        data=source_df,
        spark=spark
    ).transform()

    SparkDeltaMergeDestination(
        data=transformed_df,
        options={
            "mergeCondition": "merge_condition",
            "mergeKey": "merge_key",
            "updateColumns": ["col1", "col2"],
            "deleteColumns": ["col3"]
        },
        destination="path/to/merge/destination"
    ).write_batch()

if __name__ == "__main__":
    pipeline()
Amber-Rigg commented 10 months ago

Q [2,f] I would like to use RTDIP components to read from SparkDeltaSource, transform using SEMJsonToPCDMTransformer then write to SparkKafkaEventhubDestination

from rtdip_sdk.pipelines.sources import SparkDeltaSource
from rtdip_sdk.pipelines.transformers import SEMJsonToPCDMTransformer
from rtdip_sdk.pipelines.destinations import SparkKafkaEventhubDestination
from rtdip_sdk.pipelines.utilities import SparkSessionUtility
import json

def pipeline():
    spark = SparkSessionUtility(config={}).execute()

    delta_source_configuration = {
        "delta.path": "/path/to/delta_table"
    }

    source_df = SparkDeltaSource(spark, delta_source_configuration, options={}).read_batch()
    pcdm_df = SEMJsonToPCDMTransformer(source_df, "body", version=10).transform()

    kafka_eventhub_destination_configuration = {
        "kafka.bootstrap.servers": "your.kafka.bootstrap.servers",
        "eventhubs.connectionString": "{EventhubConnectionString}"
    }

    SparkKafkaEventhubDestination(
        spark, data=pcdm_df, options=kafka_eventhub_destination_configuration, consumer_group="{YOUR-EVENTHUB-CONSUMER-GROUP}"
    ).write_batch()

if __name__ == "__main__":
    pipeline()