DHARPA-Project / kiara-website

Creative Commons Zero v1.0 Universal

How to safely re-"paste" a column after using the table.pick.column operation? #18

Open MariellaCC opened 7 months ago

MariellaCC commented 7 months ago

@makkus

Is there a recommended way to safely re-add a column to a table after using the table.pick.column operation?

Example of why this may be needed: some operations (e.g. the current version of the tokenize.texts_array module in the Kiara language_processing plugin) require an array as module input. Consequently, the table containing the texts needs to be disassembled via a table.pick.column operation to get an array of texts before the tokenize.texts_array module can be used. At a later stage in the pipeline, a preview of the processed array needs to be displayed in the context of the original table. Should the assemble.tables operation be used to re-assemble the table? Does this operation guarantee that the original table and the column are put back together correctly, or is there an alternative way to proceed?

makkus commented 7 months ago

Good question. Short answer: there is the 'table.merge' module (see kiara module explain table.merge).

Long answer: this is a bit more complicated than it would seem. The 'table.merge' module is currently not used in any operation, because I haven't thought through all the implications and was waiting for some use cases before working on it properly. The main problem is that merging tables/arrays does not have an obvious number of inputs. Each table/array you want to include needs its own input field on the operation, but since we don't know the number of tables/arrays in advance, we can't hard-code that in the get_inputs_schema method. That means no operation for now, just a module that you can configure on a case-by-case basis. I imagine we will end up with a few 'base' operations later on, which all use this module under the hood:

From that, users can assemble any sort of table by chaining the operations. That is not ideal, though, because it blows up the lineage with a number of steps when a single one would do. And except for interactive use cases where we don't know in advance how many tables/arrays we have to deal with, we can just use the module directly (for example in declarative pipelines), so it's not all that pressing for now.

Anyway, here's some example code that should outline how you would do it in Python, happy to answer follow-up questions:

from kiara.api import KiaraAPI
from kiara.utils.cli import terminal_print
from kiara_plugin.tabular.models.table import KiaraTable

kiara = KiaraAPI.instance()

# assumes a table value named "nodes" is already available in the kiara context
nodes_table = kiara.get_value("nodes")

# pick the "City" column; the job result holds it as an 'array' value
pick_input = {
    "table": nodes_table,
    "column_name": "City"
}
pick_result = kiara.run_job("table.pick.column", pick_input)

# info for 'table.merge' module
merge_module_info = kiara.retrieve_module_type_info("table.merge")
print("The module info:")
terminal_print(merge_module_info)

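# there is no ready-made operation for 'table.merge', so configure the module
# manually, with one input field per table/array that should be merged
join_to_table_op = {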
    "module_type": "table.merge",
    "module_config": {
        "inputs_schema": {
            "orig_table": {
                "type": "table",
                "doc": "The table to add the column to."
            },
            "processed_column": {
                "type": "array",
                "doc": "The array to add as a column to the table."
            }
        }
    }
}

# create an operation from the manual module configuration
op = kiara.get_operation(join_to_table_op)
print("The info for the dynamically created operation:")
terminal_print(op)

join_inputs = {
    "orig_table": nodes_table,
    "processed_column": pick_result["array"]
}
join_result = kiara.run_job(operation=op, inputs=join_inputs)
joined_table: KiaraTable = join_result["table"].data
print("The resulting table:")
print(joined_table.to_pandas_dataframe())

(There is a 'column_map' config that lets you control how the added columns are named, but it raised an exception when I tried it, so I'll need to look into that and fix it.)
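Since the number of inputs is not known in advance, the manual configuration shown above could also be generated by a small helper. The following is only a sketch under stated assumptions: `build_merge_config` is a hypothetical name and not part of kiara, it only assembles a plain dictionary in the same shape as `join_to_table_op`, and restricting the input types to 'table' and 'array' is an assumption based on the description of the module above.

```python
from typing import Any, Dict, Mapping, Tuple


def build_merge_config(fields: Mapping[str, Tuple[str, str]]) -> Dict[str, Any]:
    """Build a 'table.merge' operation config for any number of inputs.

    `fields` maps an input field name to a (value_type, doc) pair.
    Hypothetical helper: it only assembles a dict and never calls kiara.
    """
    # Assumption: the module merges tables and arrays, nothing else.
    allowed_types = {"table", "array"}
    inputs_schema: Dict[str, Dict[str, str]] = {}
    for field_name, (value_type, doc) in fields.items():
        if value_type not in allowed_types:
            raise ValueError(f"Unsupported type for '{field_name}': {value_type}")
        inputs_schema[field_name] = {"type": value_type, "doc": doc}
    if not inputs_schema:
        raise ValueError("At least one input field is required.")
    return {
        "module_type": "table.merge",
        "module_config": {"inputs_schema": inputs_schema},
    }


# Rebuilds the same configuration as the hand-written dict in the example.
join_to_table_op = build_merge_config(
    {
        "orig_table": ("table", "The table to add the column to."),
        "processed_column": ("array", "The array to add as a column to the table."),
    }
)
```

The resulting dictionary would then be passed to `kiara.get_operation(...)` in the same way as the hand-written one.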

makkus commented 7 months ago

(Also, come to think of it: it would probably be useful to let users choose the names of the newly added columns directly, as an option, in addition to hard-configuring them. This is another feature I'd still need to implement, and it might affect the overall design of the module.)

MariellaCC commented 7 months ago

Thank you, I will try it that way.

For such a case, do you think it's best to pass an array (rather than a table) as an input when the operation is performed on only one column of a table? I ask because the analysis often requires seeing and comparing things in their context (here, the context is the table).

makkus commented 7 months ago

Not sure, I think it depends on the context and on what you are trying to achieve. I'd imagine most of the patterns I thought about would be frontend-dependent. You could compare by displaying the old/new values side by side as arrays, or in the same table.

I haven't really given much thought to how any of this would be used in an exploratory style, like with Jupyter. The considerations would be quite different, because UI frontends have very particular requirements. Using kiara exploratory-style via code is probably pretty clumsy and annoying, since you lose a lot of flexibility. So I reckon we'll have to gain some experience before we arrive at recommendations for patterns like this.
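To make the side-by-side idea concrete without any frontend, here is a minimal sketch in plain Python. It does not use kiara at all: the table is modelled as a dict of equal-length lists, and `attach_column`, the sample data, and the whitespace tokenizer are hypothetical stand-ins. It assumes the processing step yields exactly one output per input row, in the same order, which is what makes re-attaching the column by position safe.

```python
from typing import Any, Dict, List

Table = Dict[str, List[Any]]


def attach_column(table: Table, name: str, values: List[Any]) -> Table:
    """Return a copy of `table` with `values` added as column `name`.

    Rows are matched purely by position, so the lengths must agree.
    """
    if name in table:
        raise ValueError(f"Column '{name}' already exists.")
    row_counts = {len(column) for column in table.values()}
    if row_counts and row_counts != {len(values)}:
        raise ValueError("Column length does not match the table's row count.")
    new_table = {key: list(column) for key, column in table.items()}
    new_table[name] = list(values)
    return new_table


# Made-up sample data, for illustration only.
nodes = {"Id": [1, 2, 3], "City": ["New York", "Berlin", "Rio de Janeiro"]}

# Stand-in for 'table.pick.column' followed by a row-wise processing step.
cities = nodes["City"]
tokens = [text.split() for text in cities]

with_tokens = attach_column(nodes, "City_tokens", tokens)

# Old and new values side by side, row by row.
for old, new in zip(with_tokens["City"], with_tokens["City_tokens"]):
    print(f"{old!r} -> {new!r}")
```

The length check is the only safeguard here. A processing step that drops or reorders rows would need an explicit key column instead of positional matching.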

MariellaCC commented 7 months ago

Alright, I understand that it will also depend on frontend requirements.

So maybe, @caro401, you have (or will have in the future) insights to share about this specific question (column-type versus table-type inputs/outputs in modules)?