Accessing/querying network_data & module outputs

CBurge95 commented 7 months ago

Following on from this morning's discussion on module outputs, and the use of tables as repeat information. Rather than using tables to display the updated nodes/edges information after running an operation, are we able to access this information as stored in the network_data object? Can this be queried and filtered, like with SQL? It doesn't make much sense to simply generate this information without being able to view it or do something with it.

This is particularly useful going forward into creating the mini-app: once we've run an operation, we want to be able to preview it, filter it, re-sort it etc.

If we can do this, we can return network_data as the only output for any centrality module, just with the updated information.

makkus commented 7 months ago

That's the plan as of now: returning a new network_data value that has all the same columns, plus one or two additional ones containing centrality information (there are still some open questions in my mind about what exactly to attach, but that's independent of this issue).

How to preview values is of course another question, and probably one of the central usage patterns for any UI we are building. Having all of this in the same (result) network_data value should make things easier though, because that network_data value can be rendered directly into a graph visualization, maybe with the nodes made bigger depending on centrality, or however else you plan to make that intuitive to the user. Having to look up two values to do that would be much harder and messier in terms of frontend code, IMHO.

caro401 commented 7 months ago

Please can I just have a method on network_data that gives me all the data contained in the nodes table, and another method that gives me all the data in the edges table? Or at least a clear set of steps for how to get this using the methods that currently exist on network_data. I don't understand arrow well enough to go through the underlying data types to dig out this information.

I'll deal with visualising it, I just need access to the raw table data in some table-like or array-like or dict-like format. I often want to show the nodes and edges data separately, so separate methods are useful.

For the moment, I don't care about the performance or memory overhead or serialization cost, just the ability to show this data to a Tropy-mini-app user, who I know will have a fairly small data set. The largest data we've ever seen in Tropy is ~100k items, and most projects are in the 100s of items, so getting all data at once is not at all a concern.

makkus commented 7 months ago

Sure, just tell me what format you want. You said you were happy with the Arrow format, otherwise I'd have given you more options.

Pandas DataFrame? Something else?

caro401 commented 7 months ago

I don't know what the possible options are. If you can just tell me how to get all of the data out of arrow in the correct way, that's fine. Data frame is probably also fine, although I thought you were moving to polars now? Equally a json blob or a python list of lists, I don't really care as long as all the data is in it.

makkus commented 7 months ago

Not sure about "correct"; that would probably be using the arrow data directly. But it's tabular data, so in theory we can export it to anything in the exact format you need.

Check out: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html

I guess to_pydict, to_pylist or to_pandas would be closest to what you want? If you want something different, I can accommodate that as well; you just need to let me know what exactly, and what format/schema is best for what you are trying to do.

For the example python code I gave you, accessing the methods will look something like:

edges_table = network_data.get_table("edges")
edges_arrow_table = edges_table.arrow_table

an_easier_format = edges_arrow_table.to_pylist()  # or any of the other methods

Alternatively, the KiaraTable class also has some to_* methods. They mostly just forward to the respective arrow/polars methods, so you might as well use the Arrow table directly as described above; either way is fine:

https://github.com/DHARPA-Project/kiara_plugin.tabular/blob/develop/src/kiara_plugin/tabular/models/table.py#L154

edges_table = network_data.get_table("edges")
# Polars this time, just to shake things up; pylist, dict or pandas are equally possible.
# The pandas one has some args that let you restrict which columns to get, if you ever need that.
edges_polars_df = edges_table.to_polars_dataframe()

All the non-arrow ones will definitely load all data into memory, but as long as you don't care... For use-cases like this I don't either; that matters more within a kiara module's process method than in something like this. Also, I'd imagine that further down the line, once you have a preview strategy, you'll probably change how you access the data anyway.
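On the earlier question about querying and filtering "like with SQL": once the rows are out as a list of dicts, one possible approach (an assumption for illustration, not something kiara provides as far as this thread shows) is to load them into an in-memory SQLite database from the Python standard library. The rows and column names below are made up, in the shape to_pylist() returns.

```python
import sqlite3

# Made-up rows in the list-of-dicts shape returned by to_pylist().
rows = [
    {"source": 1, "target": 2, "weight": 0.5},
    {"source": 1, "target": 3, "weight": 1.0},
    {"source": 2, "target": 3, "weight": 2.0},
]

# A throwaway in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source INTEGER, target INTEGER, weight REAL)")

# Named placeholders are filled from the keys of each row dict.
conn.executemany("INSERT INTO edges VALUES (:source, :target, :weight)", rows)

# Filter and re-sort with plain SQL.
heavy_edges = conn.execute(
    "SELECT source, target, weight FROM edges "
    "WHERE weight >= ? ORDER BY weight DESC",
    (1.0,),
).fetchall()
conn.close()

print(heavy_edges)
```

This copies the data once more, so it only makes sense for small tables like the ones discussed here.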

DHARPA-Project / kiara-website

Accessing/querying network_data & module outputs #20