Green-Software-Foundation / if

Impact Framework
https://if.greensoftware.foundation/
MIT License

CSV Exporting Functionality #411

Closed jawache closed 7 months ago

jawache commented 8 months ago

Story

As a user, I want to export the entire graph as a CSV to analyze the data in other applications.

Rationale

Trying to navigate and understand the impacts of your software application by looking at YAML is very challenging. By exporting it into the flat table structure of a CSV, a human can better understand how the impacts are broken down by component and time.

Implementation details

This is an extension to IF that outputs the tree in a CSV format.

ie --manifest --exhaust csv --filter comma,separated,list,of,parameters,to,export

The manifest is a YAML file that has already been processed by the impact engine, like so:


graph:
  outputs: # total per time bucket for the whole graph
    - timestamp: 2023-01-24T11:00
      duration: 360        
      carbon: 19
    - timestamp: 2023-01-24T11:05
      duration: 360        
      carbon: 12
    - timestamp: 2023-01-24T11:10
      duration: 360        
      carbon: 12
  aggregated:
    carbon: 43        
  children:
    application:
      outputs: # aggregated up the tree for every grouping node
        - timestamp: 2023-01-24T11:00
          duration: 360        
          carbon: 19
        - timestamp: 2023-01-24T11:05
          duration: 360        
          carbon: 12
        - timestamp: 2023-01-24T11:10
          duration: 360        
          carbon: 12      
      aggregated:
        carbon: 43          
      children:
        vm1:
          outputs: 
            - timestamp: 2023-01-24T11:00
              duration: 360        
              carbon: 3
            - timestamp: 2023-01-24T11:05
              duration: 360        
              carbon: 4
            - timestamp: 2023-01-24T11:10
              duration: 360        
              carbon: 5                  
          aggregated:
            carbon: 12
        vm2:
          outputs: 
            - timestamp: 2023-01-24T11:00
              duration: 360        
              carbon: 4
            - timestamp: 2023-01-24T11:05
              duration: 360        
              carbon: 5
            - timestamp: 2023-01-24T11:10
              duration: 360        
              carbon: 6    
          aggregated:
            carbon: 15                            
        vm3:
          outputs: 
            - timestamp: 2023-01-24T11:00
              duration: 360        
              carbon: 12
            - timestamp: 2023-01-24T11:05
              duration: 360        
              carbon: 3
            - timestamp: 2023-01-24T11:10
              duration: 360        
              carbon: 1                                                
          aggregated:
            carbon: 16                                        

The output CSV file should be of this format:

Path Aggregated 2023-01-24T11:00 2023-01-24T11:05 2023-01-24T11:10
graph.carbon 43 19 12 12
graph.children.application.carbon 43 19 12 12
graph.children.application.children.vm1.carbon 12 3 4 5
graph.children.application.children.vm2.carbon 15 4 5 6
graph.children.application.children.vm3.carbon 16 12 3 1
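
As literal CSV text (the same values as the table above, just comma-separated), that would be:

Path,Aggregated,2023-01-24T11:00,2023-01-24T11:05,2023-01-24T11:10
graph.carbon,43,19,12,12
graph.children.application.carbon,43,19,12,12
graph.children.application.children.vm1.carbon,12,3,4,5
graph.children.application.children.vm2.carbon,15,4,5,6
graph.children.application.children.vm3.carbon,16,12,3,1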

How to choose the path?

The path column should be a JavaScript-like path, which we can use to easily identify the node in the graph that this parameter relates to.

NOTE: it might be redundant to have children repeated so many times in the key; we may consider stripping it out for brevity and ease of reading (e.g. graph.application.vm1.carbon instead of graph.children.application.children.vm1.carbon).

To aggregate or not?

If aggregated data is present, it should be added to the Aggregated column, the first column after Path.

What if the data is not time-synchronized?

We have a problem! If aggregated data is present then maybe just print out that column, but it's probably better to error out, since the per-timestamp columns won't make sense.
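
A minimal sketch of that check, assuming each node's outputs is an array of objects with a timestamp field (the helper name is illustrative, not part of IF):

// Two series can share the same time-bucket columns only if their timestamps
// match position by position; otherwise the exporter should refuse to run.
function sameTimeBuckets(a: {timestamp: string}[], b: {timestamp: string}[]): boolean {
  return a.length === b.length && a.every((entry, i) => entry.timestamp === b[i].timestamp);
}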

Priority

4/5. This tool would be very useful for debugging and testing other features in IF, including aggregation.

Scope

This is a tool external to IF, so other than some docs it won't affect other things too much.

Size

Perhaps several days, including testing.

What does "done" look like?

Does this require updates to documentation or other materials?

It will need documentation.

What testing is required?

Yes, testing against a variety of different graph types.

Is this a known/expected update?

Related to this https://github.com/Green-Software-Foundation/if/issues/298

jmcook1186 commented 8 months ago

Bumping to sprint 7 - this should be an exhaust plugin

pazbardanl commented 8 months ago

Bumping to sprint 7 - this should be an exhaust plugin

Already WIP, in draft PR: https://github.com/Green-Software-Foundation/if/pull/441

pazbardanl commented 7 months ago

@narekhovhannisyan @jmcook1186 I think this one could be closed, as https://github.com/Green-Software-Foundation/if/pull/441 is merged. Right?

jawache commented 7 months ago

Thanks for working on this @pazbardanl; unfortunately, I don't think we're quite done yet ;) I took a look and can see two issues: the first is with the plugin itself, and the second is that how we've architected the exhaust functionality is at odds with the issue as it's specced out above.

It looks like we've taken the work in the pipeline csv plugin and mirrored it here in the exhaust functionality to work on the whole tree; this results in an export like the one below: full-manifest.out.yml.csv. NOTE: the export is a little broken because the physical-processor field also has a , in it ;) but don't worry about that, because the bigger issue is that this ticket requires the content to be output very differently.

This is the expected output: one row per parameter per node, with the time buckets as columns.

Path Aggregated 2023-01-24T11:00 2023-01-24T11:05 2023-01-24T11:10
graph.carbon 43 19 12 12
graph.children.application.carbon 43 19 12 12
graph.children.application.children.vm1.carbon 12 3 4 5
graph.children.application.children.vm2.carbon 15 4 5 6
graph.children.application.children.vm3.carbon 16 12 3 1

This is the actual output; it's effectively transposed: the parameters are the columns and the time buckets are the rows.

id timestamp duration cloud/instance-type cloud/vendor cloud/region cpu/utilization grid/carbon-intensity
children.application.children.uk-west.children.server-1.outputs.0 2024-02-26 00:00:00 60 Standard_A1_v2 azure uk-west 89 250
children.application.children.uk-west.children.server-1.outputs.1 2024-02-26 00:01:00 60 Standard_A1_v2 azure uk-west 59 250
children.application.children.uk-west.children.server-1.outputs.2 2024-02-26 00:02:00 60 Standard_A1_v2 azure uk-west 45 250
children.application.children.uk-west.children.server-1.outputs.3 2024-02-26 00:03:00 60 Standard_A1_v2 azure uk-west 21 250
children.application.children.uk-west.children.server-1.outputs.4 2024-02-26 00:04:00 60 Standard_A1_v2 azure uk-west 89 250
children.application.children.uk-west.children.server-1.outputs.5 2024-02-26 00:05:00 60 Standard_A1_v2 azure uk-west 92 250
children.application.children.uk-west.children.server-1.outputs.6 2024-02-26 00:06:00 60 Standard_A1_v2 azure uk-west 91 250

There is also no place for the aggregated values, either horizontally or vertically.

Recommended pseudo-code

See the tree in the issue above as an example; the pseudo-code for how this CSV exporter should work is along the lines of the sketch below.
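
A minimal TypeScript sketch of that traversal, using illustrative type and function names rather than IF's actual exhaust API: walk the tree depth-first and, for each node with outputs, emit one row per exported parameter, with the aggregated value first and then one value per time bucket.

type TimeSeriesEntry = {timestamp: string; duration: number; [param: string]: string | number};

interface TreeNode {
  outputs?: TimeSeriesEntry[];          // one entry per time bucket
  aggregated?: Record<string, number>;  // totals across all time buckets
  children?: Record<string, TreeNode>;
}

// Depth-first walk: emit "<path>.<param>, <aggregated>, <value@t0>, <value@t1>, ..." per parameter.
function buildRows(node: TreeNode, path: string, params: string[], rows: string[][]): void {
  if (node.outputs) {
    for (const param of params) {
      const aggregated = node.aggregated?.[param] ?? '';
      const values = node.outputs.map(output => String(output[param] ?? ''));
      rows.push([`${path}.${param}`, String(aggregated), ...values]);
    }
  }
  for (const [name, child] of Object.entries(node.children ?? {})) {
    buildRows(child, `${path}.children.${name}`, params, rows);
  }
}

// Header is Path, Aggregated, then one column per timestamp taken from the root's outputs.
// e.g. exportCsv(manifest.graph, ['carbon']) would reproduce the expected table above.
function exportCsv(root: TreeNode, params: string[]): string {
  const timestamps = (root.outputs ?? []).map(output => output.timestamp);
  const rows: string[][] = [['Path', 'Aggregated', ...timestamps]];
  buildRows(root, 'graph', params, rows);
  return rows.map(row => row.join(',')).join('\n');
}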

Architecture changes

I'm going to create another ticket and reference it here, but for CSV we need to be able to pass in the fields we want to export as well as the filename (on the command line), so a few changes are needed there. cc @narekhovhannisyan
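
For illustration only (the command name, file paths, and the --output flag are placeholders; --exhaust and --filter come from the issue description), the invocation could look something like:

ie --manifest processed-manifest.yml --exhaust csv --filter carbon,energy --output tree.csv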

pazbardanl commented 7 months ago

@jawache I'm still reading through this to make sure I capture everything. My only concern is that while the CSV structure is well-organized and comprehensive (it captures both raw and aggregated outputs), it might not be useful for visualization as it is. For example: if I open this table in Excel and try to create a quick line chart from it, I might find it difficult, since there is no column that holds the timestamps and can act as the horizontal-axis values. Another example is Grafana: trying to use this CSV file as a data source for Grafana will also be tricky, since there is no timestamp column.

A simple fix would be to transpose the table, with the paths as column names (which makes sense, as they represent calculated values). The only issue with that is that the "aggregated" row would be perceived as a timestamp, which is not the case, and might be visualized as a weird, unexplainable spike at the start or end of the chart.

So, bottom line, this is what I propose, although it makes our lives harder: maybe we can have the CSV exporter support both modes:

  1. Aggregation (couldn't find a better name for it) - generate a table with aggregated values, identical to the one you put in your comment.
  2. Visualization - generate a table that's transposed (and thus has a timestamp column) and does not have an 'aggregated' column. This one would also separate different children into different tables, so that each can have its own chart (see the example table after this list).
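
For illustration, a single transposed 'visualization' table for the example tree in the issue description would look something like this (per the proposal it could also be split into one table per child):

timestamp,graph.carbon,graph.children.application.carbon,graph.children.application.children.vm1.carbon,graph.children.application.children.vm2.carbon,graph.children.application.children.vm3.carbon
2023-01-24T11:00,19,19,3,4,12
2023-01-24T11:05,12,12,4,5,3
2023-01-24T11:10,12,12,5,6,1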
jawache commented 7 months ago

@pazbardanl interesting, I hadn't considered it from the Grafana angle.

For me this CSV format is mostly for manual human consumption: it's important both to understand and see the aggregated numbers and to have a tool that helps rationalize what the manifest is computing. For trivial cases that's doable in the YAML directly, but even for medium use cases with several components it's important to have a way to look at the numbers and ask whether they just make sense or you screwed up the file somewhere, or to quickly see how the numbers aggregate and break down.

If, however, we need another format for Grafana visualization, that makes sense also.

Maybe rather than overloading one function we just have two built-in exporters:

csv (my version)
csv-raw (your version)

pazbardanl commented 7 months ago

@jawache OK, so I think I am finally on the same page with you about the 'human' use case (sorry it took so long..). Agreed - we should probably have two of those: csv, for validation by humans; csv-raw, for visualization by tools such as Excel and Grafana.

I think the last detail I'm missing is priority - I'm guessing the csv one is more urgent for the hackathon, right? We've got a simple HTML exporter for easy visualization, so maybe csv-raw can wait? cc @jmcook1186