Green-Software-Foundation / if

Impact Framework
https://if.greensoftware.foundation/
MIT License

CSV Exporting Functionality #411

Closed jawache closed 7 months ago

jawache commented 8 months ago

Story

As a user, I want to export the entire graph as a CSV to analyze the data in other applications.

Rationale

Trying to navigate and understand the impacts of your software application by looking at YAML is very challenging. By exporting it into the flat table structure of a CSV, a human can better understand how the impacts are broken down by component and time.

Implementation details

This is an extension to IF that outputs the tree in a CSV format.

ie --manifest --exhaust csv --filter comma,separated,list,of,parameters,to,export

The manifest is a YAML file that has already been processed by the impact engine, like so:


graph:
  outputs: # total per time bucket for the whole graph
    - timestamp: 2023-01-24T11:00
      duration: 360        
      carbon: 19
    - timestamp: 2023-01-24T11:05
      duration: 360        
      carbon: 12
    - timestamp: 2023-01-24T11:10
      duration: 360        
      carbon: 12
  aggregated:
    carbon: 43        
  children:
    application:
      outputs: # aggregated up the tree for every grouping node
        - timestamp: 2023-01-24T11:00
          duration: 360        
          carbon: 19
        - timestamp: 2023-01-24T11:05
          duration: 360        
          carbon: 12
        - timestamp: 2023-01-24T11:10
          duration: 360        
          carbon: 12      
      aggregated:
        carbon: 43          
      children:
        vm1:
          outputs: 
            - timestamp: 2023-01-24T11:00
              duration: 360        
              carbon: 3
            - timestamp: 2023-01-24T11:05
              duration: 360        
              carbon: 4
            - timestamp: 2023-01-24T11:10
              duration: 360        
              carbon: 5                  
          aggregated:
            carbon: 12
        vm2:
          outputs: 
            - timestamp: 2023-01-24T11:00
              duration: 360        
              carbon: 4
            - timestamp: 2023-01-24T11:05
              duration: 360        
              carbon: 5
            - timestamp: 2023-01-24T11:10
              duration: 360        
              carbon: 6    
          aggregated:
            carbon: 15                            
        vm3:
          outputs: 
            - timestamp: 2023-01-24T11:00
              duration: 360        
              carbon: 12
            - timestamp: 2023-01-24T11:05
              duration: 360        
              carbon: 3
            - timestamp: 2023-01-24T11:10
              duration: 360        
              carbon: 1                                                
          aggregated:
            carbon: 16                                        

The output CSV file should be of this format:

Path Aggregated 2023-01-24T11:00 2023-01-24T11:05 2023-01-24T11:10
graph.carbon 43 19 12 12
graph.children.application.carbon 43 19 12 12
graph.children.application.children.vm1.carbon 12 3 4 5
graph.children.application.children.vm2.carbon 15 4 5 6
graph.children.application.children.vm3.carbon 16 12 3 1
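
As literal CSV text (the same values as the table above, just comma-separated), that would be:

Path,Aggregated,2023-01-24T11:00,2023-01-24T11:05,2023-01-24T11:10
graph.carbon,43,19,12,12
graph.children.application.carbon,43,19,12,12
graph.children.application.children.vm1.carbon,12,3,4,5
graph.children.application.children.vm2.carbon,15,4,5,6
graph.children.application.children.vm3.carbon,16,12,3,1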

How to choose the path?

The path column should be a JavaScript-like path, which we can use to easily identify the node in the graph that this parameter relates to.

NOTE: it might be redundant to have children repeated so many times in the key; we may consider stripping it out for brevity and ease of reading (e.g. graph.application.vm1.carbon instead of graph.children.application.children.vm1.carbon).

To aggregate or not?

If aggregated data is present, it should be added to the Aggregated column, the first column after Path.

What if the data is not time-synchronized?

We have a problem! If aggregated data is present then maybe just print out that column, but it's probably better to error out, since the per-timestamp columns won't make sense.
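
A minimal sketch of that check, assuming each node's outputs is an array of objects with a timestamp field (the helper name is illustrative, not part of IF):

// Two series can share the same time-bucket columns only if their timestamps
// match position by position; otherwise the exporter should refuse to run.
function sameTimeBuckets(a: {timestamp: string}[], b: {timestamp: string}[]): boolean {
  return a.length === b.length && a.every((entry, i) => entry.timestamp === b[i].timestamp);
}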

Priority

4/5. This tool would be very useful for debugging and testing other features in IF, including aggregation.

Scope

This is a tool external to IF, so other than some docs it won't affect other things too much.

Size

Perhaps several days, including testing.

What does "done" look like?

Does this require updates to documentation or other materials?

It will need documentation.

What testing is required?

Yes, testing against a variety of different graph types.

Is this a known/expected update?

Related to this https://github.com/Green-Software-Foundation/if/issues/298

jmcook1186 commented 8 months ago

Bumping to sprint 7 - this should be an exhaust plugin

pazbardanl commented 8 months ago

Bumping to sprint 7 - this should be an exhaust plugin

Already WIP, in draft PR: https://github.com/Green-Software-Foundation/if/pull/441

pazbardanl commented 7 months ago

@narekhovhannisyan @jmcook1186 I think this one could be closed, as https://github.com/Green-Software-Foundation/if/pull/441 is merged. Right?

jawache commented 7 months ago

Thanks for working on this @pazbardanl; unfortunately, I don't think we're quite done yet ;) I took a look and can see two issues: the first is with the plugin itself, and the second is that how we've architected the exhaust functionality is at odds with the issue as it's specced out above.

It looks like we've taken the work in the pipeline csv plugin and mirrored it here in the exhaust functionality to work on the whole tree; this results in an export like the one below: full-manifest.out.yml.csv. NOTE: the export is a little broken because the physical-processor field also has a , in it ;) but don't worry about that, because the bigger issue is that this ticket requires the content to be output very differently.

This is the expected output: one row per parameter per node, with the time buckets as columns.

Path Aggregated 2023-01-24T11:00 2023-01-24T11:05 2023-01-24T11:10
graph.carbon 43 19 12 12
graph.children.application.carbon 43 19 12 12
graph.children.application.children.vm1.carbon 12 3 4 5
graph.children.application.children.vm2.carbon 15 4 5 6
graph.children.application.children.vm3.carbon 16 12 3 1

This is the actual output; it's effectively transposed: the parameters are the columns and the time buckets are the rows.

id timestamp duration cloud/instance-type cloud/vendor cloud/region cpu/utilization grid/carbon-intensity
children.application.children.uk-west.children.server-1.outputs.0 2024-02-26 00:00:00 60 Standard_A1_v2 azure uk-west 89 250
children.application.children.uk-west.children.server-1.outputs.1 2024-02-26 00:01:00 60 Standard_A1_v2 azure uk-west 59 250
children.application.children.uk-west.children.server-1.outputs.2 2024-02-26 00:02:00 60 Standard_A1_v2 azure uk-west 45 250
children.application.children.uk-west.children.server-1.outputs.3 2024-02-26 00:03:00 60 Standard_A1_v2 azure uk-west 21 250
children.application.children.uk-west.children.server-1.outputs.4 2024-02-26 00:04:00 60 Standard_A1_v2 azure uk-west 89 250
children.application.children.uk-west.children.server-1.outputs.5 2024-02-26 00:05:00 60 Standard_A1_v2 azure uk-west 92 250
children.application.children.uk-west.children.server-1.outputs.6 2024-02-26 00:06:00 60 Standard_A1_v2 azure uk-west 91 250

There is also no place for the aggregated values, either horizontally or vertically.

Recommended pseudo-code

See the tree in the issue above as an example; the pseudo-code for how this CSV exporter should work is along the lines of the sketch below.
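
A minimal TypeScript sketch of that traversal, using illustrative type and function names rather than IF's actual exhaust API: walk the tree depth-first and, for each node with outputs, emit one row per exported parameter, with the aggregated value first and then one value per time bucket.

type TimeSeriesEntry = {timestamp: string; duration: number; [param: string]: string | number};

interface TreeNode {
  outputs?: TimeSeriesEntry[];          // one entry per time bucket
  aggregated?: Record<string, number>;  // totals across all time buckets
  children?: Record<string, TreeNode>;
}

// Depth-first walk: emit "<path>.<param>, <aggregated>, <value@t0>, <value@t1>, ..." per parameter.
function buildRows(node: TreeNode, path: string, params: string[], rows: string[][]): void {
  if (node.outputs) {
    for (const param of params) {
      const aggregated = node.aggregated?.[param] ?? '';
      const values = node.outputs.map(output => String(output[param] ?? ''));
      rows.push([`${path}.${param}`, String(aggregated), ...values]);
    }
  }
  for (const [name, child] of Object.entries(node.children ?? {})) {
    buildRows(child, `${path}.children.${name}`, params, rows);
  }
}

// Header is Path, Aggregated, then one column per timestamp taken from the root's outputs.
// e.g. exportCsv(manifest.graph, ['carbon']) would reproduce the expected table above.
function exportCsv(root: TreeNode, params: string[]): string {
  const timestamps = (root.outputs ?? []).map(output => output.timestamp);
  const rows: string[][] = [['Path', 'Aggregated', ...timestamps]];
  buildRows(root, 'graph', params, rows);
  return rows.map(row => row.join(',')).join('\n');
}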

Architecture changes

I'm going to create another ticket and reference it here, but for CSV we need to be able to pass in the fields we want to export as well as the filename (on the command line), so a few changes are needed there. cc @narekhovhannisyan
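
For illustration only (the command name, file paths, and the --output flag are placeholders; --exhaust and --filter come from the issue description), the invocation could look something like:

ie --manifest processed-manifest.yml --exhaust csv --filter carbon,energy --output tree.csv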

pazbardanl commented 7 months ago

@jawache I'm still reading through this to make sure I capture everything. My only concern is that while the CSV structure is well-organized and comprehensive (it captures both raw and aggregated outputs), it might not be useful for visualization as it is. For example: if I open this table in Excel and try to create a quick line chart from it, I might find it difficult, since there is no column that holds the timestamps and can act as the horizontal-axis values. Another example is Grafana: trying to use this CSV file as a data source for Grafana will also be tricky, since there is no timestamp column.

A simple fix would be to transpose the table, with the paths as column names (which makes sense, as they represent calculated values). The only issue with that is that the "aggregated" row would be perceived as a timestamp, which is not the case, and might be visualized as a weird, unexplainable spike at the start or end of the chart.

So, bottom line, this is what I propose, although it makes our lives harder: maybe we can have the CSV exporter support both modes:

  1. Aggregation (couldn't find a better name for it) - generate a table with aggregated values, identical to the one you put in your comment.
  2. Visualization - generate a table that's transposed (and thus has a timestamp column) and does not have an 'aggregated' column. This one would also separate different children into different tables, so that each can have its own chart (see the example table after this list).
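
For illustration, a single transposed 'visualization' table for the example tree in the issue description would look something like this (per the proposal it could also be split into one table per child):

timestamp,graph.carbon,graph.children.application.carbon,graph.children.application.children.vm1.carbon,graph.children.application.children.vm2.carbon,graph.children.application.children.vm3.carbon
2023-01-24T11:00,19,19,3,4,12
2023-01-24T11:05,12,12,4,5,3
2023-01-24T11:10,12,12,5,6,1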
jawache commented 7 months ago

@pazbardanl interesting, I hadn't considered it from the Grafana angle.

For me this CSV format is mostly for manual human consumption: it's important both to understand and see the aggregated numbers and to have a tool that helps rationalize what the manifest is computing. For trivial cases that's doable in the YAML directly, but even for medium use cases with several components it's important to have a way to look at the numbers and ask whether they just make sense or you screwed up the file somewhere, or to quickly see how the numbers aggregate and break down.

If, however, we need another format for Grafana visualization, that makes sense also.

Maybe rather than overloading one function we just have two built-in exporters:

csv (my version)
csv-raw (your version)

pazbardanl commented 7 months ago

@jawache OK, so I think I am finally on the same page with you about the 'human' use case (sorry it took so long..). Agreed - we should probably have two of those: csv, for validation by humans; csv-raw, for visualization by tools such as Excel and Grafana.

I think the last detail I'm missing is priority - I'm guessing the csv one is more urgent for the hackathon, right? We've got a simple HTML exporter for easy visualization, so maybe csv-raw can wait? cc @jmcook1186