ConsumerDataStandardsAustralia / standards-maintenance

This repository houses the interactions, consultations and work management to support the maintenance of baselined components of the Consumer Data Right API Standards and Information Security profile.

Get Usage For ... Shared Responsibility APIs Payload size #514

Closed OpalRussAEMO closed 1 year ago

OpalRussAEMO commented 2 years ago

Description

The payload size for the shared responsibility APIs 'Get Usage For Service Point (SR)' and 'Get Usage For Specific Service Points (SR)' will be quite substantial with the currently proposed format.

I have attached two mocked-up example JSON payloads for a single NMI, single-register interval meter capturing data in 5-minute intervals: one without a quality flag and one with a quality flag. Uncompressed, the file sizes are 14.5MB without the quality flag and 24.1MB with it. Zipped, they are 746KB without data quality and 912KB with data quality. Where a service point has a dedicated circuit (controlled load) register or a solar register, this will double (or triple where both are applicable) the amount of data to return.

Of concern is the ability to meet NFRs when transferring such a payload, as well as the energy market participants' market network bandwidth that would be consumed and the associated costs.

To determine whether this impact will be material for retailers in the context of meter data received via B2B, I have estimated the file size required to send 1 million interval days' worth of 5-minute data via a NEM12 file (the B2B format for sending meter data).

A file with 1000 days of 5-minute intervals came to approximately 2.38MB uncompressed and 31KB compressed. To send 1 million days of 5-minute intervals would be around 2,375MB uncompressed or 31MB compressed.

Based on the mock CDR usage file size, responding to only 100 CDR energy usage requests for 2 years of 5-minute interval data in a single register / single meter scenario would be 2,116MB uncompressed or 85MB compressed.

intervalData730DQ.zip intervalData730.zip

Area Affected

Shared Responsibility APIs - Get Usage for Service Point (SR) and Get Usage For Specific Service Points (SR)

https://consumerdatastandardsaustralia.github.io/standards/#get-usage-for-service-point-sr https://consumerdatastandardsaustralia.github.io/standards/#get-usage-for-specific-service-points-sr

Change Proposed

Issue raised for discussion and comment.

The current format requires repetition of Service Point, Meter & Register configuration details for every day of interval data (assumed to be due to the requirement to provide reads in date order). Separating out quality from interval values and making interval values a list would reduce the file size somewhat.

OpalRussAEMO commented 2 years ago

intervalData730b.zip

Following feedback, I've recreated the data payload with whitespace removed, which has greatly reduced the file size. Apologies for the incorrect format. It now comes down to 4MB unzipped and 717KB zipped for a single register over 730 days of 5-minute intervals, with no data quality included.

perlboy commented 2 years ago

Thanks for taking this on board, Opal, and for the initial analysis. Some baseline feedback below; I'll come back with potential optimisations soon.

Throughput Calculations

To calculate the absolute worst case for a single register with 5-minute intervals, assume the following:

Raw compressed size of 1 second of throughput: 300 Requests × 609KB = 178MB of data
Time Budget (4.5sec) to Throughput: 178MB / 4.5 seconds = 39MB/sec = 315mbit of usage
Total maximum capital expenditure to retailers at maximum throughput: 315mbit × $51K/year = $16.06mil per year in access costs levied on the total retailer market
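
For reference, a quick sketch of that arithmetic (assumptions as above: 300 requests/sec, a 609KB compressed payload, a 4.5 second time budget and roughly $51K per mbit per year; small differences from the quoted figures are rounding):

# Rough worst-case throughput and cost model for the figures above (sketch only)
use strict;
use warnings;

my $requests_per_sec   = 300;      # NFR peak TPS assumption
my $payload_kb         = 609;      # compressed payload per request (KB)
my $time_budget_sec    = 4.5;      # NFR time budget
my $cost_per_mbit_year = 51_000;   # assumed access cost per mbit per year

my $raw_mb_per_sec = ($requests_per_sec * $payload_kb) / 1024;   # ~178MB generated per second
my $mb_per_sec     = $raw_mb_per_sec / $time_budget_sec;         # ~39MB/sec sustained
my $mbit           = $mb_per_sec * 8;                            # ~315mbit of bandwidth
my $cost_mil       = $mbit * $cost_per_mbit_year / 1_000_000;    # ~$16mil per year

printf "%.0fMB raw, %.1fMB/sec, %.0fmbit, \$%.2fmil/year\n",
    $raw_mb_per_sec, $mb_per_sec, $mbit, $cost_mil;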

It is quite likely that in the 3-register scenario this problem gets worse, if only because 1000-record pages are 35% larger than 730-record ones.

OpalRussAEMO commented 2 years ago

Thanks Stuart for doing the math. I am working on some samples using (masked) production data to get a more accurate idea of the file size; I aim to provide them early next week. So far the file sizes do look a lot smaller than my mocked-up data samples. I did go with the worst case of 5-decimal-place intervals, so that probably blew it out as well.

OpalRussAEMO commented 2 years ago

Please see zip file for sample usage files using (masked) production data. I've included a summary file in the zip.

usageData.zip

perlboy commented 2 years ago

So, after a little bit of work I created an experiment (in Perl 🤟 ) to process NEM12 files into the CDR format, and then ran a few experiments to work towards optimising payload size.

Sample Set

After confirming the individual's consent to distribute, I anonymised the NMI and meter serial numbers. I chose NEM12 as the source because it allows anyone to fill in a form with their retailer and retrieve their historical usage. NMI 1234567 has a household meter with 15-minute intervals and 2 registers, PEAK and OFF_PEAK.

Code

The code used to create this is available here: https://github.com/perlboy/cdr-tools/tree/main/experiments/energy/nem2cdr

It allows the following command line values:

A few notes:

Experiment Methods

A number of experiments were written to see how far I could get the source data to shrink.

00 Baseline

Stock NEM12 data converted into a CDR payload, to begin with.

You can take a look at the output directory on GitHub for the baseline content.

01 No Read Object

This takes the 00 Baseline output and, instead of specifying the value component inside the intervalReads array, converts it to a pure number array. In doing this it also moves substitutes to an array, not too dissimilar in general concept to how NEM does it.

Broadly speaking this converts:

"intervalReads":[{"value":0},{"value":0},{"value":0},.....]

To

{
"final_substitutes":["7-10"], <------ Exists only if there are substitutes
"intervalReads":[0,0,0,0,....]
}

02 No Read Object or Unit of Measure

This takes the results from 01 No Read Object and removes the unitOfMeasure since it is assumed it will generally be KWH.

03 No Actual Reads

This removes the intervalReads array entirely but leaves the aggregateValue (sum of reads) and readIntervalLength in place. The idea here was that a large number of ADR use cases quite possibly never need such high-fidelity data and may well be happy with a daily aggregate of usage. Unlike Banking, there is no Get Transactions vs. Get Transaction Detail split, so this seems like a worthwhile optimisation.
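
Illustratively, each read would then end up as little more than the following (values indicative only):

"intervalRead": {
   "aggregateValue": 1.676,
   "readIntervalLength": 15
}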

04 No Zero Value Reads

This strips zero-value reads from the intervals and instead bunches them up in a removedReads array. The idea here is that if we stop repeating zero it'll be smaller (or so I thought!). There are downsides to this approach though, because now intervalReads isn't a known length, which is a different idea from the way NEM files seem to work.

Broadly speaking this results in a payload that looks something like this:

"intervalRead": {
  "aggregateValue": 1.676,
  "removedReads":["1-31,59-96"],
  "readIntervalLength":15,
  "intervalReads":[{"value":0.002},{"value":0.001},{"value":0.001},{"value":0.001},{"value":0.029},{"value":0.088}, ....]

05 No Zero Value Reads or Read Object

This combines both 04 No Zero Value Reads and 01 No Read Object.

The result is that it looks something like this:

"intervalRead":
   {
   "removedReads":["1-30,53,56,59,61-96"],
   "aggregateValue":1.688,
   "readIntervalLength":15,
   "intervalReads": [0.013,0.041,0.101,0.071,0.245,0.165,0.091,.....]}

06 Nested Register IDs

So another strategy is to start shipping Register IDs inside the interval reads, which should hypothetically shrink the payload size in a somewhat exponential way as more Registers are added.

"intervalRead":
  {
   "readIntervalLength":15,
   "intervalReads": {
   "E1":
      {
         "aggregateValue":4.909,
         "suffix":"E1",
         "reads":[{"value":0.047},{"value":0.046},{"value":0.038}, ......]}

07 Nested Register IDs with No Read Object

"intervalRead": {
   "intervalReads":
      {
         "B1" : { "reads":[0,0,0,0,0, ...] }

Experiment Results

Raw Output

% ./nem2cdr.pl -f samples/1234567_20200519_20220519_20220519101257.csv -s -d output
----------------------------------------
Baseline Total Payload Size: 2386191
Baseline Compressed Payload Size: 177927 (-92.54%)
----------------------------------------
01 No Read Object Compressed Size: 155487 (-12.61%)
02 No Read Object or UOM Compressed Size: 154188 (-13.34%)
03 No Actual Reads Compressed Size: 19793 (-88.88%)
04 No zero value reads Compressed Size: 173880 (-2.27%)
05 No zero value reads or read object Compressed Size: 158668 (-10.82%)
06 Nested Register IDs Compressed Size: 175000 (-1.65%)
07 Nested Register IDs with No Read Object Compressed Size: 164305 (-7.66%)

Observations

Recommendations

Based on the above, the following recommendations can be made.

Remove the Object from the intervalReads and make it a raw array.

In addition, move substitutions to a standalone array with a collapsed structure. In the experiments this looked roughly like:

"intervalReads": [..., 0.062,0.062],
"final_substitutes":["7-10"]

This represents that items 7 through 10 were replaced with Final Substitutes (there's a borrowed sub to do this).
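
The borrowed sub isn't reproduced here, but a minimal sketch of the idea (hypothetical helper name, not the actual nem2cdr.pl code) is:

sub collapse_ranges {
    # Collapse a list of substituted read indices into "start-end" range
    # strings, e.g. (7, 8, 9, 10) becomes ["7-10"]
    my @indices = sort { $a <=> $b } @_;
    my @ranges;
    while (@indices) {
        my $start = my $end = shift @indices;
        $end = shift @indices while @indices && $indices[0] == $end + 1;
        push @ranges, $start == $end ? "$start" : "$start-$end";
    }
    return \@ranges;
}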

Suggestion 1 Recalc (applying percentages from above to here):
Raw compressed size of 1 second of throughput: 300 Requests × 532KB = 156MB of data
Time Budget (4.5sec) to Throughput: 156MB / 4.5 seconds = 34.6MB/sec = 277mbit of usage
Total maximum capital expenditure to retailers at maximum throughput: 277mbit × $51K/year = $14.127mil per year in access costs levied on the total retailer market

Introduce a with-interval-reads query parameter into Get Usage endpoints.

This would have a default value of false that omits the reproduction of intervalReads but keeps all other parts of the payload the same. The overall thinking here is that an ADR may want 2 years of data to begin with but would then drill down on the detailed usage data over time or in specific areas. This could be coupled with an NFR alteration or, if necessary, the introduction of Get Usage Detail endpoints.
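
As an illustration only (parameter name per this suggestion, not the published standard), a detailed request might then look something like:

GET /energy/electricity/servicepoints/{servicePointId}/usage?with-interval-reads=true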

Suggestion 2 Recalc (applying percentages from above to here):
Raw compressed size of 1 second of throughput: 300 Requests × 73KB = 21.4MB of data
Time Budget (4.5sec) to Throughput: 21.4MB / 4.5 seconds = 4.75MB/sec = 38mbit of usage
Total maximum capital expenditure to retailers at maximum throughput: 38mbit × $51K/year = $1.938mil per year in access costs levied on the total retailer market

OpalRussAEMO commented 2 years ago

Samples for changing interval values to an array, for a few scenarios. Note I haven't calculated compressed sizes; these are just raw sizes.

Example with 1000 days of 5 minute intervals

Example with 1000 days of 30 minute intervals, no subs, no export

Example with 1000 days of 30 minute intervals, export

Example with 1000 days of 30 minute intervals, some days with final subs

CDR-API-Stream commented 2 years ago

This issue was discussed in the Energy-specific MI call on 14th June. Below is the list of agreed outcomes:

The DSB will draft the schema resulting with above changes and post a comment for review.

CDR-API-Stream commented 2 years ago

Below is the updated structure for EnergyUsageRead accommodating the agreed changes stated in this comment and prepared in collaboration with AEMO:

{
    "data": {
      "reads": [
        {
          "servicePointId": "string",
          "registerId": "string",
          "registerSuffix": "string",
          "meterID": "string",
          "controlledLoad": true,
          "readStartDate": "string",
          "readEndDate": "string",
          "unitOfMeasure": "string",
          "readUType": "basicRead",
          "basicRead": {
            "quality": "ACTUAL",
            "value": 0
          },
          "intervalRead": { 
            "readIntervalLength": 0, // Conditional - Required when interval-reads query parameter is FULL or 30MIN
            "aggregateValue": 0,
            "intervalReads": [ //Conditional - Required when interval-reads query parameter is FULL or 30MIN.  Array of read values.
                "Number" // Read value
                ], 
            "readQualities": [ // Conditional - Required when interval-reads query parameter is FULL or 30MIN. Required to specify quality of reads that are not actual.  For read indices that are not specified, quality is assumed to be actual. 
                {
                    "startInterval": "PositiveInteger", // Mandatory - Starts from 1
                    "endInterval": "PositiveInteger", // Mandatory
                    "quality": ["FINAL_SUBSTITUTE", "SUBSTITUTE"] // Mandatory
                }
            ]
          }
        }
      ]
    },
    "links": {
      "self": "string",
      "first": "string",
      "prev": "string",
      "next": "string",
      "last": "string"
    },
    "meta": {
      "totalRecords": 0,
      "totalPages": 0
    }
  }
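
Reads not covered by a readQualities range are treated as ACTUAL. As an illustrative sketch only (not part of the proposal), a recipient could expand the ranges back into per-interval qualities along these lines:

sub expand_qualities {
    # Expand the proposed readQualities ranges into a per-interval quality
    # list, defaulting to ACTUAL where no range applies. quality is a single
    # value per range (FINAL_SUBSTITUTE or SUBSTITUTE).
    my ($read_count, $read_qualities) = @_;
    my @quality = ('ACTUAL') x $read_count;
    for my $rq (@$read_qualities) {
        # startInterval and endInterval are 1-based and inclusive
        $quality[$_ - 1] = $rq->{quality}
            for $rq->{startInterval} .. $rq->{endInterval};
    }
    return \@quality;
}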

Other Changes

Pending any further feedback, the DSB will recommend the above changes to the Chair for approval.

biza-io commented 2 years ago

We suggest 30MIN be renamed to MIN_30 so that it can be grouped with future aggregation additions and doesn't conflict with a number of compiler constraints related to enumerations beginning with a number.

CDR-API-Stream commented 2 years ago

This issue has been staged and can be reviewed here - https://github.com/ConsumerDataStandardsAustralia/standards-staging/compare/release/1.18.0...maintenance/514