ConsumerDataStandardsAustralia / standards-maintenance

This repository houses the interactions, consultations and work management to support the maintenance of baselined components of the Consumer Data Right API Standards and Information Security profile.

Get Usage For ... Shared Responsibility APIs Payload size #514

Closed OpalRussAEMO closed 1 year ago

OpalRussAEMO commented 2 years ago

Description

The payload size for the shared responsibility APIs 'Get Usage For Service Point (SR)' and 'Get Usage For Specific Service Points (SR)' will be quite substantial with the currently proposed format.

I have attached two mocked-up example JSON payloads for a single NMI, single-register interval meter capturing data in 5-minute intervals: one without a quality flag and one with a quality flag. Uncompressed, the file sizes are 14.5MB without the quality flag and 24.1MB with it. Zipped, they are 746KB without data quality and 912KB with data quality. Where a service point has a dedicated circuit (controlled load) register or a solar register, this will double (or triple where both are applicable) the amount of data to return.

Of concern is the ability to meet NFRs when transferring such a payload, as well as the energy market participants' market network bandwidth that would be consumed and the associated costs.

To determine whether this impact will be material for retailers in the context of meter data received via B2B, I have estimated the file size required to send 1 million interval days' worth of 5-minute data via a NEM12 file (the B2B format for sending meter data).

A file with 1000 days of 5-minute intervals came to approximately 2.38MB uncompressed and 31KB compressed. To send 1 million days of 5-minute intervals would be around 2,375MB uncompressed or 31MB compressed.

Based on the mock CDR usage file size, responding to only 100 CDR energy usage requests for 2 years of 5-minute interval data in a single register / single meter scenario would be 2,116MB uncompressed or 85MB compressed.

intervalData730DQ.zip intervalData730.zip

Area Affected

Shared Responsibility APIs - Get Usage for Service Point (SR) and Get Usage For Specific Service Points (SR)

https://consumerdatastandardsaustralia.github.io/standards/#get-usage-for-service-point-sr https://consumerdatastandardsaustralia.github.io/standards/#get-usage-for-specific-service-points-sr

Change Proposed

Issue raised for discussion and comment.

The current format requires repetition of Service Point, Meter & Register configuration details for every day of interval data (assumed to be due to the requirement to provide reads in date order). Separating out quality from interval values and making interval values a list would reduce the file size somewhat.

OpalRussAEMO commented 2 years ago

intervalData730b.zip

Following feedback, I've recreated the data payload with whitespace removed, which has greatly reduced the file size. Apologies for the incorrect format. It now comes down to 4MB unzipped and 717KB zipped for a single register over 730 days of 5-minute intervals, with no data quality included.

perlboy commented 2 years ago

Thanks for taking this on board, Opal, and for the initial analysis. Some baseline feedback below; I'll come back with potential optimisations soon.

Throughput Calculations

To calculate the absolute worst case for a single register with 5-minute intervals, assume the following:

Raw compressed size of 1 second of throughput: 300 Requests × 609KB = 178MB of data
Time Budget (4.5sec) to Throughput: 178MB / 4.5 seconds = 39MB/sec = 315mbit of usage
Total maximum capital expenditure to retailers at maximum throughput: 315mbit × $51K/year = $16.06mil per year in access costs levied on the total retailer market
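
For reference, a quick sketch of that arithmetic (assumptions as above: 300 requests/sec, a 609KB compressed payload, a 4.5 second time budget and roughly $51K per mbit per year; small differences from the quoted figures are rounding):

# Rough worst-case throughput and cost model for the figures above (sketch only)
use strict;
use warnings;

my $requests_per_sec   = 300;      # NFR peak TPS assumption
my $payload_kb         = 609;      # compressed payload per request (KB)
my $time_budget_sec    = 4.5;      # NFR time budget
my $cost_per_mbit_year = 51_000;   # assumed access cost per mbit per year

my $raw_mb_per_sec = ($requests_per_sec * $payload_kb) / 1024;   # ~178MB generated per second
my $mb_per_sec     = $raw_mb_per_sec / $time_budget_sec;         # ~39MB/sec sustained
my $mbit           = $mb_per_sec * 8;                            # ~315mbit of bandwidth
my $cost_mil       = $mbit * $cost_per_mbit_year / 1_000_000;    # ~$16mil per year

printf "%.0fMB raw, %.1fMB/sec, %.0fmbit, \$%.2fmil/year\n",
    $raw_mb_per_sec, $mb_per_sec, $mbit, $cost_mil;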

It is quite likely that in the 3-register scenario this problem gets worse, if only because 1000-record pages are 35% larger than 730-record ones.

OpalRussAEMO commented 2 years ago

Thanks Stuart for doing the math. I am working on some samples using (masked) production data to get a more accurate idea of the file size; I aim to provide them early next week. So far the file sizes do look a lot smaller than my mocked-up data samples. I did go with the worst case of 5-decimal-place intervals, so that probably blew it out as well.

OpalRussAEMO commented 2 years ago

Please see zip file for sample usage files using (masked) production data. I've included a summary file in the zip.

usageData.zip

perlboy commented 2 years ago

So, after a little bit of work I created an experiment (in Perl 🤟 ) to process NEM12 files into the CDR format, and then ran a few experiments to work towards optimising payload size.

Sample Set

After confirming the individual's consent to distribute, I anonymised the NMI and meter serial numbers. I chose NEM12 as the source because it allows anyone to fill in a form with their retailer and retrieve their historical usage. NMI 1234567 has a household meter with 15-minute intervals and 2 registers, PEAK and OFF_PEAK.

Code

The code used to create this is available here: https://github.com/perlboy/cdr-tools/tree/main/experiments/energy/nem2cdr

It allows the following command line values:

A few notes:

Experiment Methods

A number of experiments were written to see how far I could get the source data to shrink.

00 Baseline

Stock NEM12 data converted into a CDR payload, to begin with.

You can take a look at the output directory on GitHub for the baseline content.

01 No Read Object

This takes the 00 Baseline output and, instead of specifying the value component inside the intervalReads array, converts it to a pure number array. In doing this it also moves substitutes to an array, not too dissimilar in general concept to how NEM does it.

Broadly speaking this converts:

"intervalReads":[{"value":0},{"value":0},{"value":0},.....]

To

{
"final_substitutes":["7-10"], <------ Exists only if there are substitutes
"intervalReads":[0,0,0,0,....]
}

02 No Read Object or Unit of Measure

This takes the results from 01 No Read Object and removes the unitOfMeasure since it is assumed it will generally be KWH.

03 No Actual Reads

This removes the intervalReads array entirely but leaves the aggregateValue (sum of reads) and readIntervalLength in place. The idea here was that a large number of ADR use cases quite possibly never need such high-fidelity data and may well be happy with a daily aggregate of usage. Unlike Banking, there is no Get Transactions vs. Get Transaction Detail split, so this seems like a worthwhile optimisation.
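
Illustratively, each read would then end up as little more than the following (values indicative only):

"intervalRead": {
   "aggregateValue": 1.676,
   "readIntervalLength": 15
}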

04 No Zero Value Reads

This strips zero-value reads from the intervals and instead bunches them up in a removedReads array. The idea here is that if we stop repeating zero it'll be smaller (or so I thought!). There are downsides to this approach though, because now intervalReads isn't a known length, which is a different idea from the way NEM files seem to work.

Broadly speaking this results in a payload that looks something like this:

"intervalRead": {
  "aggregateValue": 1.676,
  "removedReads":["1-31,59-96"],
  "readIntervalLength":15,
  "intervalReads":[{"value":0.002},{"value":0.001},{"value":0.001},{"value":0.001},{"value":0.029},{"value":0.088}, ....]

05 No Zero Value Reads or Read Object

This combines both 04 No Zero Value Reads and 01 No Read Object.

The result is that it looks something like this:

"intervalRead":
   {
   "removedReads":["1-30,53,56,59,61-96"],
   "aggregateValue":1.688,
   "readIntervalLength":15,
   "intervalReads": [0.013,0.041,0.101,0.071,0.245,0.165,0.091,.....]}

06 Nested Register IDs

So another strategy is to start shipping Register IDs inside the interval reads, which should hypothetically shrink the payload size in a somewhat exponential way as more Registers are added.

"intervalRead":
  {
   "readIntervalLength":15,
   "intervalReads": {
   "E1":
      {
         "aggregateValue":4.909,
         "suffix":"E1",
         "reads":[{"value":0.047},{"value":0.046},{"value":0.038}, ......]}

07 Nested Register IDs with No Read Object

"intervalRead": {
   "intervalReads":
      {
         "B1" : { "reads":[0,0,0,0,0, ...] }

Experiment Results

Raw Output

% ./nem2cdr.pl -f samples/1234567_20200519_20220519_20220519101257.csv -s -d output
----------------------------------------
Baseline Total Payload Size: 2386191
Baseline Compressed Payload Size: 177927 (-92.54%)
----------------------------------------
01 No Read Object Compressed Size: 155487 (-12.61%)
02 No Read Object or UOM Compressed Size: 154188 (-13.34%)
03 No Actual Reads Compressed Size: 19793 (-88.88%)
04 No zero value reads Compressed Size: 173880 (-2.27%)
05 No zero value reads or read object Compressed Size: 158668 (-10.82%)
06 Nested Register IDs Compressed Size: 175000 (-1.65%)
07 Nested Register IDs with No Read Object Compressed Size: 164305 (-7.66%)

Observations

Recommendations

Based on the above, the following recommendations can be made.

Remove the Object from the intervalReads and make it a raw array.

In addition, move substitutions to a standalone array with a collapsed structure. In the experiments this looked roughly like:

"intervalReads": [..., 0.062,0.062],
"final_substitutes":["7-10"]

This represents that items 7 through 10 were replaced with Final Substitutes (there's a borrowed sub to do this).
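
The borrowed sub isn't reproduced here, but a minimal sketch of the idea (hypothetical helper name, not the actual nem2cdr.pl code) is:

sub collapse_ranges {
    # Collapse a list of substituted read indices into "start-end" range
    # strings, e.g. (7, 8, 9, 10) becomes ["7-10"]
    my @indices = sort { $a <=> $b } @_;
    my @ranges;
    while (@indices) {
        my $start = my $end = shift @indices;
        $end = shift @indices while @indices && $indices[0] == $end + 1;
        push @ranges, $start == $end ? "$start" : "$start-$end";
    }
    return \@ranges;
}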

Suggestion 1 Recalc (applying percentages from above to here):
Raw compressed size of 1 second of throughput: 300 Requests × 532KB = 156MB of data
Time Budget (4.5sec) to Throughput: 156MB / 4.5 seconds = 34.6MB/sec = 277mbit of usage
Total maximum capital expenditure to retailers at maximum throughput: 277mbit × $51K/year = $14.127mil per year in access costs levied on the total retailer market

Introduce a with-interval-reads query parameter into Get Usage endpoints.

This would have a default value of false that omits the reproduction of intervalReads but keeps all other parts of the payload the same. The overall thinking here is that an ADR may want 2 years of data to begin with but would then drill down on the detailed usage data over time or in specific areas. This could be coupled with an NFR alteration or, if necessary, the introduction of Get Usage Detail endpoints.
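
As an illustration only (parameter name per this suggestion, not the published standard), a detailed request might then look something like:

GET /energy/electricity/servicepoints/{servicePointId}/usage?with-interval-reads=true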

Suggestion 2 Recalc (applying percentages from above to here):
Raw compressed size of 1 second of throughput: 300 Requests × 73KB = 21.4MB of data
Time Budget (4.5sec) to Throughput: 21.4MB / 4.5 seconds = 4.75MB/sec = 38mbit of usage
Total maximum capital expenditure to retailers at maximum throughput: 38mbit × $51K/year = $1.938mil per year in access costs levied on the total retailer market

OpalRussAEMO commented 2 years ago

Samples for changing interval values to an array, for a few scenarios. Note I haven't calculated compressed sizes; these are just raw sizes.

Example with 1000 days of 5 minute intervals

Example with 1000 days of 30 minute intervals, no subs, no export

Example with 1000 days of 30 minute intervals, export

Example with 1000 days of 30 minute intervals, some days with final subs

CDR-API-Stream commented 2 years ago

This issue was discussed in the Energy-specific MI call on 14th June. Below is the list of agreed outcomes:

The DSB will draft the schema resulting with above changes and post a comment for review.

CDR-API-Stream commented 2 years ago

Below is the updated structure for EnergyUsageRead accommodating the agreed changes stated in this comment and prepared in collaboration with AEMO:

{
    "data": {
      "reads": [
        {
          "servicePointId": "string",
          "registerId": "string",
          "registerSuffix": "string",
          "meterID": "string",
          "controlledLoad": true,
          "readStartDate": "string",
          "readEndDate": "string",
          "unitOfMeasure": "string",
          "readUType": "basicRead",
          "basicRead": {
            "quality": "ACTUAL",
            "value": 0
          },
          "intervalRead": { 
            "readIntervalLength": 0, // Conditional - Required when interval-reads query parameter is FULL or 30MIN
            "aggregateValue": 0,
            "intervalReads": [ //Conditional - Required when interval-reads query parameter is FULL or 30MIN.  Array of read values.
                "Number" // Read value
                ], 
            "readQualities": [ // Conditional - Required when interval-reads query parameter is FULL or 30MIN. Required to specify quality of reads that are not actual.  For read indices that are not specified, quality is assumed to be actual. 
                {
                    "startInterval": "PositiveInteger", // Mandatory - Starts from 1
                    "endInterval": "PositiveInteger", // Mandatory
                    "quality": ["FINAL_SUBSTITUTE", "SUBSTITUTE"] // Mandatory
                }
            ]
          }
        }
      ]
    },
    "links": {
      "self": "string",
      "first": "string",
      "prev": "string",
      "next": "string",
      "last": "string"
    },
    "meta": {
      "totalRecords": 0,
      "totalPages": 0
    }
  }
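
Reads not covered by a readQualities range are treated as ACTUAL. As an illustrative sketch only (not part of the proposal), a recipient could expand the ranges back into per-interval qualities along these lines:

sub expand_qualities {
    # Expand the proposed readQualities ranges into a per-interval quality
    # list, defaulting to ACTUAL where no range applies. quality is a single
    # value per range (FINAL_SUBSTITUTE or SUBSTITUTE).
    my ($read_count, $read_qualities) = @_;
    my @quality = ('ACTUAL') x $read_count;
    for my $rq (@$read_qualities) {
        # startInterval and endInterval are 1-based and inclusive
        $quality[$_ - 1] = $rq->{quality}
            for $rq->{startInterval} .. $rq->{endInterval};
    }
    return \@quality;
}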

Other Changes

Pending any further feedback, the DSB will recommend the above changes to the Chair for approval.

biza-io commented 2 years ago

We suggest 30MIN be renamed to MIN_30 so that it can be grouped with future aggregation additions and doesn't conflict with a number of compiler constraints related to enumerations beginning with a number.

CDR-API-Stream commented 2 years ago

This issue has been staged and can be reviewed here - https://github.com/ConsumerDataStandardsAustralia/standards-staging/compare/release/1.18.0...maintenance/514