Closed by OpalRussAEMO 1 year ago.
Following feedback, I've recreated the data payload with whitespace removed, which has greatly reduced the file size. Apologies for the incorrect format. It now comes down to 4MB unzipped and 717KB zipped for a single register over 730 days of 5 minute intervals, with no data quality included.
Thanks for taking this on board, Opal, and for the initial analysis. Some baseline feedback first; I'll come back with potential optimisations soon.
Three compression formats were compared: lzw (compress), zlib (deflate) and gzip (gzip). lzw favours speed over size so is larger. Taking the latest intervalData730b.zip it compresses as follows:

- lzw: 1.2MB
- zlib: 707KB
- gzip: 707KB
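For anyone wanting to reproduce the zlib/gzip numbers, a minimal sketch along these lines should do it (the lzw figure would need a separate CPAN module such as Compress::LZW, which is not shown here):

```perl
#!/usr/bin/env perl
# Rough sketch: report zlib (deflate) and gzip sizes for a generated payload file.
use strict;
use warnings;
use Compress::Zlib;
use IO::Compress::Gzip qw(gzip $GzipError);

my $file = shift @ARGV or die "usage: $0 <payload.json>\n";
open my $fh, '<', $file or die "open $file: $!";
my $raw = do { local $/; <$fh> };
close $fh;

my $deflated = compress($raw);                        # zlib/deflate
gzip \$raw => \my $gzipped or die "gzip failed: $GzipError";

printf "raw:  %d bytes\n", length $raw;
printf "zlib: %d bytes\n", length $deflated;
printf "gzip: %d bytes\n", length $gzipped;
```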
Throughput Calculations

To calculate the absolute worst case for a single register with 5 minute intervals, assume the following:
- Raw compressed size of 1 second of throughput: 300 requests × 609KB = 178MB of data
- Time budget (4.5 sec) to throughput: 178MB / 4.5 seconds = 39MB/sec = 315Mbit of usage
- Total maximum capital expenditure to retailers at maximum throughput: 315Mbit × $51K/year = $16.06mil per year in access costs levied on the total retailer market
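The same arithmetic as a small sketch, so the recalculations further down can be reproduced by changing only the per-response size (the 300 requests/sec, 4.5 second budget and $51K per Mbit per year figures are taken from the numbers above):

```perl
#!/usr/bin/env perl
# Back-of-envelope throughput and cost model used in this comment (a sketch, not an NFR definition).
use strict;
use warnings;

my $requests_per_sec = 300;       # assumed peak request rate
my $kb_per_response  = 609;       # compressed payload size per response, in KB
my $time_budget_sec  = 4.5;       # response time budget
my $cost_per_mbit    = 51_000;    # assumed access cost, $ per Mbit per year

my $mb_per_sec_raw = ( $requests_per_sec * $kb_per_response ) / 1024;   # MB of data produced per second
my $mb_per_sec     = $mb_per_sec_raw / $time_budget_sec;                # sustained rate over the budget
my $mbit           = $mb_per_sec * 8;                                   # MB/s to Mbit/s
my $annual_cost    = $mbit * $cost_per_mbit;

printf "Raw data per second:  %.0f MB\n",               $mb_per_sec_raw;
printf "Sustained throughput: %.1f MB/s (%.0f Mbit)\n", $mb_per_sec, $mbit;
printf "Indicative cost:      \$%.2f million/year\n",   $annual_cost / 1_000_000;
```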
It is quite likely that in the 3 register scenario this problem gets worse, if only because 1000-record pages are 35% larger than 730-record ones.
Thanks Stuart for doing the math. I am working on some samples using (masked) production data to get a more accurate idea of the file size, and aim to provide these early next week. So far the file sizes do look a lot smaller than my mocked up data samples. I did go with the worst case of 5 decimal place intervals, so that probably blew it out as well.
Please see zip file for sample usage files using (masked) production data. I've included a summary file in the zip.
So, after a little bit of work I created an experiment (in Perl 🤟 ) to process NEM12 files into the CDR Format and then a few experiments to work towards optimising payload size.
After confirming the individual's consent to distribute, I anonymised the NMI and Meter Serial numbers. I chose NEM12 as the source because it allows anyone to fill in a form with their retailer and retrieve their historical usage. The anonymised NMI 1234567 has a household meter with 15 minute intervals and 2 registers, PEAK and OFF_PEAK.
The code used to create this is available here: Source: https://github.com/perlboy/cdr-tools/tree/main/experiments/energy/nem2cdr
It allows the following command line values:
- -f with input NEM12 file
- -d if specified provides the directory to output generated files
- -s if specified summarises the differences (as per below)

A few notes:
- The output is the EnergyUsageRead result, not the whole box and dice with links etc.
- 400 records are used for identifying substitutes

A number of experiments were written to see how far I could get the source data to shrink.
00 Baseline

Stock NEM12 data in, to output a CDR Payload to begin with. You can take a look at the output directory in the GitHub repo for the baseline content.
01 No Read Object

This takes the 00 Baseline output and, instead of specifying the value component inside the intervalReads array, converts it to a pure number array. In doing this it also moves substitutes to an array, not too dissimilar in general concept to how NEM does it.

Broadly speaking this converts:

"intervalReads":[{"value":0},{"value":0},{"value":0},.....]

to:

{
  "final_substitutes":["7-10"], <------ Exists only if there are substitutes
  "intervalReads":[0,0,0,0,....]
}
02 No Read Object or UOM

This takes the results from 01 No Read Object and removes the unitOfMeasure, since it is assumed it will generally be KWH.
03 No Actual Reads

This removes the intervalReads array entirely but leaves the aggregateValue (sum of reads) and readIntervalLength in place. The idea here was that a large number of ADR use cases quite possibly never need such high fidelity data and may well be happy with a daily aggregate of usage. Unlike Banking there is no Get Transactions vs. Get Transaction Detail, so this seems like a worthwhile optimisation.
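As a rough sketch (field names here mirror the payload; the helper itself is illustrative, not part of nem2cdr.pl), the aggregate-only shape is just the sum of the interval values plus the interval length:

```perl
# Sketch: reduce a day's interval reads to the aggregate-only representation
# used in this experiment (sum of reads plus interval length, no intervalReads).
use strict;
use warnings;
use List::Util qw(sum0);

sub aggregate_only {
    my ($interval_length_minutes, @values) = @_;
    return {
        readIntervalLength => $interval_length_minutes,
        aggregateValue     => sum0(@values),
    };
}
```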
04 No Zero Value Reads

This strips zero value reads from the intervals and instead bunches them up in a removedReads array. The idea here is that if we stop repeating zero it'll be smaller (or so I thought!). There are downsides to this approach though, because now intervalReads isn't a known length, which is a different idea to the way NEM files seem to work.
Broadly speaking this results in a payload that looks something like this:
"intervalRead": {
"aggregateValue": 1.676,
"removedReads":["1-31,59-96"],
"readIntervalLength":15,
"intervalReads":[{"value":0.002},{"value":0.001},{"value":0.001},{"value":0.001},{"value":0.029},{"value":0.088}, ....]
05 No Zero Value Reads or Read Object

This combines both 04 No Zero Value Reads and 01 No Read Object. The result looks something like this:
"intervalRead":
{
"removedReads":["1-30,53,56,59,61-96"],
"aggregateValue":1.688,
"readIntervalLength":15,
"intervalReads": [0.013,0.041,0.101,0.071,0.245,0.165,0.091,.....]}
06 Nested Register IDs

So another strategy is to start shipping Register IDs inside the interval reads, which should hypothetically shrink the payload size by an increasing margin as more Registers are added.
"intervalRead":
{
"readIntervalLength":15,
"intervalReads": {
"E1":
{
"aggregateValue":4.909,
"suffix":"E1",
"reads":[{"value":0.047},{"value":0.046},{"value":0.038}, ......]}
"intervalRead": {
"intervalReads":
{
"B1" : { "reads":[0,0,0,0,0, ...] }
% ./nem2cdr.pl -f samples/1234567_20200519_20220519_20220519101257.csv -s -d output
----------------------------------------
Baseline Total Payload Size: 2386191
Baseline Compressed Payload Size: 177927 (-92.54%)
----------------------------------------
01 No Read Object Compressed Size: 155487 (-12.61%)
02 No Read Object or UOM Compressed Size: 154188 (-13.34%)
03 No Actual Reads Compressed Size: 19793 (-88.88%)
04 No zero value reads Compressed Size: 173880 (-2.27%)
05 No zero value reads or read object Compressed Size: 158668 (-10.82%)
06 Nested Register IDs Compressed Size: 175000 (-1.65%)
07 Nested Register IDs with No Read Object Compressed Size: 164305 (-7.66%)
Not sending the intervalReads by default obviously improved the payload size a lot, by some 88%+.

Based on the above the following recommendations can be made.
Suggestion 1: Remove the value Object from the intervalReads and make it a raw array. In addition, move substitutions to a standalone array with a collapsed structure. In the experiments this looked roughly like:
"intervalReads": [..., 0.062,0.062],
"final_substitutes":["7-10"]
This represents that items 7 through 10 were replaced with Final Substitutes (there's a borrowed sub to do this; a minimal equivalent is sketched below).
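The borrowed sub isn't reproduced here, but a rough equivalent that collapses a list of 1-based indices into range strings might look like this:

```perl
# Sketch: collapse 1-based indices into "start-end" range strings,
# e.g. (7, 8, 9, 10, 15) becomes ("7-10", "15").
use strict;
use warnings;

sub collapse_ranges {
    my @idx = sort { $a <=> $b } @_;
    my @ranges;
    while (@idx) {
        my $start = my $end = shift @idx;
        $end = shift @idx while @idx && $idx[0] == $end + 1;
        push @ranges, $start == $end ? "$start" : "$start-$end";
    }
    return @ranges;
}
```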
Suggestion 1 Recalc (applying percentages from above to here):

- Raw compressed size of 1 second of throughput: 300 requests × 532KB = 156MB of data
- Time budget (4.5 sec) to throughput: 156MB / 4.5 seconds = 34.6MB/sec = 277Mbit of usage
- Total maximum capital expenditure to retailers at maximum throughput: 277Mbit × $51K/year = $14.127mil per year in access costs levied on the total retailer market
Suggestion 2: Introduce a with-interval-reads query parameter into the Get Usage endpoints. This would have a default value of false that omits the reproduction of intervalReads but keeps all other parts of the payload the same. The overall thinking here is that an ADR may want 2 years of data to begin with, but would then drill down on the detailed usage data over time or in specific areas. This could be coupled with NFR alteration or, if necessary, the introduction of Get Usage Detail endpoints.
Suggestion 2 Recalc (applying percentages from above to here):

- Raw compressed size of 1 second of throughput: 300 requests × 73KB = 21.4MB of data
- Time budget (4.5 sec) to throughput: 21.4MB / 4.5 seconds = 4.75MB/sec = 38Mbit of usage
- Total maximum capital expenditure to retailers at maximum throughput: 38Mbit × $51K/year = $1.938mil per year in access costs levied on the total retailer market
Samples for changing interval values to array for a few scenarios. Note I haven't calculated compressed sizes. These are just raw.
Example with 1000 days of 5 minute intervals
Example with 1000 days of 30 minute intervals, no subs, no export
Example with 1000 days of 30 minute intervals, export
Example with 1000 days of 30 minute intervals, some days with final subs
This issue was discussed in the Energy specific MI call on 14th June. Below is the list of agreed outcomes:
- The structure of intervalReads will be simplified by making it an array of actual reads. Reads of other quality will be represented separately.

The DSB will draft the schema resulting from the above changes and post a comment for review.
Below is the updated structure for EnergyUsageRead accommodating the agreed changes stated in this comment and prepared in collaboration with AEMO:
{
"data": {
"reads": [
{
"servicePointId": "string",
"registerId": "string",
"registerSuffix": "string",
"meterID": "string",
"controlledLoad": true,
"readStartDate": "string",
"readEndDate": "string",
"unitOfMeasure": "string",
"readUType": "basicRead",
"basicRead": {
"quality": "ACTUAL",
"value": 0
},
"intervalRead": {
"readIntervalLength": 0, // Conditional - Required when interval-reads query parameter is FULL or 30MIN
"aggregateValue": 0,
"intervalReads": [ //Conditional - Required when interval-reads query parameter is FULL or 30MIN. Array of read values.
"Number" // Read value
],
"readQualities": [ // Conditional - Required when interval-reads query parameter is FULL or 30MIN. Required to specify quality of reads that are not actual. For read indices that are not specified, quality is assumed to be actual.
{
"startInterval": "PositiveInteger", // Mandatory - Starts from 1
"endInterval": "PositiveInteger", // Mandatory
"quality": ["FINAL_SUBSTITUTE", "SUBSTITUTE"] // Mandatory
}
]
}
}
]
},
"links": {
"self": "string",
"first": "string",
"prev": "string",
"next": "string",
"last": "string"
},
"meta": {
"totalRecords": 0,
"totalPages": 0
}
}
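To illustrate how a data recipient would interpret the proposed shape, here is a hedged sketch (not part of the standard) that expands readQualities back into a per-interval quality list; any interval not covered by a range defaults to ACTUAL as per the schema comment above, and in a payload instance quality would be a single value such as "SUBSTITUTE":

```perl
# Sketch: derive a per-interval quality list from the proposed
# intervalRead.intervalReads / intervalRead.readQualities structure.
use strict;
use warnings;

sub per_interval_quality {
    my ($interval_read) = @_;    # hashref shaped like the intervalRead object above
    my $count     = scalar @{ $interval_read->{intervalReads} // [] };
    my @qualities = ('ACTUAL') x $count;            # default where no range applies
    for my $rq ( @{ $interval_read->{readQualities} // [] } ) {
        $qualities[ $_ - 1 ] = $rq->{quality}       # startInterval/endInterval are 1-based, inclusive
            for $rq->{startInterval} .. $rq->{endInterval};
    }
    return \@qualities;
}
```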
Other Changes
A query parameter, interval-reads, will be added to the energy usage APIs, specifically the following:
Pending any further feedback, the DSB will recommend the above changes to the Chair for approval.
We suggest 30MIN be renamed to MIN_30 so that it can both be grouped with future aggregation additions and avoid conflicting with a number of compiler constraints related to enumerations beginning with a number.
This issue has been staged and can be reviewed here - https://github.com/ConsumerDataStandardsAustralia/standards-staging/compare/release/1.18.0...maintenance/514
Description
The payload size for the shared responsibility APIs 'Get Usage for Service Point (SR)' and 'Get Usage for Specific Service Points (SR)' will be quite substantial with the current proposed format.
I have attached two mocked up example JSON payloads for a single NMI, single register interval meter capturing data in 5 minute intervals: one with no quality flag, one with a quality flag. Uncompressed, the file size is 14.5MB without the quality flag and 24.1MB with the quality flag. Zipped, the file size is 746KB without data quality and 912KB with data quality. Where a service point has a dedicated circuit (controlled load) register or a solar register, this will double (or triple where both are applicable) the amount of data to return.
Of concern is the ability to meet NFRs when transferring such a payload, as well as the network bandwidth that would be taken up for energy market participants and the associated costs.
To determine if this impact will be material for retailers in the context of meter data received via B2B, I have estimated the file size required to send 1 million interval-days' worth of 5 minute data via a NEM12 file (the B2B format for sending meter data).
A file with 1000 days of 5 minute intervals came to approximately 2.38MB uncompressed and 31KB compressed. To send 1 million days of 5 minute intervals would be around 2,375MB uncompressed or 31MB compressed.
Based on the mock CDR usage file size, responding to only 100 CDR energy usage requests for 2 years of 5 minute interval data in single register / single meter scenarios would involve 2,116MB uncompressed or 85MB compressed.
intervalData730DQ.zip intervalData730.zip
Area Affected
Shared Responsibility APIs - Get Usage for Service Point (SR) and Get Usage For Specific Service Points (SR)
https://consumerdatastandardsaustralia.github.io/standards/#get-usage-for-service-point-sr https://consumerdatastandardsaustralia.github.io/standards/#get-usage-for-specific-service-points-sr
Change Proposed
Issue raised for discussion and comment.
The current format requires repetition of Service Point, Meter and Register configuration details for every day of interval data (assumed to be due to the requirement to provide reads in date order). Separating out quality from interval values and making the interval values a list would reduce the file size somewhat.