dss-extensions / OpenDSSDirect.py

OpenDSSDirect.py: a cross-platform Python package that implements a native/direct library interface to the alternative OpenDSS engine from DSS-Extensions.org
https://dss-extensions.org/OpenDSSDirect.py/

Integration of hdf/parquet loadshapes #98

Open tarekelgindy opened 3 years ago

tarekelgindy commented 3 years ago

Feature Request

Following up on the SourceForge discussion here: https://sourceforge.net/p/electricdss/discussion/861976/thread/59230a2c2d/

This is regarding support for reading loadshape information in HDF/Parquet format in the DSS-Extensions suite. My understanding is that the existing memory allocation mechanisms (such as those here https://github.com/dss-extensions/dss_capi/blob/master/src/CAPI/CAPI_LoadShapes.pas#L453) could be leveraged to stream data from HDF and Parquet files, and that there is already some existing code which could be backported to support this.

I'd be very supportive of any efforts to integrate this workflow into opendssdirect.py and opendssdirect.jl

Furthermore, let me know if there is any interest in allowing multiple loadshapes to be read from a single HDF or Parquet file. This could significantly improve the performance of any HDF/Parquet reader.
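
As one possible illustration of that idea (not an existing feature; the file name, column names, and dtype below are arbitrary), a single wide Parquet file with one float32 column per loadshape can be written and read back in one shot with PyArrow:

```python
# Hypothetical layout: one wide Parquet file, one float32 column per loadshape.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

npts, nshapes = 35040, 100  # e.g. 1 year at 15-minute resolution

# Write: one column per shape
table = pa.table({f"shape_{i:04d}": np.random.rand(npts).astype(np.float32)
                  for i in range(nshapes)})
pq.write_table(table, "loadshapes.parquet")

# Read: a single file open/parse yields all shapes at once
table = pq.read_table("loadshapes.parquet")
pmult = {name: table[name].to_numpy() for name in table.column_names}
```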

Happy to move this issue to the dss_capi if it makes more sense for it to live there.

PMeira commented 3 years ago

The API for external memory is exposed basically through LoadShapes_Set_Points and LoadShapes_SetMaxPandQ. If everything is initialized correctly, LoadShapes_SetMaxPandQ is not really required for it to work.

As I mentioned on sf.net, here's a link to the current header docs:

https://github.com/dss-extensions/dss_capi/blob/0ef72db7eb046ad5d7a3731ff5ea33b8089b728c/include/dss_capi.h#L6099-L6111
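
For illustration only, a rough ctypes sketch of how that call could be driven from Python. The argument list shown is an assumption based on the header linked above and must be checked against the actual prototype; the library name, the `shape1` loadshape, and the buffer contents are all placeholders.

```python
# Rough sketch only: the prototype assumed here is
#   LoadShapes_Set_Points(Npts, HoursPtr, PMultPtr, QMultPtr, ExternalMemory, IsFloat32)
# which must be checked against the dss_capi header linked above (later versions
# also add a Stride argument, discussed further down in this thread).
import ctypes
import numpy as np

lib = ctypes.CDLL("libdss_capi.so")   # library name/path is platform dependent
# (engine/circuit initialization omitted)

npts = 35040
pmult = np.random.rand(npts).astype(np.float32)  # must stay alive while the engine uses it

# Create an empty loadshape and make it the active one
lib.Text_Set_Command(b"new LoadShape.shape1 npts=35040 minterval=15")
lib.LoadShapes_Set_Name(b"shape1")

# Point the active loadshape at the external float32 buffer (no copy)
lib.LoadShapes_Set_Points(
    ctypes.c_int32(npts),
    None,                                   # HoursPtr: fixed interval, no hour column
    pmult.ctypes.data_as(ctypes.c_void_p),  # PMultPtr
    None,                                   # QMultPtr
    ctypes.c_uint16(1),                     # ExternalMemory
    ctypes.c_uint16(1),                     # IsFloat32
)
```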

Furthermore let me know if there is any interest in allowing for reading multiple loadshapes from a single hdf

For HDF5, that's how our internal implementation handles things. We have tens of thousands of loadshapes compressed in some HDF files, with one-second resolution for a full year, partitioned into weeks. The process is basically as follows:

For this chunked approach, we also manage the time externally, especially since we use a variable time-step and custom controls. It's been a while since I wrote that. I can check the details later to see if I missed anything important.
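
A rough, hypothetical outline of that kind of week-chunked flow, for illustration only (this is not the Unicamp implementation; the HDF5 layout, dataset names, and solution handling are all assumptions):

```python
# Hypothetical sketch of a week-chunked HDF5 flow. Assumed layout: one float32
# dataset per week, shaped (points_in_week, n_shapes), plus a byte-string
# dataset holding the loadshape names.
import h5py
import opendssdirect as dss

with h5py.File("loadshapes_by_week.h5", "r") as f:
    shape_names = [n.decode() for n in f["shape_names"][:]]
    for week in range(52):
        chunk = f[f"week_{week:02d}"][:]          # (points_in_week, n_shapes)
        npts = chunk.shape[0]
        for j, name in enumerate(shape_names):
            dss.LoadShapes.Name(name)             # select an existing LoadShape object
            dss.LoadShapes.Npts(npts)
            dss.LoadShapes.PMult(chunk[:, j].tolist())
        # time is managed externally: solve just this week's steps, then move on
        dss.Solution.Number(npts)
        dss.Solution.Solve()
```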

For Parquet, we can leverage Parquet partitioned datasets. I've been using those via PyArrow and the performance has been great. I haven't used the S3 support though, only files in the local network.
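
For example, with a Hive-partitioned dataset, one partition (say, a week) can be pulled without touching the rest (the "week" partition key and the column name below are assumptions):

```python
# Reading one partition of a Hive-partitioned Parquet dataset with PyArrow.
import pyarrow.dataset as ds

dataset = ds.dataset("loadshapes_dataset/", format="parquet", partitioning="hive")
week3 = dataset.to_table(filter=ds.field("week") == 3)   # only week 3 is read
pmult = week3.column("shape_0001").to_numpy()
```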

So, our implementation (at Unicamp) is not general enough. But since both HDF5 and Parquet/Arrow support mechanisms to add extra metadata, we can use that to complement any non-trivial info.

@tarekelgindy, since you mentioned a resolution of 15 minutes, loading everything at once might not be a problem if you don't have too many different loadshapes in the simulation. Remember that using float32 would halve the memory requirements too.
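
For a rough sense of the scale involved (the shape count below is a hypothetical example):

```python
# 1 year at 15-minute resolution = 35,040 points per loadshape
npts = 365 * 24 * 4          # 35,040
nshapes = 10_000             # hypothetical number of loadshapes
gib = 1024 ** 3
print(npts * nshapes * 8 / gib)  # float64: ~2.6 GiB
print(npts * nshapes * 4 / gib)  # float32: ~1.3 GiB
```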

Happy to move this issue to the dss_capi if it makes more sense for it to live there.

I think we can keep this here, at the moment.

I can see two approaches. For the short term:

For the long term, a "LoadShape Manager" would be ideal.

Besides loadshapes, there are other reasons I want to integrate with Arrow.

Well, those are my main ideas. Any thoughts?

By the way, the presentation I mentioned is in this panel session: https://resourcecenter.ieee-pes.org/conferences/general-meeting/PES_CVS_GM20_0806_4668_SLD.html -- it's "free for Institutional Subscribers".

PMeira commented 3 years ago

I plan to work a bit on this Thursday and Friday. Davis just added the memory-mapped version to the official code, so it will be good to compare them. For convenience: https://github.com/dss-extensions/dss_capi/commit/3c0fa5fc09257503db04fd9eed8fe9eb4a1eac88

I'll save my comments for later, when I have some actual performance numbers.

kdheepak commented 3 years ago

Interesting! Thanks for sharing! I’m also curious about the performance benchmarks. Is there anything I can do to help?

PMeira commented 3 years ago

@kdheepak I can keep you posted. Initially I can reuse some code we use at Unicamp, so it won't be too much work.

Depending on how things go, we can decide whether to explore the results further. EPRI's implementation doesn't handle the main topic of this issue (HDF and Parquet), but it's good for comparison. For reproducibility, using an open dataset of loadshapes and a basic circuit would be better. A couple of different circuit sizes would be good too (the sample circuits from OpenDSS are not that big). If there isn't much to highlight, we can just skip this.

Besides the loadshapes, there are other aspects of HPC and distributed execution that I'd like to evaluate better in the future. We (including @tarekelgindy) could talk about that at a later date, if you're available.

tarekelgindy commented 3 years ago

Sorry for not getting back to this sooner! I think what you outlined in your first post here sounds like a great approach. I'm really glad that we'll be able to chunk the data - I think that will make a big difference.

If you have a pre-release version, I could probably test it with a few of our larger datasets and provide feedback, if that helps. Definitely happy to discuss streamlining this with other HPC workflows as well!

PMeira commented 3 years ago

On the performance issues

The numbers from the document Davis linked surprised me, so I tried to reproduce them. For reference:

When using memory mapping, the time required for uploading a model containing a significant amount of load shapes will be drastically reduced without compromising the simulation performance (depending on the source format). In a model with 2768 buses (3952 nodes) with 724 load shapes in SNG format, loading the model into memory without memory mapping takes about 9388 seconds. Otherwise, by using memory mapping the loading time gets reduced to 760 ms as shown in Figure 36.

That 9388 seconds seemed like a lot. Using the IEEE8500 test case as a base circuit, I generated 2913 random loadshapes with 525600 points each (1 year, 1-minute resolution) as float32 binary files to test the basic performance. Since EPRI's OpenDSS officially only supports Windows, this first loading test (no solution) uses a Windows desktop machine with an SSD and plenty of free RAM. Results so far are below. These times fluctuate based on IO load, caching, etc., and the numbers are for a warmed system. I also checked that the data loaded by DSS C-API was correct.

| `New LoadShape...` command args | DSS C-API 0.10.7 | DSS C-API WIP | OpenDSS 9.2.0.1 | OpenDSS SVN r3131 |
|---|---|---|---|---|
| `sngfile="loadshape.sng"` | DNF | 3 s | DNF | DNF |
| `mult=(sngfile="loadshape.sng")` | 15 s | 9.5 s | 630 s | 630 s |
| `MemoryMapping=Yes mult=(sngfile=loadshape.sng)` | - | 0.3 s | - | 0.3 s |

Why are the first two rows different?

(For brevity, I'll omit numbers for float64 files)

I didn't test CSV/TXT variants of the methods since I firmly believe they shouldn't be used for large-scale circuits/simulations at all.

With the changes, if it's a long simulation, it doesn't really matter which method is used for "legacy" loadshapes. Something like 9388 seconds would of course be inadvisable compared to 3 seconds.

For a final data point of interest, the time to fill the loadshapes via the official OpenDSS COM interface is also very long (win32com or comtypes, both >40 min), while DSS_Python/ODD.py takes around 7.7 s. That 7.7 s was already large enough to justify LoadShapes_Set_Points in the first place.
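
For context, that kind of fill through the regular Python API is essentially a loop of this shape (a sketch, not the benchmark code itself; `get_pmult` stands in for a hypothetical data source):

```python
# Sketch of filling every loadshape through the regular Python API.
import numpy as np
import opendssdirect as dss

def get_pmult(name):
    # hypothetical data source; replace with HDF5/Parquet/NumPy reads
    return np.random.rand(35040).astype(np.float32)

idx = dss.LoadShapes.First()
while idx > 0:
    name = dss.LoadShapes.Name()
    pmult = get_pmult(name)
    dss.LoadShapes.Npts(len(pmult))
    dss.LoadShapes.PMult(list(pmult))
    idx = dss.LoadShapes.Next()
```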

Current progress

I merged (adapted/rewrote) most of the changes related to memory-mapped loadshapes from the official code and started porting it to Linux. I decided to do this work in the 0.10.x branch (hopefully the last major change in that branch), so I had to backport the relevant changes.

So far, the main change is that I added a stride parameter to LoadShapes_Set_Points, and made the relevant arrays 0-based to simplify a lot of things.
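
To illustrate what a stride makes possible (my reading, assuming the stride is counted in elements; verify against the dss_capi docs): with a dense points-by-shapes matrix, each loadshape can be handed to the engine as a strided view of the same buffer, with no copies.

```python
# Strides over a dense float32 matrix of shape (npts, nshapes).
import numpy as np

npts, nshapes = 35040, 100
data = np.random.rand(npts, nshapes).astype(np.float32)   # row-major (C order)

j = 7
col = data[:, j]                                 # view over the same memory, no copy
print(col.strides[0] // data.itemsize)           # element stride == nshapes (100)

data_f = np.asfortranarray(data)                 # column-major: each shape contiguous
print(data_f[:, j].strides[0] // data_f.itemsize)  # element stride == 1
```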

Next step is running some long simulations to assess and document the performance across some variations:

The results will guide the Parquet implementation. I'll get around to this in a few days.

tarekelgindy commented 3 years ago

Thanks for the update, Paulo. It's pretty interesting to see what a difference the memory mapping made in OpenDSS. I'll definitely be using this in future versions that require .csv or .txt file inputs.

PMeira commented 3 years ago

It's pretty interesting to see what a difference the memory mapping made in Opendss.

@tarekelgindy This is expected for this first test -- the engine is not using the loadshape data at all, only getting file handles. But I also expect that it won't affect the overall simulation time that much; the main advantage is the reduced memory load (which is good, of course). It will be interesting to compare Windows vs. Linux too (as a whole, Linux IO is much better and more versatile).

PMeira commented 3 years ago

To add initial info on the timings for loading the circuit, using DSS C-API's LoadShapes_Set_Points from Python and NumPy (np.memmap):

So the extra time for individual files is probably due to the high number of file handles, both from the Python side and in the DSS engine. And a reminder that these numbers include the Python overhead.

What I like about LoadShapes_Set_Points is that it's very versatile. We can use it for memory mapping, shared memory, different memory layouts, chunked data, etc.
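
For example, on the Python side the same call can be fed from either a memory-mapped file or a shared-memory block (a sketch; handing the buffer to the engine is omitted, and the file/segment names are hypothetical):

```python
# Two float32 buffers that could back loadshapes without extra copies.
import numpy as np
from multiprocessing import shared_memory

npts = 525600  # 1 year at 1-minute resolution

# 1) Memory-mapped file: the OS pages the data in on demand
mm = np.memmap("loadshape_0001.f32", dtype=np.float32, mode="r", shape=(npts,))

# 2) Shared memory: one copy of the data, visible to several worker processes
shm = shared_memory.SharedMemory(create=True, size=npts * 4, name="ls_0001")
buf = np.ndarray((npts,), dtype=np.float32, buffer=shm.buf)
buf[:] = 1.0  # fill from any source (HDF5, Parquet, ...)
```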

I'll continue this Thursday or Friday.

PMeira commented 3 years ago

Some other numbers (all based on DSS C-API):

| Test case | 1 process: run-time (relative %) | 1 process: total time (relative %) | 20 processes: run-time (relative %) |
|---|---|---|---|
| Shared memory, dense column-major | 100.0 | 100.0 | 100.0 |
| DSS (Simple two-point loadshapes) | 100.4 | 100.4 | 101.4 |
| Memory-mapped, dense column-major | 101.4 | 101.3 | 99.5 |
| Memory-mapped, dense row-major | 105.0 | 104.9 | 99.4 |
| Chunk per day, column-major | 105.4 | 105.3 | 99.2 |
| DSS (MemoryMapping=No) | 102.6 | 107.8 | - |
| Memory-mapped, individual files | 113.2 | 113.4 | 99.3 |
| Chunk per day, row-major | 118.5 | 118.4 | 100.2 |
| DSS (MemoryMapping=Yes) | 125.5 | 125.5 | 100.8 |

(Times are relative to the first row.) Run-time = total time - preparation time (loading the circuit and other data).

This was run on an older server that was free over the weekend (2x Xeon E5-2630 v4). It looks like the processors are starved in the 20-process case. I'll test on a newer machine (2x Xeon Gold 6230) when it's available next week, and add numbers for some desktop machines as well. The official OpenDSS COM DLL on Windows might have an issue unrelated to the loadshapes; I'll have to investigate and report it on the forums some other day.

The "DSS (MemoryMapping=Yes)" case is probably slower than "Memory-mapped, individual files" because I left it unoptimized on purpose -- there are some trivial optimizations that could be applied, in fact I can remove its code and use the same mechanism of LoadShapes_Set_Points; only the CSV file special case would remain. Besides that, it's generally worse due to the large number of files (2913).

"Chunk per day, column-major" is better than "Chunk per day, row-major" here since the on-disk data is a dense row-major matrix for the latter, without partitions, so it's worse than thousands of files. Curiously "Chunk per day, column-major" is slightly faster on average for the 20-process case, but we can see it doesn't really matter which version is used (except loading all the files 20 times wouldn't work). Since it's also in the middle of the pack for the single process run, I'm basing the "LoadShape manager" prototype on it.

tarekelgindy commented 3 years ago

Hi @PMeira ,

Just thought I'd touch base on this. Did you need any help with the integration at all? Thanks again for all your hard work on this - I really appreciate all the time you've been putting into the updates and baselining them!

PMeira commented 3 years ago

@tarekelgindy Just need to finalize the design. The very basic approach is easy to integrate, but a more versatile version would need more work. I'm probably overthinking, so I'll try to provide a full implementation (and test results) of the basic approach with HDF/Parquet this week so that you're able to provide some feedback.

Other news:


When running single processes, all column-major approaches are noticeably better. Even considering only the run-time, they can be faster (up to 20%) than the traditional approaches. That seems to extend to multiple processes on the Ryzen machine. For the last machine, with 8 GB of RAM, older OpenDSS or DSS C-API versions wouldn't be able to run properly, since using 64-bit floats would exceed those 8 GB.

I might add results for a Raspberry Pi 4 later for completeness, but the general observations across machines/OS have been consistent so far.

| 2x Intel Xeon Gold 6230, Linux, 512 GB / Test case | 1 process: run-time (relative %) | 1 process: total time (relative %) | 40 processes: run-time (relative %) | 20 processes: run-time (relative %) |
|---|---|---|---|---|
| Shared memory, dense column-major | 100.0 | 100.0 | 100.0 | 100.0 |
| DSS (Simple two-point loadshapes) | 102.9 | 102.9 | 100.9 | 102.3 |
| Memory-mapped, dense column-major | 102.7 | 102.7 | 99.7 | 99.6 |
| Memory-mapped, dense row-major | 109.5 | 109.5 | 100.8 | 106.8 |
| Chunk per day, column-major | 99.9 | 99.9 | 100.2 | 101.8 |
| DSS (MemoryMapping=No) | 106.6 | 113.0 | - | - |
| Memory-mapped, individual files | 116.4 | 116.6 | 100.9 | 107.7 |
| Chunk per day, row-major | 109.9 | 109.9 | 102.4 | 105.5 |
| DSS (MemoryMapping=Yes) | 116.3 | 116.2 | 102.8 | 110.5 |

| AMD Ryzen 5 3600, 32 GB, Windows / Test case | 1 process: run-time (relative %) | 1 process: total time (relative %) | 10 processes: run-time (relative %) | 5 processes: run-time (relative %) |
|---|---|---|---|---|
| Shared memory, dense column-major | 100.0 | 100.0 | 100.0 | 100.0 |
| DSS (Simple two-point loadshapes) | 101.4 | 101.3 | 101.5 | 99.5 |
| Memory-mapped, dense column-major | 103.8 | 103.8 | 100.7 | 99.5 |
| Memory-mapped, dense row-major | 117.2 | 117.2 | 104.7 | 112.3 |
| Chunk per day, column-major | 100.1 | 100.1 | 101.6 | 100.7 |
| DSS (MemoryMapping=No) | 122.5 | 127.0 | - | - |
| Memory-mapped, individual files | 120.7 | 120.9 | 104.0 | 111.8 |
| Chunk per day, row-major | 107.1 | 107.1 | 102.8 | 104.2 |
| DSS (MemoryMapping=Yes) | 123.1 | 123.2 | 106.6 | 113.4 |

| Intel i5-4460S, 8 GB, Windows / Test case | 1 process: run-time (relative %) | 1 process: total time (relative %) | 3 processes: run-time (relative %) |
|---|---|---|---|
| Shared memory, dense column-major | 100.0 | 100.0 | 100.0 |
| DSS (Simple two-point loadshapes) | 102.0 | 102.0 | 101.2 |
| Memory-mapped, dense column-major | 99.9 | 99.8 | 99.2 |
| Memory-mapped, dense row-major | 114.8 | 114.7 | 105.4 |
| Chunk per day, column-major | 101.3 | 101.2 | 100.0 |
| DSS (MemoryMapping=No) | 116.3 | 238.9 | - |
| Memory-mapped, individual files | 115.1 | 115.4 | 106.9 |
| Chunk per day, row-major | 111.1 | 111.0 | 104.7 |
| DSS (MemoryMapping=Yes) | 116.1 | 116.2 | 106.1 |

tarekelgindy commented 3 years ago

Hi Paulo - thanks for all the work on this! Just checking - was this on a branch that you have active at the moment?

I've been doing lots of runs with opendssdirect.py where I read base models with no loadshapes attached and then set the kW and kvar values in my own code. I use Python's multiprocessing to read Parquet load files into memory in parallel and then set the values using the opendssdirect functions, which makes it very fast. If you like, I can do some time and memory comparisons to see how it compares. I'll also be dropping some big datasets soon that might be good for testing some of this work, if that helps.
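
For context, that pattern is roughly the following (a simplified sketch, not the actual code; the Parquet layout, column names, and process count are assumptions):

```python
# Simplified sketch: read Parquet files in parallel, set kW/kvar per load.
from multiprocessing import Pool

import opendssdirect as dss
import pyarrow.parquet as pq


def read_one(path):
    # worker process: parse one Parquet file into {load_name: (kW, kvar)}
    t = pq.read_table(path)
    return dict(zip(t.column("load").to_pylist(),
                    zip(t.column("kw").to_pylist(), t.column("kvar").to_pylist())))


def set_loads(values):
    # main process, next to the DSS engine
    for name, (kw, kvar) in values.items():
        dss.Loads.Name(name)
        dss.Loads.kW(kw)
        dss.Loads.kvar(kvar)


if __name__ == "__main__":
    paths = [f"loads_t{t:04d}.parquet" for t in range(96)]  # e.g. one file per time step
    with Pool(8) as pool:
        for values in pool.imap(read_one, paths):
            set_loads(values)
            dss.Solution.Solve()
```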

PMeira commented 3 years ago

I had to leave this for a bit, but I'll probably be able to resume work this Friday. I think I did push most of the code, but maybe not for DSS Python.

I use python's multiprocessing to read parquet load files into memory in parallel, and then set the values using the opendssdirect functions, which makes it very fast.

If you use PyArrow, the load performance should be very close. Setting kW and kvar for each load is not ideal, though; the Python API overhead is probably significant.

It seems a new OpenDSS version will finally be released, so I can also pick up some of their more recent changes: https://sourceforge.net/p/electricdss/code/3160/