dss-extensions / OpenDSSDirect.py

OpenDSSDirect.py: a cross-platform Python package that implements a native/direct library interface to the alternative OpenDSS engine from DSS-Extensions.org
https://dss-extensions.org/OpenDSSDirect.py/

Integration of hdf/parquet loadshapes #98

Open tarekelgindy opened 3 years ago

tarekelgindy commented 3 years ago

Feature Request

Following up on the SourceForge discussion here: https://sourceforge.net/p/electricdss/discussion/861976/thread/59230a2c2d/

This is regarding support for reading loadshape information in HDF/Parquet format in the DSS-Extensions suite. My understanding is that the existing memory allocation mechanisms (such as those here https://github.com/dss-extensions/dss_capi/blob/master/src/CAPI/CAPI_LoadShapes.pas#L453) could be leveraged to stream data from HDF and Parquet files, and that there is already some existing code which could be backported to support this.

I'd be very supportive of any efforts to integrate this workflow into opendssdirect.py and opendssdirect.jl

Furthermore, let me know if there is any interest in allowing multiple loadshapes to be read from a single HDF or Parquet file. This could significantly improve the performance of any HDF/Parquet reader.
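
As one possible illustration of that idea (not an existing feature; the file name, column names, and dtype below are arbitrary), a single wide Parquet file with one float32 column per loadshape can be written and read back in one shot with PyArrow:

```python
# Hypothetical layout: one wide Parquet file, one float32 column per loadshape.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

npts, nshapes = 35040, 100  # e.g. 1 year at 15-minute resolution

# Write: one column per shape
table = pa.table({f"shape_{i:04d}": np.random.rand(npts).astype(np.float32)
                  for i in range(nshapes)})
pq.write_table(table, "loadshapes.parquet")

# Read: a single file open/parse yields all shapes at once
table = pq.read_table("loadshapes.parquet")
pmult = {name: table[name].to_numpy() for name in table.column_names}
```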

Happy to move this issue to the dss_capi if it makes more sense for it to live there.

PMeira commented 3 years ago

The API for external memory is exposed basically through LoadShapes_Set_Points and LoadShapes_SetMaxPandQ. If everything is initialized correctly, LoadShapes_SetMaxPandQ is not really required for it to work.

As I mentioned on sf.net, here's a link to the current header docs:

https://github.com/dss-extensions/dss_capi/blob/0ef72db7eb046ad5d7a3731ff5ea33b8089b728c/include/dss_capi.h#L6099-L6111
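
For illustration only, a rough ctypes sketch of how that call could be driven from Python. The argument list shown is an assumption based on the header linked above and must be checked against the actual prototype; the library name, the `shape1` loadshape, and the buffer contents are all placeholders.

```python
# Rough sketch only: the prototype assumed here is
#   LoadShapes_Set_Points(Npts, HoursPtr, PMultPtr, QMultPtr, ExternalMemory, IsFloat32)
# which must be checked against the dss_capi header linked above (later versions
# also add a Stride argument, discussed further down in this thread).
import ctypes
import numpy as np

lib = ctypes.CDLL("libdss_capi.so")   # library name/path is platform dependent
# (engine/circuit initialization omitted)

npts = 35040
pmult = np.random.rand(npts).astype(np.float32)  # must stay alive while the engine uses it

# Create an empty loadshape and make it the active one
lib.Text_Set_Command(b"new LoadShape.shape1 npts=35040 minterval=15")
lib.LoadShapes_Set_Name(b"shape1")

# Point the active loadshape at the external float32 buffer (no copy)
lib.LoadShapes_Set_Points(
    ctypes.c_int32(npts),
    None,                                   # HoursPtr: fixed interval, no hour column
    pmult.ctypes.data_as(ctypes.c_void_p),  # PMultPtr
    None,                                   # QMultPtr
    ctypes.c_uint16(1),                     # ExternalMemory
    ctypes.c_uint16(1),                     # IsFloat32
)
```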

Furthermore let me know if there is any interest in allowing for reading multiple loadshapes from a single hdf

For HDF5, that's how our internal implementation handles things. We have tens of thousands of loadshapes compressed in some HDF files, with one-second resolution for a full year, partitioned into weeks. The process is basically as follows:

For this chunked approach, we also manage the time externally, especially since we use a variable time-step and custom controls. It's been a while since I wrote that. I can check the details later to see if I missed anything important.
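
A rough, hypothetical outline of that kind of week-chunked flow, for illustration only (this is not the Unicamp implementation; the HDF5 layout, dataset names, and solution handling are all assumptions):

```python
# Hypothetical sketch of a week-chunked HDF5 flow. Assumed layout: one float32
# dataset per week, shaped (points_in_week, n_shapes), plus a byte-string
# dataset holding the loadshape names.
import h5py
import opendssdirect as dss

with h5py.File("loadshapes_by_week.h5", "r") as f:
    shape_names = [n.decode() for n in f["shape_names"][:]]
    for week in range(52):
        chunk = f[f"week_{week:02d}"][:]          # (points_in_week, n_shapes)
        npts = chunk.shape[0]
        for j, name in enumerate(shape_names):
            dss.LoadShapes.Name(name)             # select an existing LoadShape object
            dss.LoadShapes.Npts(npts)
            dss.LoadShapes.PMult(chunk[:, j].tolist())
        # time is managed externally: solve just this week's steps, then move on
        dss.Solution.Number(npts)
        dss.Solution.Solve()
```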

For Parquet, we can leverage Parquet partitioned datasets. I've been using those via PyArrow and the performance has been great. I haven't used the S3 support though, only files in the local network.
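
For example, with a Hive-partitioned dataset, one partition (say, a week) can be pulled without touching the rest (the "week" partition key and the column name below are assumptions):

```python
# Reading one partition of a Hive-partitioned Parquet dataset with PyArrow.
import pyarrow.dataset as ds

dataset = ds.dataset("loadshapes_dataset/", format="parquet", partitioning="hive")
week3 = dataset.to_table(filter=ds.field("week") == 3)   # only week 3 is read
pmult = week3.column("shape_0001").to_numpy()
```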

So, our implementation (at Unicamp) is not general enough. But since both HDF5 and Parquet/Arrow support mechanisms to add extra metadata, we can use that to complement any non-trivial info.

@tarekelgindy, since you mentioned a resolution of 15 minutes, loading everything at once might not be a problem if you don't have too many different loadshapes in the simulation. Remember that using float32 would halve the memory requirements too.
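
For a rough sense of the scale involved (the shape count below is a hypothetical example):

```python
# 1 year at 15-minute resolution = 35,040 points per loadshape
npts = 365 * 24 * 4          # 35,040
nshapes = 10_000             # hypothetical number of loadshapes
gib = 1024 ** 3
print(npts * nshapes * 8 / gib)  # float64: ~2.6 GiB
print(npts * nshapes * 4 / gib)  # float32: ~1.3 GiB
```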

Happy to move this issue to the dss_capi if it makes more sense for it to live there.

I think we can keep this here, at the moment.

I can see two approaches. For the short term:

For the long term, a "LoadShape Manager" would be ideal.

Besides loadshapes, there are other reasons I want to integrate with Arrow.

Well, those are my main ideas. Any thoughts?

By the way, the presentation I mentioned is in this panel session: https://resourcecenter.ieee-pes.org/conferences/general-meeting/PES_CVS_GM20_0806_4668_SLD.html -- it's "free for Institutional Subscribers".

PMeira commented 3 years ago

I plan to work a bit on this Thursday and Friday. Davis just added the memory-mapped version to the official code, so it will be good to compare them. For convenience: https://github.com/dss-extensions/dss_capi/commit/3c0fa5fc09257503db04fd9eed8fe9eb4a1eac88

I'll save my comments for later, when I have some actual performance numbers.

kdheepak commented 3 years ago

Interesting! Thanks for sharing! I’m also curious about the performance benchmarks. Is there anything I can do to help?

PMeira commented 3 years ago

@kdheepak I can keep you posted. Initially I can reuse some code we use at Unicamp, so it won't be too much work.

Depending on how things go, we can decide whether to explore the results further. EPRI's implementation doesn't handle the main topic of this issue (HDF and Parquet), but it's good for comparison. For reproducibility, using an open dataset of loadshapes and a basic circuit would be better. A couple of different circuit sizes would be good too (the sample circuits from OpenDSS are not that big). If there isn't much to highlight, we can just skip this.

Besides the loadshapes, there are other aspects of HPC and distributed execution that I'd like to evaluate better in the future. We (including @tarekelgindy) could talk about that at a later date, if you're available.

tarekelgindy commented 3 years ago

Sorry for not getting back to this sooner! I think what you outlined in your first post here sounds like a great approach. I'm really glad that we'll be able to chunk the data - I think that will make a big difference.

If you have a pre-release version, I could probably test it with a few of our larger datasets and provide feedback, if that helps. Definitely happy to discuss streamlining this with other HPC workflows as well!

PMeira commented 3 years ago

On the performance issues

The numbers from the document Davis linked surprised me, so I tried to reproduce them. For reference:

When using memory mapping, the time required for uploading a model containing a significant amount of load shapes will be drastically reduced without compromising the simulation performance (depending on the source format). In a model with 2768 buses (3952 nodes) with 724 load shapes in SNG format, loading the model into memory without memory mapping takes about 9388 seconds. Otherwise, by using memory mapping the loading time gets reduced to 760 ms as shown in Figure 36.

That 9388 seconds seemed like a lot. Using the IEEE8500 test case as a base circuit, I generated 2913 random loadshapes with 525600 points each (1 year, 1-minute resolution) as float32 binary files to test the basic performance. Since EPRI's OpenDSS officially only supports Windows, this first loading test (no solution) uses a Windows desktop machine with an SSD and plenty of free RAM. Results so far are below. These times fluctuate based on IO load, caching, etc., and the numbers are for a warmed system. I also checked that the data loaded by DSS C-API was correct.

| `New LoadShape...` command args | DSS C-API 0.10.7 | DSS C-API WIP | OpenDSS 9.2.0.1 | OpenDSS SVN r3131 |
|---|---|---|---|---|
| `sngfile="loadshape.sng"` | DNF | 3 s | DNF | DNF |
| `mult=(sngfile="loadshape.sng")` | 15 s | 9.5 s | 630 s | 630 s |
| `MemoryMapping=Yes mult=(sngfile=loadshape.sng)` | - | 0.3 s | - | 0.3 s |

Why are the first two rows different?

(For brevity, I'll omit numbers for float64 files)

I didn't test CSV/TXT variants of the methods since I firmly believe they shouldn't be used for large-scale circuits/simulations at all.

With the changes, if it's a long simulation, it doesn't really matter which method is used for "legacy" loadshapes. Something like 9388 seconds would of course be inadvisable compared to 3 seconds.

For a final data point of interest, the time to fill the loadshapes via the official OpenDSS COM interface is also very long (win32com or comtypes, both >40 min), while DSS_Python/ODD.py takes around 7.7 s. That 7.7 s was already large enough to justify LoadShapes_Set_Points in the first place.
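
For context, that kind of fill through the regular Python API is essentially a loop of this shape (a sketch, not the benchmark code itself; `get_pmult` stands in for a hypothetical data source):

```python
# Sketch of filling every loadshape through the regular Python API.
import numpy as np
import opendssdirect as dss

def get_pmult(name):
    # hypothetical data source; replace with HDF5/Parquet/NumPy reads
    return np.random.rand(35040).astype(np.float32)

idx = dss.LoadShapes.First()
while idx > 0:
    name = dss.LoadShapes.Name()
    pmult = get_pmult(name)
    dss.LoadShapes.Npts(len(pmult))
    dss.LoadShapes.PMult(list(pmult))
    idx = dss.LoadShapes.Next()
```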

Current progress

I merged (adapted/rewrote) most of the changes related to memory-mapped loadshapes from the official code and started porting it to Linux. I decided to do this work in the 0.10.x branch (hopefully the last major change in that branch), so I had to backport the relevant changes.

So far, the main change is that I added a stride parameter to LoadShapes_Set_Points, and made the relevant arrays 0-based to simplify a lot of things.
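
To illustrate what a stride makes possible (my reading, assuming the stride is counted in elements; verify against the dss_capi docs): with a dense points-by-shapes matrix, each loadshape can be handed to the engine as a strided view of the same buffer, with no copies.

```python
# Strides over a dense float32 matrix of shape (npts, nshapes).
import numpy as np

npts, nshapes = 35040, 100
data = np.random.rand(npts, nshapes).astype(np.float32)   # row-major (C order)

j = 7
col = data[:, j]                                 # view over the same memory, no copy
print(col.strides[0] // data.itemsize)           # element stride == nshapes (100)

data_f = np.asfortranarray(data)                 # column-major: each shape contiguous
print(data_f[:, j].strides[0] // data_f.itemsize)  # element stride == 1
```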

Next step is running some long simulations to assess and document the performance across some variations:

The results will guide the Parquet implementation. I'll get around to this in a few days.

tarekelgindy commented 3 years ago

Thanks for the update, Paulo. It's pretty interesting to see what a difference the memory mapping made in OpenDSS. I'll definitely be using this in future versions that require .csv or .txt file inputs.

PMeira commented 3 years ago

It's pretty interesting to see what a difference the memory mapping made in Opendss.

@tarekelgindy This is expected for this first test -- the engine is not using the loadshape data at all, only getting file handles. But I also expect that it won't affect the overall simulation time that much; the main advantage is the reduced memory load (which is good, of course). It will be interesting to compare Windows vs. Linux too (as a whole, Linux IO is much better and more versatile).

PMeira commented 3 years ago

To add initial info on the timings for loading the circuit, using DSS C-API's LoadShapes_Set_Points from Python and NumPy (np.memmap):

So the extra time for individual files is probably due to the high number of file handles, both from the Python side and in the DSS engine. And a reminder that these numbers include the Python overhead.

What I like about LoadShapes_Set_Points is that it's very versatile. We can use it for memory mapping, shared memory, different memory layouts, chunked data, etc.
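
For example, on the Python side the same call can be fed from either a memory-mapped file or a shared-memory block (a sketch; handing the buffer to the engine is omitted, and the file/segment names are hypothetical):

```python
# Two float32 buffers that could back loadshapes without extra copies.
import numpy as np
from multiprocessing import shared_memory

npts = 525600  # 1 year at 1-minute resolution

# 1) Memory-mapped file: the OS pages the data in on demand
mm = np.memmap("loadshape_0001.f32", dtype=np.float32, mode="r", shape=(npts,))

# 2) Shared memory: one copy of the data, visible to several worker processes
shm = shared_memory.SharedMemory(create=True, size=npts * 4, name="ls_0001")
buf = np.ndarray((npts,), dtype=np.float32, buffer=shm.buf)
buf[:] = 1.0  # fill from any source (HDF5, Parquet, ...)
```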

I'll continue this Thursday or Friday.

PMeira commented 3 years ago

Some other numbers (all based on DSS C-API):

| Test case | 1 process: run-time (relative %) | 1 process: total time (relative %) | 20 processes: run-time (relative %) |
|---|---|---|---|
| Shared memory, dense column-major | 100.0 | 100.0 | 100.0 |
| DSS (Simple two-point loadshapes) | 100.4 | 100.4 | 101.4 |
| Memory-mapped, dense column-major | 101.4 | 101.3 | 99.5 |
| Memory-mapped, dense row-major | 105.0 | 104.9 | 99.4 |
| Chunk per day, column-major | 105.4 | 105.3 | 99.2 |
| DSS (MemoryMapping=No) | 102.6 | 107.8 | - |
| Memory-mapped, individual files | 113.2 | 113.4 | 99.3 |
| Chunk per day, row-major | 118.5 | 118.4 | 100.2 |
| DSS (MemoryMapping=Yes) | 125.5 | 125.5 | 100.8 |

(Times are relative to the first row.) Run-time = total time - preparation time (loading the circuit and other data).

This was run on an older server that was free over the weekend (2x Xeon E5-2630 v4). It looks like the processors are starved in the 20-process case. I'll test on a newer machine (2x Xeon Gold 6230) when it's available next week, and add numbers for some desktop machines as well. The official OpenDSS COM DLL on Windows might have an issue unrelated to the loadshapes; I'll have to investigate and report it on the forums some other day.

The "DSS (MemoryMapping=Yes)" case is probably slower than "Memory-mapped, individual files" because I left it unoptimized on purpose -- there are some trivial optimizations that could be applied, in fact I can remove its code and use the same mechanism of LoadShapes_Set_Points; only the CSV file special case would remain. Besides that, it's generally worse due to the large number of files (2913).

"Chunk per day, column-major" is better than "Chunk per day, row-major" here since the on-disk data is a dense row-major matrix for the latter, without partitions, so it's worse than thousands of files. Curiously "Chunk per day, column-major" is slightly faster on average for the 20-process case, but we can see it doesn't really matter which version is used (except loading all the files 20 times wouldn't work). Since it's also in the middle of the pack for the single process run, I'm basing the "LoadShape manager" prototype on it.

tarekelgindy commented 3 years ago

Hi @PMeira ,

Just thought I'd touch base on this. Did you need any help with the integration at all? Thanks again for all your hard work on this - I really appreciate all the time you've been putting into the updates and baselining them!

PMeira commented 3 years ago

@tarekelgindy Just need to finalize the design. The very basic approach is easy to integrate, but a more versatile version would need more work. I'm probably overthinking, so I'll try to provide a full implementation (and test results) of the basic approach with HDF/Parquet this week so that you're able to provide some feedback.

Other news:


When running single processes, all column-major approaches are noticeably better. Even considering only the run-time, they can be faster (up to 20%) than the traditional approaches. That seems to extend to multiple processes on the Ryzen machine. For the last machine, with 8 GB of RAM, older OpenDSS or DSS C-API versions wouldn't be able to run properly, since using 64-bit floats would exceed those 8 GB.

I might add results for a Raspberry Pi 4 later for completeness, but the general observations across machines/OS have been consistent so far.

| 2x Intel Xeon Gold 6230, Linux, 512 GB / Test case | 1 process: run-time (relative %) | 1 process: total time (relative %) | 40 processes: run-time (relative %) | 20 processes: run-time (relative %) |
|---|---|---|---|---|
| Shared memory, dense column-major | 100.0 | 100.0 | 100.0 | 100.0 |
| DSS (Simple two-point loadshapes) | 102.9 | 102.9 | 100.9 | 102.3 |
| Memory-mapped, dense column-major | 102.7 | 102.7 | 99.7 | 99.6 |
| Memory-mapped, dense row-major | 109.5 | 109.5 | 100.8 | 106.8 |
| Chunk per day, column-major | 99.9 | 99.9 | 100.2 | 101.8 |
| DSS (MemoryMapping=No) | 106.6 | 113.0 | - | - |
| Memory-mapped, individual files | 116.4 | 116.6 | 100.9 | 107.7 |
| Chunk per day, row-major | 109.9 | 109.9 | 102.4 | 105.5 |
| DSS (MemoryMapping=Yes) | 116.3 | 116.2 | 102.8 | 110.5 |

| AMD Ryzen 5 3600, 32 GB, Windows / Test case | 1 process: run-time (relative %) | 1 process: total time (relative %) | 10 processes: run-time (relative %) | 5 processes: run-time (relative %) |
|---|---|---|---|---|
| Shared memory, dense column-major | 100.0 | 100.0 | 100.0 | 100.0 |
| DSS (Simple two-point loadshapes) | 101.4 | 101.3 | 101.5 | 99.5 |
| Memory-mapped, dense column-major | 103.8 | 103.8 | 100.7 | 99.5 |
| Memory-mapped, dense row-major | 117.2 | 117.2 | 104.7 | 112.3 |
| Chunk per day, column-major | 100.1 | 100.1 | 101.6 | 100.7 |
| DSS (MemoryMapping=No) | 122.5 | 127.0 | - | - |
| Memory-mapped, individual files | 120.7 | 120.9 | 104.0 | 111.8 |
| Chunk per day, row-major | 107.1 | 107.1 | 102.8 | 104.2 |
| DSS (MemoryMapping=Yes) | 123.1 | 123.2 | 106.6 | 113.4 |

| Intel i5-4460S, 8 GB, Windows / Test case | 1 process: run-time (relative %) | 1 process: total time (relative %) | 3 processes: run-time (relative %) |
|---|---|---|---|
| Shared memory, dense column-major | 100.0 | 100.0 | 100.0 |
| DSS (Simple two-point loadshapes) | 102.0 | 102.0 | 101.2 |
| Memory-mapped, dense column-major | 99.9 | 99.8 | 99.2 |
| Memory-mapped, dense row-major | 114.8 | 114.7 | 105.4 |
| Chunk per day, column-major | 101.3 | 101.2 | 100.0 |
| DSS (MemoryMapping=No) | 116.3 | 238.9 | - |
| Memory-mapped, individual files | 115.1 | 115.4 | 106.9 |
| Chunk per day, row-major | 111.1 | 111.0 | 104.7 |
| DSS (MemoryMapping=Yes) | 116.1 | 116.2 | 106.1 |

tarekelgindy commented 3 years ago

Hi Paulo - thanks for all the work on this! Just checking - was this on a branch that you have active at the moment?

I've been doing lots of runs with opendssdirect.py where I read base models with no loadshapes attached and then set the kW and kvar values in my own code. I use Python's multiprocessing to read Parquet load files into memory in parallel and then set the values using the opendssdirect functions, which makes it very fast. If you like, I can do some time and memory comparisons to see how it compares. I'll also be dropping some big datasets soon that might be good for testing some of this work, if that helps.
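
For context, that pattern is roughly the following (a simplified sketch, not the actual code; the Parquet layout, column names, and process count are assumptions):

```python
# Simplified sketch: read Parquet files in parallel, set kW/kvar per load.
from multiprocessing import Pool

import opendssdirect as dss
import pyarrow.parquet as pq


def read_one(path):
    # worker process: parse one Parquet file into {load_name: (kW, kvar)}
    t = pq.read_table(path)
    return dict(zip(t.column("load").to_pylist(),
                    zip(t.column("kw").to_pylist(), t.column("kvar").to_pylist())))


def set_loads(values):
    # main process, next to the DSS engine
    for name, (kw, kvar) in values.items():
        dss.Loads.Name(name)
        dss.Loads.kW(kw)
        dss.Loads.kvar(kvar)


if __name__ == "__main__":
    paths = [f"loads_t{t:04d}.parquet" for t in range(96)]  # e.g. one file per time step
    with Pool(8) as pool:
        for values in pool.imap(read_one, paths):
            set_loads(values)
            dss.Solution.Solve()
```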

PMeira commented 3 years ago

I had to leave this for a bit, but I'll probably be able to resume work this Friday. I think I did push most of the code, but maybe not for DSS Python.

I use python's multiprocessing to read parquet load files into memory in parallel, and then set the values using the opendssdirect functions, which makes it very fast.

If you use PyArrow, the load performance should be very close. Setting kW and kvar for each load is not ideal, though; the Python API overhead is probably significant.

It seems a new OpenDSS version will finally be released, so I can also pick up some of their more recent changes: https://sourceforge.net/p/electricdss/code/3160/