OpenWaterAnalytics / EPANET

The Water Distribution System Hydraulic and Water Quality Analysis Toolkit
MIT License
272 stars 201 forks source link

Batch request node and link properties #785

Open AbelHeinsbroek opened 2 months ago

AbelHeinsbroek commented 2 months ago

Hello,

When working with large models, getting all the required parameter values for all nodes and links can be relatively inefficient and time consuming if you make use of the EN_getnodevalue and EN_getlinkvalue methods in a loop for all node and link indices and the parameters you're interested in.

For example, getting all the pressures at all the nodes in a network requires you to make EN_NODECOUNT calls to the EPANET API.

While yes, its possible to write all data to a binary output file and parse that file accordingly, that is still rather inefficient as you by default get all parameters for all nodes as well as links and other unwanted data. In addition, if you're not careful writing a binary file to disk and reopening that file for processing is much slower than reading from memory directly.

Therefore I propose adding two new methods to the EPANET API:

ENgetnodesvalues(EN_Project ph, int property, double *out_values) ENgetlinksvalues(EN_Project ph, int property, double *out_values)

lbutler commented 2 months ago

I believe if we added these new methods it would be the first API that returned an array of data. So we'd want to pick a convention that we could follow for any other future methods.

I had a quick look through the documentation and EN_setvertices and EN_setcurve appear to be the only methods that take in array - but please let me know if I'm wrong and there exists a method that returns an array.

Also just to play devils advocate, but is it worth adding a new method to the library when the end users could also ~10 LoC to their own application to extract results?

You mentioned that there is an inefficiency in that the user would need to call EN_getcount and then run EN_getnodevalue or EN_getlinkevalue multiple times, however by adding ENgetnodesvalues you're not changing the amount of cycles. You would need to get the count to initialize the array and then the library is calling those methods internally.

Saying that, this would be a huge improvement for epanet-js as there is an overhead to go back and forth between WASM and JS, and it becomes apparent when you make multiple calls to extract values, especially when trying to get all the simulation results.

From my testing I could run a function like EN_getnodevalue ~90Mil op/sec in native C on a small linux VM but if I tried to run it in WASM, the best I could achieve was ~3Mil op/sec.

lbutler commented 2 months ago

I dug up an old replit I made to test the performance of the API in C. The model is relatively small, only 2k nodes but the code extracts every simulation property for all nodes and links and it takes 0.0002 seconds.

https://replit.com/@lbutler1/epanet-v22-performance#main.c

Not to say these new functions your proposing aren't worthy, but maybe instead of inefficiency its about improving the convenience of the API

eladsal commented 2 months ago

@lbutler we have EN_setpattern that takes an array.

AbelHeinsbroek commented 2 months ago

Hi Luke,

Agreed, the source of the performance issues lies in the overhead of translating API calls from other languages, i.e. ctypes or wasm bindings, not in the way EPANET handles things. It is more of a convenience method then something that is strictly necessary.

However, I'm fairly confident to assume that the (vast?) majority of users use some kind of scripting language such as MatLAB, Python, Javascript or R to interact with EPANET and to process, analyse and visualise the results, instead of communicating with the API using C.

For our use-case, implementing these convenience methods will help speed up our Python based applications by a factor of ±30!

lbutler commented 2 months ago

@AbelHeinsbroek - Great we're on the same page then, sorry for being a bit pedantic with my response!

I do think adding these methods would be beneficial for many of the downstream libraries that provide EPANET in other programming languages.

Mariosmsk commented 2 months ago

Hi Luke,

Agreed, the source of the performance issues lies in the overhead of translating API calls from other languages, i.e. ctypes or wasm bindings, not in the way EPANET handles things. It is more of a convenience method then something that is strictly necessary.

However, I'm fairly confident to assume that the (vast?) majority of users use some kind of scripting language such as MatLAB, Python, Javascript or R to interact with EPANET and to process, analyse and visualise the results, instead of communicating with the API using C.

For our use-case, implementing these convenience methods will help speed up our Python based applications by a factor of ±30!

Hi @AbelHeinsbroek, I agree with you!

I think we don’t need to add new functions. Instead, we could modify the existing ones, such as EN_getnodevalue, EN_getlinkvalue, and others like EN_getlinktype, EN_getlinkid, EN_getnodeid, etc., to return arrays based on the input indices. This change would also need to maintain backward compatibility.

lbutler commented 2 months ago

@Mariosmsk I guess the hardest part of using the existing functions would be maintaining backwards compatibility. You could in theory pass a number like zero or some other constant in the index to signify you wanted all results.

But I'm not sure if this would make things more complicated then having a second function?

Very rough guess below of how it could be implemented

int DLLEXPORT EN_getnodevalue(EN_Project p, int index, int property, double *value) {
    // Check for all nodes request by special index value
    if (index == 0) { // Using 0 as the magic number to fetch all nodes
        int status = 0;
        for (int i = 1; i <= p->network.Nnodes; i++) {
            double tempValue = 0;
            status = EN_getnodevalue(p, i, property, &tempValue);
            if (status != 0) return status; // Propagate errors immediately
            value[i - 1] = tempValue; // Store each node's value
        }
        return 0;
    }

    // Check for valid arguments
    if (!p->Openflag) return 102;
    if (index < 1 || index > p->network.Nnodes) return 203; // Adjusted to disallow the magic number in normal operations

    // Existing code to handle single node request
    // ... existing switch case logic ...
    return 0; // Return 0 for success
}
LRossman commented 2 months ago

Sharing my 2 cents:

  1. I think the two new functions are worth adding.
  2. The existing EN_getnodevalueand EN_getlinkvaluefunctions should not be modified to return all values based on a special value used for the index passed into them. First principle of good coding is that functions should do only one thing.
  3. @AbelHeinsbroek 's committed code looks fine - nice and simple. (However the name ENgetnodevaluesused in epanet2.h and epanet2.c is different than the EN_getnodesvaluesused in epanet2_2.h and in epanet.c (one has '"node" in it while the other has "nodes"-- they should be the same.)
  4. I prefer EN_getnodevalues ("node" and not "nodes") and EN_getlinkvalues) which is more acceptable English.
  5. While we're at it why not add retrieval functions for computed results that would access the binary output file after an analysis is completed: EN_getnoderesult(EN_project p, int nodevariable, int nodeindex, int timeperiod, double value) EN_getlinkresult(EN_project p, int linkvariable, int linkindex, int timeperiod, double value) EN_getnoderesults(EN_project p, int nodevariable, int nodeindex, double values) EN_getlinkresults(EN_project p, int linkvariable, int linkindex, double values)
AbelHeinsbroek commented 2 months ago

Thanks Lew, I've changed the name of the new functions.

I've done some more testing with the new methods:

from timeit import Timer
import epynet

net = epynet.Network('demo.inp')
net.solve()

l = len(net.nodes)

# get pressure for all nodes in the network

def a():
    pres = []
    for i in range(1,l+1):
        pres.append(net.ep.ENgetnodevalue(i, 11))
    return pres

def b():
    pres = net.ep.ENgetnodevalues(11)[:l]
    return pres

time_a = Timer(lambda: a()).timeit(number=1000)
time_b = Timer(lambda: b()).timeit(number=1000)

print(f"Number of nodes {l}")
print(f"Method result is equal: {a()==b()} ")

print()

print(f"Method a took {time_a:.3f} milliseconds")
print(f"Method b took {time_b:.3f} milliseconds")
print()
print(f"Method b is {(time_a/time_b):.1f} times faster")

Which results in:

Number of nodes 32011
Method result is equal: True 

Method a took 11.019 milliseconds
Method b took 0.683 milliseconds

Method b is 16.1 times faster