JoshCu commented 3 months ago

Update the formulation manager to use boost::regex instead of std.

When matching complex regex for finding forcing files, std::regex becomes a significant portion of Ngen init. Especially with large numbers of catchments per core.

For an extreme example, when running serially for wb-479197 and it's upstreams ~6500 catchments. NGen::init takes ~153s, if I use the line "forcing": {"file_pattern": ".*{{id}}.*.csv" in my realization. removing those wildcards reduces the time to 69 seconds.

For use cases where ngen is being run over a large area, this init time can become a non-insignificant portion of the total runtime, in the serial example, it took 153 seconds to init, 43s to run the models, 19s to route for a 24h simulation.

Changes

std::regex -> boost::regex

Testing

This was all tested on ngiab images using ngen f91e2ea The run package was generated using the ngiab_preprocessor 24h run of ~6500 catchments, sloth + cfe + noaa-owp-modular + troute, subset from wb-479197 on hydrofabric v20.1

Hardware used

Model name: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz Thread(s) per core: 2 Core(s) per socket: 14 Socket(s): 2 CPU max MHz: 3600.0000 CPU min MHz: 1200.0000 RAM 2100mhz DDR4 Storage ~500mbs sata ssd

Realization.json

```json { "global": { "formulations": [ { "name": "bmi_multi", "params": { "name": "bmi_multi", "model_type_name": "bmi_multi", "main_output_variable": "Q_OUT", "forcing_file": "", "init_config": "", "allow_exceed_end_time": true, "modules": [ { "name": "bmi_c++", "params": { "name": "bmi_c++", "model_type_name": "SLOTH", "main_output_variable": "z", "init_config": "/dev/null", "allow_exceed_end_time": true, "fixed_time_step": false, "uses_forcing_file": false, "model_params": { "sloth_ice_fraction_schaake(1,double,m,node)": 0.0, "sloth_ice_fraction_xinanjiang(1,double,1,node)": 0.0, "sloth_soil_moisture_profile(1,double,1,node)": 0.0 }, "library_file": "/dmod/shared_libs/libslothmodel.so", "registration_function": "none" } }, { "name": "bmi_fortran", "params": { "name": "bmi_fortran", "model_type_name": "NoahOWP", "library_file": "/dmod/shared_libs/libsurfacebmi.so", "forcing_file": "", "init_config": "./config/cat_config/NOAH-OWP-M/{{id}}.input", "allow_exceed_end_time": true, "main_output_variable": "QINSUR", "variables_names_map": { "PRCPNONC": "precip_rate", "Q2": "SPFH_2maboveground", "SFCTMP": "TMP_2maboveground", "UU": "UGRD_10maboveground", "VV": "VGRD_10maboveground", "LWDN": "DLWRF_surface", "SOLDN": "DSWRF_surface", "SFCPRS": "PRES_surface" }, "uses_forcing_file": false } }, { "name": "bmi_c", "params": { "name": "bmi_c", "model_type_name": "CFE", "main_output_variable": "Q_OUT", "init_config": "./config/cat_config/CFE/{{id}}.ini", "allow_exceed_end_time": true, "fixed_time_step": false, "uses_forcing_file": false, "registration_function": "register_bmi_cfe", "variables_names_map": { "water_potential_evaporation_flux": "EVAPOTRANS", "atmosphere_water__liquid_equivalent_precipitation_rate": "QINSUR", "ice_fraction_schaake": "sloth_ice_fraction_schaake", "ice_fraction_xinanjiang": "sloth_ice_fraction_xinanjiang", "soil_moisture_profile": "sloth_soil_moisture_profile" }, "library_file": "/dmod/shared_libs/libcfebmi.so.1.0.0" } } ], "uses_forcing_file": false } } ], "forcing": { "file_pattern": "{{id}}.csv", "path": "./forcings/by_catchment/", "provider": "CsvPerFeature" } }, "time": { "start_time": "2010-01-01 00:00:00", "end_time": "2010-01-02 00:00:00", "output_interval": 3600, "nts": 288.0 }, "routing": { "t_route_config_file_with_path": "/ngen/ngen/data/config/ngen.yaml" }, "output_root": "/ngen/ngen/data/outputs/ngen" } ```

boost+exact first checks if file_pattern matched the file exactly, then runs regex if it does not match

exact path is where the "path" formulation variable accepts the {{id}} placeholder allowing for file_pattern to be omitted and no regex being run.

Serial

library	.{{id}}..cat	{{id}}.cat
std	153s	69s
boost	62s	55s
re2	37s	36s
exact path	NA	27s

10 mpi ranks

library	{{id}}.cat
std	8.6s
boost	7.4s
re2	5.2s
exact path	4.2s

56 mpi ranks

library	{{id}}.cat
std	4.8s
boost	4.5s
re2	3.9s
exact path	3.2s

Screenshots

interactive version of the graph here

std, boost, re2, on 10 cores

![image](https://github.com/NOAA-OWP/ngen/assets/26720477/a5916aba-5669-4c39-8fa3-dc938c7cdfee)

perf Flamegraph of unmodified fomulation manager running serially with double wildcards

![kernel](https://github.com/NOAA-OWP/ngen/assets/26720477/7d892169-4817-4942-9598-c7496c88ab6b)

Notes

re2 is faster, but I don't know if it's worth adding the dependency when we're already using boost
If it's worth it, I can open a PR for a version using re2
I'm not sure how common these wildcarded regex use cases are, or how many people would be running serially for large numbers of catchments
read.cpp in the geopackage code also uses std::regex, but it's called so infrequently that the performance benefit is negligible. I left it unchanged to reduce code modification, although I'm unsure if that's the correct decision.
changing {{id}}.cat to {{id}}\.cat reduces all times further, but I noticed the lack of escapement after I'd finished all my testing
I can upload the various docker images used to test to dockerhub if needed

Future work

I wanted to keep changes minimal because I'm not too familiar with the rest of ngen, but would it be worth modifying the formulation manager further to optionally disable regex completely? For my use-case, the forcing files are just named cat-1234.csv and there is one per catchment so I don't need to use regex at all after the formulation manager replaces the {{id}} placeholder.

PhilMiller commented 3 months ago

Hi Josh, and thanks for noticing the issue, looking into it, and proposing this fix.

We'll need to check whether boost::regex needs to be built, installed, and linked against, or is effectively 'header-only'. Everything else we use from Boost is in the latter category right now, so there may be some resistance to changing that. This sort of performance improvement may be worth it, though.

We'd definitely like to bring down excessive run times like you've observed. Would you be open to making some other changes and testing them, since you've already got this teed up?

JoshCu commented 3 months ago

Hi Josh, and thanks for noticing the issue, looking into it, and proposing this fix.

We'll need to check whether boost::regex needs to be built, installed, and linked against, or is effectively 'header-only'. Everything else we use from Boost is in the latter category right now, so there may be some resistance to changing that. This sort of performance improvement may be worth it, though.

We'd definitely like to bring down excessive run times like you've observed. Would you be open to making some other changes and testing them, since you've already got this teed up?

Yeah of course! Happy to do any additional testing, just let me know what I should try.

As for boost regex being header only, I don't think it needs to be built? the dockerfile I'm using just downloads boost.bzip, unzips it, then adds the folder to path before building ngen. I don't have to make install it or anything if that answers the question?

my docker patch is just two sed commands with no additional changes to how ngen's built

RUN sed -i 's|#include <regex>|#include <boost/regex.hpp>|g' include/realizations/catchment/Formulation_Manager.hpp
RUN sed -i 's/std::regex/boost::regex/g' include/realizations/catchment/Formulation_Manager.hpp

the dockerfile is this one if it's of any use/interest :)

JoshCu commented 3 months ago

@PhilMiller What about changing path property to also accept {{id}}? Changing this

if(forcing_prop_map.count("path") != 0){
    path = forcing_prop_map.at("path").as_string();
}

to

if(forcing_prop_map.count("path") != 0){
    path = forcing_prop_map.at("path").as_string();
    int id_index = path.find("{{id}}");
    if (id_index != std::string::npos) {
        path = path.replace(id_index, sizeof("{{id}}") - 1, identifier);
    }
}

means that I can put the {{id}} in the path, then remove file_pattern from my realization so that it hits the early return before any regex is used. Testing it gives me 27s in serial and 3.2s in parallel 56 cores

PhilMiller commented 3 months ago

The idea for {{id}} substitution in path sounds good to me. I don't know job setups too well, though, so I'll ask Bobby and/or Austin to comment on that.

Meanwhile, I added lines in the PR description's timing tables for "boost+exact" describing the code as amended to check against literal filepattern. Could you please put up numbers for that the current PR code?

PhilMiller commented 3 months ago

If you're feeling diligent, std+exact might be helpful too, if we conclude that we don't want to add the dependency on Boost regex.

aaraney commented 3 months ago

@PhilMiller, thanks for looping me in. I don't foresee this being an issue. @robertbartel, do you feel differently?

robertbartel commented 3 months ago

Well, according to this, Boost.Regex is not header-only (someone sanity check me to make sure I'm not missing something).

I don't see a problem with adding support for {{id}} replacement within forcing.path. It seems to me like something worth doing anyway. We don't organize data like this in DMOD right now, though we could consider adapting to it, and I could easily see some users naturally wanting (i.e., regardless of performance implications) to arrange forcings files and other data like BMI configs together in catchment-specific directories.

JoshCu commented 3 months ago

I was surprised the timings with the filepattern exact check before attempting the regex weren't measurably faster. Without attaching a debugger I'd guess that the filepattern match only returns true at most once per pattern, but still runs the regex against every other non-exact file match. So in this example it would only prevent regex being used 1/6500th of the time? I'll rerun to confirm and update the tables

PhilMiller commented 3 months ago

Oh, yeah, if we're going to try for exact match, we should just try opening the specific file, rather than testing against the entire enumerated contents of the directory.

JoshCu commented 1 month ago

@PhilMiller If this is still of interest, what are the next steps?

Options

Add boost regex?
Add {{id}} substitution to the path variable?
Try to open the file, if that fails, then begin regex matching?

Any combination of these work, but 2 might not be needed with 3 as it is only marginally quicker than 3

1 | This is the smallest code change, a moderate speed improvement, ~but boost.regex is not header only~ https://github.com/NOAA-OWP/ngen/pull/843#issuecomment-2315810082

2 | This is the fastest, slightly larger change, but doesn't verify that the file exists before accepting it

3 | Slightly slower than 2, much larger change, seems to be a good option assuming my code is ok?

I'm happy to redo any changes and put then in a new pull request if needed :)

program-- commented 1 month ago

As a note, boost.regex is header-only in the general case (at least as of 1.79), as noted here. There is a build requirement for C++03 and below, and when using ICU support, but I don't think that is needed in this PR unless I'm missing something.

edit: looks like in 1.76 boost.regex changed to being header-only.

Regex:

Regex is now header only except in C++03 mode.

Support for C++03 is now deprecated.

The library can now be used "standalone" without the rest of Boost being present.

JoshCu commented 1 month ago

Is that why I didn't have to do anything other than building boost for the regex change to work? The highest version in rocky 9 epel is/was 1.75 so I had to build 1.79 from source anyway https://github.com/JoshCu/NGIAB-CloudInfra/blob/7cef18572883f4b9ec04471b650a87f860e488e3/docker/Dockerfile#L28

program-- commented 1 month ago

Is that why I didn't have to do anything other than building boost for the regex change to work? ...

Yeah that'd be my assumption. Another verification that boost.regex doesn't need any additional building: our CI builds the changes successfully, and we only link to the Boost::boost/Boost::headers CMake target https://github.com/NOAA-OWP/ngen/actions/runs/9687130620/job/26813637727?pr=843#step:3:594

For older boost versions, boost.regex requires the regex component when calling find_package(Boost ...) to include linking to the compiled code AFAIK, otherwise it's only headers.

edit: to summarize, I think we are good to use boost.regex without worrying about extra dependencies.

robertbartel commented 1 month ago

As a note, boost.regex is header-only in the general case (at least as of 1.79), as noted here. There is a build requirement for C++03 and below, and when using ICU support, but I don't think that is needed in this PR unless I'm missing something.

Thanks, @program--. Looks like they (still) haven't updated Boost's Getting Started page to clearly reflect that and I didn't look hard enough earlier. If what's stated here in the dependency doc is still correct, a non-header-only scenario for this shouldn't be possible.

A bit of an aside: I wanted to quickly see if any other Boost libraries listed as "must be built" didn't really belong there, so I took a brief look at Boost.Chrono (the first one in that list). Besides details on Boost.Chrono directly, I came across this interesting line about its Boost.System dependency:

Boost.System has an undocumented feature (use of macro BOOST_ERROR_CODE_HEADER_ONLY) to make it header only.

So, buyer beware on any of the Getting Started "must be built" libraries.

NOAA-OWP / ngen

Update Formulation_Manager.hpp to use boost regex #843

Update the formulation manager to use boost::regex instead of std.

Changes

Testing

Serial

10 mpi ranks

56 mpi ranks

Screenshots

Notes

Future work

Options