AlabamaWaterInstitute / CloudInfra

NextGen In A Box: NextGen Framework National Water Model Community Release

ngen-parallel mode is not working. #24

Closed arpita0911patel closed 1 year ago

arpita0911patel commented 1 year ago

Current behavior

When I run the model with the "AWI_16_680661_001" data and select the parallel run option, it throws the following error: "Missing required argument for partition file path."

Expected behavior

The run should create the output files.

Steps to replicate behavior (include URLs)

apatel54@UA-W2RP43G:~/Desktop/ngen_data|⇒ docker run --rm -it -v "$(pwd)"/AWI_16_680661_001:/ngen/AWI_001 --platform=linux/amd64 awiciroh/ciroh-ngen-image:latest
Working directory is /ngen
Found these Catchment files:
/ngen/ngen/data/catchment_data.geojson
/ngen/ngen/data/catchment_data_test1.geojson
/ngen/ngen/extern/cfe/cfe/params/data/hydrofabrics/releases/beta/01a/catchments.geojson
/ngen/ngen/extern/topmodel/topmodel/params/data/hydrofabrics/releases/beta/01a/catchments.geojson
/ngen/AWI_001/config/catchments.geojson
Found these Nexus files:
/ngen/ngen/data/nexus_data.geojson
/ngen/AWI_001/config/nexus.geojson
Found these Realization files:
/ngen/ngen/data/example_bmi_multi_realization_config_w_routing.json
/ngen/ngen/data/example_bmi_multi_realization_config_w_noah_pet_cfe.json
/ngen/ngen/data/example_bmi_multi_realization_configmacos.json
/ngen/ngen/data/test_bmi_multi_realization_config_w_netcdf.json
/ngen/ngen/data/lstm/example_lstm_realization_config.json
/ngen/ngen/data/test_realization_config.json
/ngen/ngen/data/example_realization_config.json
/ngen/ngen/data/test_bmi_multi_realization_config.json
/ngen/ngen/data/example_bmi_multi_realization_config.json
/ngen/ngen/data/example_realization_config_w_bmi_clin_mac.json
/ngen/ngen/data/example_bmi_multi_realization_config_w_netcdf.json
/ngen/ngen/data/test_bmi_multi_realization_config_w_noah_pet_cfe.json
/ngen/ngen/extern/sloth/test/data/sloth_cfe_realization.json
/ngen/ngen/extern/sloth/test/data/sloth_realization.json
/ngen/ngen/extern/SoilFreezeThaw/SoilFreezeThaw/configs/realization_config_multi_linux.json
/ngen/ngen/extern/SoilFreezeThaw/SoilFreezeThaw/configs/realization_config_multi_macos.json
/ngen/ngen/extern/SoilFreezeThaw/SoilFreezeThaw/configs/realization_config_standalone_linux.json
/ngen/ngen/extern/SoilFreezeThaw/SoilFreezeThaw/configs/realization_config_standalone_macos.json
/ngen/ngen/extern/SoilMoistureProfiles/SoilMoistureProfiles/config/realization_config_smp_macos.json
/ngen/ngen/extern/SoilMoistureProfiles/SoilMoistureProfiles/config/realization_config_smp_linux.json
/ngen/AWI_001/config/awi_simplified_realization.json
1) ngen-parallel
2) ngen-serial
3) bash

? 1

Enter the hydrofabric catchment file path: /ngen/AWI_001/config/catchments.geojson
/ngen/AWI_001/config/catchments.geojson selected
Enter the hydrofabric nexus file path: /ngen/AWI_001/config/nexus.geojson
/ngen/AWI_001/config/nexus.geojson selected
Enter the Realization file path: /ngen/AWI_001/config/awi_simplified_realization.json
/ngen/AWI_001/config/awi_simplified_realization.json selected

Your NGEN run command is ngen-parallel /ngen/AWI_001/config/catchments.geojson "" /ngen/AWI_001/config/nexus.geojson "" /ngen/AWI_001/config/awi_simplified_realization.json
Copy and paste it into the terminal to run your model.
The tested model is /dmod/bin/ngen-serial /ngen/data/catchment_data.geojson /ngen/data/nexus_data.geojson /ngen/ngen/data/example_realization_config.json
If your model didn't run, or encountered an error, try checking the Forcings paths in the Realizations file you selected.

Your model run is beginning!

NGen Framework 0.1.0
Missing required argument for partition file path.

ZacharyWills commented 1 year ago

Hey Arpita!

In the last PR: https://github.com/AlabamaWaterInstitute/CloudInfra/pull/23/files

I fixed the compilation of the partition generator (which is its own binary).


This "cuts" NGEN to allow for parallelism in a manner that's hydrologically consistent, and ngen-parallel doesn't work without the partition file that it generates. That partition file is passed as an additional argument at the end of the command.

So in this case, because the partition file was never generated, the error you got is the model framework refusing to parallelize without respecting the hydrology.

ZacharyWills commented 1 year ago

The partition generator binary is copied to the /dmod/bin directory in the image.
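
For reference, the two-step flow described above might look roughly like the sketch below. It only prints the commands rather than running them. The paths, the partition count of 5, and the partition file name are the ones used elsewhere in this thread; the variable and function names are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only: prints the two commands instead of executing them.
# Paths and the partition count come from this thread;
# the variable and function names are hypothetical.

CONFIG_DIR=/ngen/AWI_001/config
BIN_DIR=/dmod/bin
PARTITION_FILE=AWI_001_partition_file
NUM_PARTITIONS=5

# Step 1: cut the hydrofabric into partitions and write the partition file.
partition_cmd() {
  echo "$BIN_DIR/partitionGenerator $CONFIG_DIR/catchments.geojson $CONFIG_DIR/nexus.geojson $PARTITION_FILE $NUM_PARTITIONS '' ''"
}

# Step 2: the usual ngen-parallel command, with the partition file
# appended as the last argument.
parallel_cmd() {
  echo "$BIN_DIR/ngen-parallel $CONFIG_DIR/catchments.geojson \"\" $CONFIG_DIR/nexus.geojson \"\" $CONFIG_DIR/awi_simplified_realization.json $PARTITION_FILE"
}

partition_cmd
parallel_cmd
```

Because the partition file name here is a relative path, both commands would need to be run from the same directory for the second one to find the file the first one wrote.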

arpita0911patel commented 1 year ago

Tried running the latest image that has the fix:

apatel54@UA-W2RP43G:~/Desktop/ngen_data|⇒ docker run --rm -it -v "$(pwd)"/AWI_16_680661_001:/ngen/AWI_001 --platform=linux/amd64 awiciroh/ciroh-ngen-image:latest

bash-4.4# cd /dmod/bin/

bash-4.4# ./partitionGenerator /ngen/AWI_001/config/catchments.geojson /ngen/AWI_001/config/nexus.geojson AWI_001_partition_file 5 '' ''
Partitioning 210 catchments into 5 partitions.
Validating catchments...
Number of catchments is: 210
Catchment validation completed
Found 9 remotes in partition 0
Found 13 remotes in partition 1
Found 2 remotes in partition 2
Found 12 remotes in partition 3
Found 10 remotes in partition 4
Found 46 total remotes (average of approximately 9 remotes per partition)

bash-4.4# ./ngen-parallel /ngen/AWI_001/config/catchments.geojson "" /ngen/AWI_001/config/nexus.geojson "" /ngen/AWI_001/config/awi_simplified_realization.json AWI_001_partition_file
NGen Framework 0.1.0
Building Nexus collection
file read success
file_path: AWI_001_partition_file
root_tree: 1
Building Catchment collection
terminate called after throwing an instance of 'std::runtime_error'
what(): Can't init CFE; unreadable shared library file './extern/cfe/cmake_build/libcfebmi.so.1.0.0'
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted

Seeing this error.

Arpita
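
One thing that may be worth checking, as a guess rather than a confirmed diagnosis: the library path in the error is relative ('./extern/...'), so it is resolved against whatever directory ngen-parallel was started from, which was /dmod/bin in the run above. The sketch below tests whether that path is readable from a few candidate directories. The helper name check_lib is hypothetical, and the directories are simply ones that appear in this thread; whether any of them actually contains the library is not known from the output above.

```shell
#!/bin/sh
# Sketch only: report whether a relative library path is readable
# when resolved from a given directory. check_lib is a hypothetical helper.

check_lib() {
  dir=$1
  lib=$2
  # Run the check in a subshell so the caller's working directory is unchanged.
  if (cd "$dir" 2>/dev/null && [ -r "$lib" ]); then
    echo "readable from $dir"
  else
    echo "not readable from $dir"
  fi
}

# Relative path copied from the error message in this thread.
LIB=./extern/cfe/cmake_build/libcfebmi.so.1.0.0

check_lib /dmod/bin "$LIB"    # where the command was run in this thread
check_lib /ngen "$LIB"        # the image's reported working directory
check_lib /ngen/ngen "$LIB"   # the directory whose extern/ subtree appears in the file listing
```

If one of these reports the path as readable, starting ngen-parallel from that directory (and giving the partition file by a path that still resolves from there) would be a reasonable next experiment.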