Deltares / HYDROLIB-core

Core code around the I/O of the DHYDRO-suite
https://deltares.github.io/HYDROLIB-core/
MIT License
25 stars 4 forks source link

dimr_config for parallel run is incorrect #562

Closed veenstrajelmer closed 5 months ago

veenstrajelmer commented 1 year ago

Describe the bug When generating a dimr_config.xml with hydrolib for an MPI run, we need to pass the process and mpiCommunicator arguments. The expected dtype for process is int, but this should not be directly passed but converted to a space-separated range. For 4 cores, the entry in the dimr should be <process>0 1 2 3</process>, but instead it reads <process>4</process>.

To Reproduce The file resulting from the code below contains an invalid <process> entry, which makes the model run crash:

from hydrolib.core.dimr.models import DIMR, FMComponent
fm_comp = FMComponent(name="test", workingDir='.', inputfile='test.mdu',
                      process=4, mpiCommunicator="DFM_COMM_DFMWORLD")
dimr = DIMR(component=fm_comp)
dimr.save(filepath='dimr_config.xml')

Expected behavior <process>0 1 2 3</process> in the dimr_config.xml. It might be necessary to align processor/core numbers between different components of the DIMR file, but that is currently unsure. This must also be fixed for reading the xml file

Version info (please complete the following information):

Additional request Since the mpiCommunicator is always the same ("DFM_COMM_DFMWORLD"), we could also consider to always write it to the dimr_config if the amount of processes is >1. One argument less, always nice.

tim-vd-aardweg commented 7 months ago

So, just to be sure. If the processes = 5, we should write: <process>0 1 2 3 4</process>?

Should we still support the files that we have written incorrectly? In other words, should we still be able to parse: <process>4</process>?

@veenstrajelmer & @rhutten is this correct?

MRVermeulenDeltares commented 7 months ago

Currently I have assumed the following:

This change is only for FM and not RR.

When 0 is input for process the dimr_config.xml will not have added. --> correct

When 1 is input for process the dimr_config.xml will have 0 added. --> @rhutten will verify this with the Rekenhart team.

open questions: What should happen if an user inputs something different than a int? e.g.

    component = FMComponent(
        name="test",
        workingDir=".",
        inputfile="test.mdu",
        process="Four",
        mpiCommunicator="DFM_COMM_DFMWORLD",
    )

or

    component = FMComponent(
        name="test",
        workingDir=".",
        inputfile="test.mdu",
        process=4.0,
        mpiCommunicator="DFM_COMM_DFMWORLD",
    )

Do we want to throw an exception or do we want to (silently) leave the process out? --> an exception should be thrown.

open question for a followup issue: Currently the abstract class component has process implemented. For the FM component the changes have been applied in the specified branch. For the RR component process can be given, but is not expected. This will still result in an output in the .xml file, should we handle this is some way? ---> see #624

    component = RRComponent(
        name="test",
        workingDir=".",
        inputfile="test.mdu",
        process=4,
        mpiCommunicator="DFM_COMM_DFMWORLD",
    )

e5poFbAFEW

rhutten commented 7 months ago

If process = 0 or 1 is selected, no process keyword should be written to dimr_config.xml

MRVermeulenDeltares commented 7 months ago

Discussed with @rhutten: Should we still support the files that we have written incorrectly? In other words, should we still be able to parse: <process>4</process>? --> No, but a warning should be given.

When 0 or 1 is input for process the dimr_config.xml will not have added --> No process will be added, a message will be given to inform te user "Note that for process 0 or 1 no sequential error will be given."

rhutten commented 7 months ago

Correction on the last message, for process only 1 can be added. Please give the warning specified above. For 0-process is not possible and an error should be given. Error message: the keyword process can not be 0, please specify values 1 or greater.

MRVermeulenDeltares commented 6 months ago

Discussed with @Tim & @Jelmer, conclusion:

acceptance criteria:

veenstrajelmer commented 6 months ago

One more addition, according to the manual this is also supported: <process>0:4</process> Might be good to also support for this notation in hydrolib-core, maybe as part of a separate issue if not convenient now.

MRVermeulenDeltares commented 6 months ago

@veenstrajelmer, currently added the support for 0:4, the assumption is made that this is the same as <process>0 1 2 3 4</process>. The last value is used to determine the process value. when writing it will thus be written as <process>0 1 2 3 4</process>. This means, <process>2:4</process> will be assumed as process = 5, and will be written as <process>0 1 2 3 4</process>.

Further support for this can be done in a seperate issue. Acceptance criteria for followup-issue is to be refined. I will need to discuss the current additional support with @rhutten.

veenstrajelmer commented 6 months ago

@MRVermeulenDeltares: <process>2:4</process> would (I think) translate to <process>2 3 4</process> and therfore as FMComponent(process=3), when writing it again, it will result in <process>0 1 2</process> I guess.

MRVermeulenDeltares commented 6 months ago

Discussed with @rhutten, I will create a follow-up issue. We can keep the above described as is and determine the acceptance criteria for the full support in the follow-up issue. since fmcomponent expects an int for process, a start and end are currently not supported, thus reading/writing of a semicolon input can only be partialy supported.

In the follow-up issue we will need to discuss what how the user should be able to apply start and end value for process. This way we can also determine how to add support to read and write the semicolon value for process. The acceptance criteria for the follow-up can be further refined with input from "D-Flow_FM-User_manual 17.1.3 Input DIMR" Maybe we can also request more information on what the numbers actualy mean, e.g. 16:31, would this mean run on core 16 to 31?

@veenstrajelmer, If in your opinion the behaviour you have described is more desirable, I can have a look to implement that in this issue. Any other update to support semicolon input and output should be handled in the follow-up issue.

Update: suggestion implemented

veenstrajelmer commented 5 months ago

@MRVermeulenDeltares thanks a lot for implementing this. This linked branch was tested and the code now indeed behaves as it should.