TRIQS / cthyb

A fast and generic hybridization-expansion solver
https://triqs.github.io/cthyb

Different output for v3.0 and v3.1 #166

Closed: ec147 closed this issue 1 year ago

ec147 commented 1 year ago

TRIQS_ABINIT_interface-code.pdf

[Attached figures: Gtau_3_1 (v3.1 result), Gtau_3_0 (v3.0 result)]

I have made two calculations with CT-HYB, one with version 3.0 and one with version 3.1. Both use exactly the same parameters and the same G0(w) as input. Yet the G(tau) output of version 3.1 is very noisy and highly non-physical (first picture), while the output of version 3.0 is satisfactory (second picture). The calculation is parallelized over 2048 CPUs.

Do you have any idea what causes this discrepancy between the two versions?

I am attaching the C++ code I used; it is part of the DFT code Abinit, which gives me the G0(w) and the U matrix as input for CT-HYB.

the-hampel commented 1 year ago

Dear @ec147,

that is indeed quite odd. I had a brief look at your code and at first glance it all looks fine. In principle, the only changes from triqs 3.0 to 3.1 that could really influence this are the stat changes in TRIQS itself (@Wentzell, correct me if I am wrong). Within cthyb the changes are minimal.

We have several benchmark scripts at https://github.com/TRIQS/benchmarks, and I think they have been tested with 3.1.x without problems. Moreover, your 3.1.x result looks really wrong, so something must be off here.

Did I see correctly that you store G0_iw to a text file? Are those files identical between the two runs? Could you also provide the standard output of the solver? I would like to check whether the solver worked with the same local Hamiltonian, detected the same number of subspaces, and reported similar acceptance rates.

Best, Alex

ec147 commented 1 year ago

Thanks for your feedback. I found the issue and fixed it easily; in the latest version of the mpi dependency, the MPI environment is activated via the variable has_env, which is set to true if one of the following environment variables is found: OMPI_COMM_WORLD_RANK, PMI_RANK or CRAY_MPICH_VERSION. However, I'm using a SLURM environment, which exposes a different environment variable (SLURM_PROCID, I think).
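For reference, a minimal sketch of the kind of detection described above (illustration only; the function name is made up here, and in the library the result is stored in has_env rather than returned by a function):

#include <cstdlib>

// Illustration: decide whether an MPI launcher is present by looking for
// environment variables that common launchers export.
inline bool detect_mpi_env() {
  return std::getenv("OMPI_COMM_WORLD_RANK") != nullptr   // OpenMPI (mpirun)
      or std::getenv("PMI_RANK") != nullptr               // MPICH-based launchers
      or std::getenv("CRAY_MPICH_VERSION") != nullptr;    // Cray MPICH
}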

the-hampel commented 1 year ago

Glad to hear that the issue is resolved for you. May I ask how you solved it? In principle we rely on this MPI detection feature to work. If there is any cluster environment where it does not work out of the box please let us know. We are happy to add additional environment variable checks. Best, Alex

ec147 commented 1 year ago

Sure; I simply replaced line 44 of the mpi.hpp header file with:

if (std::getenv("SLURM_PROCID") != nullptr or std::getenv("OMPI_COMM_WORLD_RANK") != nullptr
    or std::getenv("PMI_RANK") != nullptr or std::getenv("CRAY_MPICH_VERSION") != nullptr)

the-hampel commented 1 year ago

Interesting. I understand that SLURM_PROCID will work here, but it would be a bit dangerous for us to add this check in general, since SLURM_PROCID can also be set for non-MPI jobs launched with srun (correct me if I am wrong); it is just the process ID allocated by SLURM. Are you using MPICH, OpenMPI, or something similar?

@Wentzell do you understand why our MPI detection fails in this case?

ec147 commented 1 year ago

Yes, I just checked, and it seems that the environment variable SLURM_PROCID is also set for sequential runs, so my change is not the proper way to fix the issue. I just wanted an easy workaround without thinking too much about it, and it is not a problem for me since I always parallelize my runs, so I always want the MPI environment to be activated. I'm really not an expert on SLURM environments, so unfortunately I cannot help you much further.

I'm using openmpi.

Wentzell commented 1 year ago

I agree that SLURM_PROCID is the wrong solution here. Which version of openmpi are you using? It looks like OMPI_COMM_WORLD_RANK is not set, even though it should be?

ec147 commented 1 year ago

I'm using v4.1.4.4 of openmpi. If my understanding is correct, the variable OMPI_COMM_WORLD_RANK is set when the job is launched with the mpirun command. However, my environment uses an abstraction layer (Bridge) on top of SLURM, and the MPI run is launched with the command ccc_mprun, so the variable OMPI_COMM_WORLD_RANK is not set. This is very specific to my company, so I do not think this is a major issue for you.
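For illustration (a stand-alone sketch, not part of TRIQS), one can run a small check on a compute node to see which of these launcher variables are actually exported under a given launcher:

#include <cstdio>
#include <cstdlib>
#include <initializer_list>

int main() {
  // Print which of the launcher variables discussed above are set.
  // Under ccc_mprun the OpenMPI/MPICH/Cray ones may all be missing,
  // even though the job does run under MPI.
  for (const char *var : {"OMPI_COMM_WORLD_RANK", "PMI_RANK", "CRAY_MPICH_VERSION", "SLURM_PROCID"}) {
    const char *val = std::getenv(var);
    std::printf("%-22s : %s\n", var, val ? val : "(not set)");
  }
  return 0;
}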

the-hampel commented 1 year ago

Okay, I see. I wonder if we should add a cmake flag that enforces MPI initialization, skipping the detection of an MPI environment (the way it was before we introduced this check), to provide a quick workaround in such cases?

Wentzell commented 1 year ago

@the-hampel Maybe we could simply check whether TRIQS_FORCE_MPI_INIT is set in the environment (on the same line where we have the other checks)?
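As a rough sketch of what I mean (illustration only, not the actual implementation), extending the same kind of condition:

#include <cstdlib>

// Sketch: treat TRIQS_FORCE_MPI_INIT as an explicit user override, checked
// alongside the launcher variables discussed above.
inline bool detect_mpi_env() {
  return std::getenv("TRIQS_FORCE_MPI_INIT") != nullptr   // explicit override
      or std::getenv("OMPI_COMM_WORLD_RANK") != nullptr
      or std::getenv("PMI_RANK") != nullptr
      or std::getenv("CRAY_MPICH_VERSION") != nullptr;
}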

the-hampel commented 1 year ago

I think I like that idea. Let me add this and try it out.

the-hampel commented 1 year ago

I added two PRs to add the feature: one in triqs (https://github.com/TRIQS/triqs/pull/883) for the check in the Python layer, and one in triqs/mpi itself (https://github.com/TRIQS/mpi/pull/11). This allows the following:

(triqs-dev) >python sumk_test.py
Warning: could not identify MPI environment!
Starting serial run at: 2023-06-05 05:06:20.907482

(triqs-dev) >export TRIQS_FORCE_MPI_INIT=1

(triqs-dev) >python sumk_test.py
Starting run with 1 MPI rank(s) at : 2023-06-05 05:06:27.285073

If this looks good please merge.

Wentzell commented 1 year ago

Thank you @the-hampel, both pull requests have been merged. This resolves the problem described here, so I am closing the issue.