davidkleiven / CEMC

DEPRECATED: Monte Carlo package targeted at systems studied with the Cluster Expansion.
MIT License

The database has to be prepared prior to calling get_ce_calc #65

Open phymalidoust opened 5 years ago

phymalidoust commented 5 years ago

Hi,

When do we get this error? "The database has to be prepared prior to calling get_ce_calc"

This error arises when I call 'get_ce_calc' on two or more processors. When I call it on a single processor, it passes without raising this error.

davidkleiven commented 5 years ago

It is because, if the DB does not exist, only one processor can write. The problem is that the write statement is so slow that you get a race condition: one processor starts to write, the others see that the entry is there and try to read, but since the write is not finished the read fails. It is the CE code that creates the DB, so it needs to be handled there. As of now you don't get any speedup by running that part in parallel, but maybe in the future.

Sent by email on Fri, 11 Jan 2019, 11:11, in reply to phymalidoust.

phymalidoust commented 5 years ago

I first thought that exactly what you describe was happening, so I forced the processors to start one after another and take their own copies. However, it gets stuck at the master processor. The first processor even creates its own DB etc., but it gets stuck in the function "get_ce_calc" and doesn't allow the other ones to start.

phymalidoust commented 5 years ago

The routine "get_ce_calc" prints the message below twice before initializing the calculations. However, it stops after printing this message the first time. Maybe it gives some clues to find out how to get rid of this?

Getting symbols from BC object
Getting cluster names from atoms object
Finished reading cluster_info
Reading basis functions from BC object
Reading translation matrix from BC
Reading translation matrix from list of dictionaries
Inserted 2688 into the translation matrix
Parsing correlation function
CEUpdater initialized sucessfully!

davidkleiven commented 5 years ago

Yes, that's just for debugging; it will be removed in the future. I suspect that the CE code will eventually use MPI to parallelize certain aspects of the code, and it will be resolved there. So a hack for now is to first run get_ce_calc with a single processor, keep the database, and the next time call it using MPI. If the database exists, all cores will just read, and that is fine.

phymalidoust commented 5 years ago

That doesn't help. It prints "CEUpdater initialized sucessfully!" the first time but gets stuck in the second round.

davidkleiven commented 5 years ago

Do you have a minimal example that demonstrates this? What do you mean by "Maybe it gives some clues to find out how to get rid of this"?

phymalidoust commented 5 years ago

Here is a minimal example (attached mpi_test.py.zip). I meant that the message could indicate roughly where this error happens.

davidkleiven commented 5 years ago

@phymalidoust This script is not supposed to work with MPI. What are you trying to parallelise over?

davidkleiven commented 5 years ago

As it is now, calling MPI will just result in exactly the same instructions running on multiple processors, which is likely to cause IO problems.
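For illustration, a script only benefits from MPI when each rank takes a different share of the work. The sketch below shows one common way to split a list by rank; the function name and the `structure_*` items are hypothetical, and with mpi4py the rank and size would come from `MPI.COMM_WORLD.Get_rank()` and `MPI.COMM_WORLD.Get_size()` rather than a loop.

```python
def share_for_rank(items, rank, size):
    """Round-robin split: rank r takes items r, r + size, r + 2 * size, ..."""
    return items[rank::size]


# Hypothetical work items; plain integers stand in for MPI ranks so the
# sketch runs without MPI.
structures = [f"structure_{i}" for i in range(10)]
size = 4
for rank in range(size):
    print(rank, share_for_rank(structures, rank, size))
```

Without a split like this, every rank repeats the same reads and writes on the same files.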

phymalidoust commented 5 years ago

@phymalidoust This script is not supposed to work with MPI. What are you trying to parallelise over?

The script is quite simplified. It was supposed to make a few databases for the large structures with multiple nodes in parallel. With this script, the processors don't work in parallel right now: each one starts after the previous one is done. This ensures that there's no mix-up among the processors. The first processor even produces its own database but gets stuck somewhere after ...

davidkleiven commented 5 years ago

So you planned to create many databases, one for each core? In that case, why don't you just include the rank in the database name and in all other files used for IO?
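A minimal sketch of that suggestion, assuming a database file named `ce.db` (a hypothetical name): each rank derives its own file name, so no two processes touch the same file. The helper `rank_specific_path` is illustrative and not part of CEMC.

```python
from pathlib import Path


def rank_specific_path(base_name, rank):
    """Insert the rank before the file extension, keeping the directory."""
    base = Path(base_name)
    return base.with_name(f"{base.stem}_rank{rank}{base.suffix}")


# With mpi4py the rank would come from MPI.COMM_WORLD.Get_rank();
# plain integers are used here so the sketch runs without MPI.
for rank in range(4):
    print(rank_specific_path("ce.db", rank))
```

The same helper can be applied to any other file the script reads or writes.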

phymalidoust commented 5 years ago

Actually, I did not plan to produce many DBs in the actual calculations. I just narrowed down the code to isolate the issue to 'get_ce_calc', and this simplified code shows in which function a multiprocessor computation stops.

davidkleiven commented 5 years ago

This script runs without errors using MPI with four cores. I think the reason why it stops is that, when running with MPI, get_ce_calc assumes that all cores enter it. As exceptions may occur on some or all cores during initialisation, each processor sends a message to the others if an exception occurs, and the program exits. If you hold the processors back outside, such that they enter one at a time, it will be stuck forever because the waiting processors never reach the "control point" where they send a message confirming that the initialisation was OK.
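The "control point" idea can be sketched as follows. This is not CEMC's actual implementation, only an illustration under assumptions: `init_with_control_point` is a hypothetical name, and `SerialComm` is a single-process stand-in so the example runs without MPI. With mpi4py, `MPI.COMM_WORLD` provides `allgather` with the same call shape.

```python
class SerialComm:
    """Single-process stand-in for an mpi4py communicator (assumption)."""

    def allgather(self, obj):
        return [obj]


def init_with_control_point(comm, initialise):
    """Run initialise() on this rank, then agree with all ranks on success."""
    error = None
    result = None
    try:
        result = initialise()
    except Exception as exc:
        error = repr(exc)
    # Control point: allgather is a collective call, so it only returns
    # once every rank has reached this line.
    errors = comm.allgather(error)
    failed = [(rank, err) for rank, err in enumerate(errors) if err is not None]
    if failed:
        raise RuntimeError(f"Initialisation failed on ranks: {failed}")
    return result


print(init_with_control_point(SerialComm(), lambda: "initialised"))
```

Because the collective call blocks until all ranks arrive, a rank that is held back before entering the function leaves the others waiting indefinitely, which matches the hang described above.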

phymalidoust commented 5 years ago

It is strange that it works OK there. When I run it with 4 processors and let all 4 access 'get_ce_calc' simultaneously, it stops with this error: 'The database has to be prepared prior to calling get_ce_calc'.

davidkleiven commented 5 years ago

Yes, so you first have to prepare it, then run in parallel. Maybe I will change this; we don't have to do it that way. One processor can create it while the others wait, and then they can all read. Actually, I will do that. Good that you brought this up.
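A minimal sketch of that plan, under stated assumptions: a JSON file stands in for the real database, `prepare_then_read` and `build_entries` are hypothetical names, and `SerialComm` replaces the MPI communicator so the example runs in a single process. With mpi4py, `MPI.COMM_WORLD` offers `Get_rank()` and `Barrier()` with the same call shape.

```python
import json
import os
import tempfile


class SerialComm:
    """Single-process stand-in for an mpi4py communicator (assumption)."""

    def Get_rank(self):
        return 0

    def Barrier(self):
        pass


def prepare_then_read(comm, db_path, build_entries):
    """Rank 0 writes the file if it is missing; every rank reads it afterwards."""
    if comm.Get_rank() == 0 and not os.path.exists(db_path):
        with open(db_path, "w") as handle:
            json.dump(build_entries(), handle)
    # All ranks wait here, so no rank starts reading before rank 0 is done.
    comm.Barrier()
    with open(db_path) as handle:
        return json.load(handle)


# Serial demonstration; under MPI, pass MPI.COMM_WORLD from mpi4py instead.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ce_demo.json")
    entries = prepare_then_read(SerialComm(), path, lambda: {"n_entries": 3})
    print(entries)
```

The barrier is what removes the race: the existence check alone is not enough, because the file can exist while it is still being written.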