diatomic / diy

data-parallel out-of-core library

diy::Master::exchange() error when nblock < nproc #29

Open mrzv opened 9 years ago

mrzv commented 9 years ago

Issue by Hadrien Croubois Monday Mar 02, 2015 at 21:02 GMT


When the total number of blocks is smaller than the number of processes running in parallel, calling diy::Master::exchange() gives an error:

[Archteryx:13682] *** Process received signal ***
[Archteryx:13682] Signal: Floating point exception (8)
[Archteryx:13682] Signal code: Integer divide-by-zero (1)

Is that a known problem? I'll try to have a look at where it comes from.

mrzv commented 9 years ago

Comment by Dmitriy Morozov Monday Mar 02, 2015 at 21:38 GMT


I've never run it in this regime, so it's not a known problem. But it's not difficult to guess what may be going wrong. There is a division by size() in flush(). That will definitely fail (since size is 0). There may be other problems.

I should note that this is not a regime I've thought about before. Other things might be failing as well.

mrzv commented 9 years ago

Comment by Hadrien Croubois Monday Mar 02, 2015 at 21:46 GMT


I understand that's not something you might expect, but you can still run into it in specific cases:

  1. a large number of nodes for pipeline work
  2. low throughput
  3. applications where computational complexity is low compared to communication cost
    • you end up with large blocks being deployed on a small portion of your nodes
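
To make the situation concrete, here is a minimal sketch of how some processes end up with no blocks at all. It assumes blocks are dealt out round-robin by global id; `blocks_per_rank` is a hypothetical helper written for this illustration, not part of DIY, and the assignment actually used in the failing run is not stated in this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration (not DIY's assigner): count how many blocks each
// rank owns when global block ids are dealt out round-robin, gid -> gid % nprocs.
std::vector<int> blocks_per_rank(int nblocks, int nprocs)
{
    assert(nblocks >= 0 && nprocs > 0);
    std::vector<int> counts(nprocs, 0);
    for (int gid = 0; gid < nblocks; ++gid)
        ++counts[gid % nprocs];     // rank that owns block gid
    return counts;
}
```

Under this assumption, `blocks_per_rank(2, 4)` leaves ranks 2 and 3 with zero blocks, so any per-process code that divides by the local block count has to handle 0.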

> I should note that this is not a regime I've thought about before. Other things might be failing as well.

That's what I'm looking at

mrzv commented 9 years ago

Comment by Dmitriy Morozov Monday Mar 02, 2015 at 21:54 GMT


Oh, I don't question the usefulness of the setting. Just acknowledging that I haven't thought about it before.

I'll fix this problem when I get a chance. Meanwhile, you can check whether something simple, like setting out_queues_limit = 0 in flush() when size() == 0, solves it for you.
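
A minimal sketch of that kind of guard, not DIY's actual code: `per_block_limit` is a hypothetical helper, and it assumes that flush() derives a per-block value by dividing some total limit by the number of local blocks. Only the idea of falling back to 0 when size() == 0 comes from the comment above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of DIY): split a total queue limit among the
// blocks held by this process. An integer division by zero would raise
// SIGFPE, so a process with no local blocks gets a limit of 0 instead.
int per_block_limit(int total_limit, std::size_t local_blocks)
{
    assert(total_limit >= 0);
    if (local_blocks == 0)
        return 0;                   // nothing to flush on this process
    return total_limit / static_cast<int>(local_blocks);
}
```

For example, `per_block_limit(16, 4)` yields 4, while `per_block_limit(16, 0)` returns 0 rather than trapping. A guard like this only avoids the crash; it does not address any other code that assumes every process owns at least one block.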

mrzv commented 9 years ago

Comment by Dmitriy Morozov Tuesday Mar 03, 2015 at 01:09 GMT


e63cbea might fix the immediate problem (in the way I described), but there are deeper problems with the design in this case. (DIY collectives break, as does the pattern of loading the data via collective IO during decomposition.) I need to think through this to figure out the best solution.