StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu

Realm: support for reductions to HDF5 instances #521

Open elliottslaughter opened 5 years ago

elliottslaughter commented 5 years ago

@mariodirenzo was hitting this assertion, which seems to be somewhere in Realm DMA code.

It looks suspiciously similar to #381, but I don't know if it's related at all.

prometeo.exec: /home/ctrsp-2018/mariodr/legion/runtime/realm/transfer/transfer.cc:1237: size_t Realm::TransferIteratorIndexSpace<N, T>::step(size_t, Realm::TransferIterator::AddressInfo&, unsigned int, bool) [with int N = 2; T = long long int; size_t = long unsigned int]: Assertion `0 && "no support for non-affine pieces yet"' failed.
Signal 6 received by process 15984 (thread 2b1c67b07700) at: stack trace: 12 frames
  [0] = /usr/lib64/libc.so.6(+0x35270) [0x2b1c5093f270]
  [1] = /usr/lib64/libc.so.6(gsignal+0x37) [0x2b1c5093f1f7]
  [2] = /usr/lib64/libc.so.6(abort+0x148) [0x2b1c509408e8]
  [3] = /usr/lib64/libc.so.6(+0x2e266) [0x2b1c50938266]
  [4] = /usr/lib64/libc.so.6(+0x2e312) [0x2b1c50938312]
  [5] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(Realm::TransferIteratorIndexSpace<2, long long>::step(unsigned long, Realm::TransferIterator::AddressInfo&, unsigned int, bool)+0x715) [0x2b1c4e80726f]
  [6] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(Realm::ReduceRequest::perform_dma()+0x488) [0x2b1c4e851f74]
  [7] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(Realm::DmaRequestQueue::worker_thread_loop()+0xaf) [0x2b1c4e8552ad]
  [8] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(void Realm::Thread::thread_entry_wrapper<Realm::DmaRequestQueue, &Realm::DmaRequestQueue::worker_thread_loop>(void*)+0x18) [0x2b1c4e85ba8c]
  [9] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(Realm::KernelThread::pthread_entry(void*)+0x1af) [0x2b1c4e86fa61]
  [10] = /usr/lib64/libpthread.so.0(+0x7e25) [0x2b1c50edce25]
  [11] = /usr/lib64/libc.so.6(clone+0x6d) [0x2b1c50a0234d]
Signal 6 received by process 15984 (thread 2b1c67d08700) at: stack trace: 12 frames
  [0] = /usr/lib64/libc.so.6(+0x35270) [0x2b1c5093f270]
  [1] = /usr/lib64/libc.so.6(gsignal+0x37) [0x2b1c5093f1f7]
  [2] = /usr/lib64/libc.so.6(abort+0x148) [0x2b1c509408e8]
  [3] = /usr/lib64/libc.so.6(+0x2e266) [0x2b1c50938266]
  [4] = /usr/lib64/libc.so.6(+0x2e312) [0x2b1c50938312]
  [5] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(Realm::TransferIteratorIndexSpace<2, long long>::step(unsigned long, Realm::TransferIterator::AddressInfo&, unsigned int, bool)+0x715) [0x2b1c4e80726f]
  [6] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(Realm::ReduceRequest::perform_dma()+0x488) [0x2b1c4e851f74]
  [7] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(Realm::DmaRequestQueue::worker_thread_loop()+0xaf) [0x2b1c4e8552ad]
  [8] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(void Realm::Thread::thread_entry_wrapper<Realm::DmaRequestQueue, &Realm::DmaRequestQueue::worker_thread_loop>(void*)+0x18) [0x2b1c4e85ba8c]
  [9] = /home/ctrsp-2018/mariodr/legion/bindings/regent/libregent.so(Realm::KernelThread::pthread_entry(void*)+0x1af) [0x2b1c4e86fa61]
  [10] = /usr/lib64/libpthread.so.0(+0x7e25) [0x2b1c50edce25]
  [11] = /usr/lib64/libc.so.6(clone+0x6d) [0x2b1c50a0234d]

elliottslaughter commented 5 years ago

@mariodirenzo Do you know if you're doing any layout transformations? (CC: @manopapad @magnatelee)

mariodirenzo commented 5 years ago

I'm not sure. What do you mean by layout transformations? You can find attached a reproducer of what I am doing in my solver. I have tested this reproducer on a single core and it works. reproducer.tar.gz

elliottslaughter commented 5 years ago

Could you get into a debugger and print layout_piece?

mariodirenzo commented 5 years ago

I can't find that command or symbol in gdb attached to the solver after the assertion.

lightsighter commented 5 years ago

@jiazhihao Can you help with this until @streichler gets back from vacation?

elliottslaughter commented 5 years ago

@mariodirenzo If you're attaching to a process frozen with REALM_FREEZE_ON_ERROR=1, run info threads to list the threads. The thread you want will be inside nanosleep. Switch to that thread with thread N, where N is the thread number.

Then run bt to get the backtrace on that thread and figure out which frame number corresponds to Realm::TransferIteratorIndexSpace<N, T>::step. Run frame M where M is the frame number. At that point you should be able to do p/x layout_piece.

The output will be more readable if you add set print static-members off to ~/.gdbinit before you do this.
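
For example, a session might look like this (a sketch only; the thread and frame numbers are illustrative and will differ in your run):

(gdb) info threads        # find the thread sitting in nanosleep
(gdb) thread 4            # switch to that thread
(gdb) bt                  # find the frame for Realm::TransferIteratorIndexSpace<N, T>::step
(gdb) frame 7             # select that frame
(gdb) p/x layout_piece    # print the layout piece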

mariodirenzo commented 5 years ago

The output of p/x layout_piece is 0x2b178000f840.

elliottslaughter commented 5 years ago

Sorry, what about p/x *layout_piece?

mariodirenzo commented 5 years ago

$3 = (Realm::HDF5LayoutPiece<2, long long>) {
  <Realm::InstanceLayoutPiece<2, long long>> = {
    _vptr.InstanceLayoutPiece = 0x2b17326f6938 <vtable for Realm::HDF5LayoutPiece<2, long long>+16>, 
    layout_type = 0x2, 
    bounds = {
      lo = {
        x = 0x0, 
        y = 0x0
      }, 
      hi = {
        x = 0x1f, 
        y = 0x0
      }
    }
  }, 
  members of Realm::HDF5LayoutPiece<2, long long>: 
  filename = {
    _M_dataplus = {
      <std::allocator<char>> = {
        <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
      members of std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
      _M_p = 0x2b178000f8d0
    }, 
    _M_string_length = 0x36, 
    {
      _M_local_buf = {0x36, 0x0 <repeats 15 times>}, 
      _M_allocated_capacity = 0x36
    }
  }, 
  dsetname = {
    _M_dataplus = {
      <std::allocator<char>> = {
        <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
      members of std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
      _M_p = 0x2b178000f8a0
    }, 
    _M_string_length = 0xd, 
    {
      _M_local_buf = {0x43, 0x6f, 0x6e, 0x73, 0x65, 0x72, 0x76, 0x65, 0x64, 0x5f, 0x61, 0x76, 0x67, 0x0, 0x0, 0x0}, 
      _M_allocated_capacity = 0x65767265736e6f43
    }
  }, 
  offset = {
    x = 0x0, 
    y = 0x0
  }
}

mariodirenzo commented 5 years ago

Do you have any update on this issue?

elliottslaughter commented 5 years ago

I think we need @jiazhihao to look at this; we're doing something with an HDF5 instance, but beyond that I'm not in a position to say.

In the meantime, a possible workaround might be to disable HDF5 support in the application code, so that you're not trying to use an HDF5 instance.

mariodirenzo commented 5 years ago

Actually, that HDF5 file is the main output of the calculation I need to perform; disabling it would defeat the purpose of the calculation.

Do you know roughly when you will be able to look at this issue?

streichler commented 5 years ago

I should be able to take a look at this in the next day or two.

mariodirenzo commented 5 years ago

Thanks. Please let me know if you need additional output from gdb.

streichler commented 5 years ago

@mariodirenzo, I'm not able to get this to reproduce with the provided code. Can you confirm which commit you're building against and show the exact command lines used to build regent and then run the test case?

mariodirenzo commented 5 years ago

The reproducer attached in the comment from 10 days ago works on a single core and produces the correct output; it is just meant to show what I want to do. When I attempt the same process, using the same hdf_helper, in the full-scale application, I get the reported error. Right now I am using the version of the repository on the "nopaint" branch, built with the command: DEBUG=1 USE_CUDA=1 USE_OPENMP=1 USE_GASNET=1 USE_HDF=1 scripts/setup_env.py --llvm-version 38

streichler commented 5 years ago

Ah, in that case, can you run the failing version with -level inst=1,dma=2,new_dma=2 -logfile dma.log and attach the dma.log file that is produced?
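
For example (using the application binary name from the backtrace above; the other arguments are whatever you normally pass):

./prometeo.exec <usual application arguments> -level inst=1,dma=2,new_dma=2 -logfile dma.log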

mariodirenzo commented 5 years ago

This is the log file that I get:

dma.log

streichler commented 5 years ago

@mariodirenzo I think I know what's wrong (reduction copies to hdf5 instances), but it's not a quick fix. As a potential workaround, can you try changing the tasks that have reduction privileges (e.g. AddStats in Example15.rg) to use read/write privileges instead? This may impact your ability to parallelize some loops, and there's some chance it'll cause a different fatal error, but let's give it a try and see what happens.
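
A minimal sketch of the suggested privilege change, assuming a hypothetical AddStats signature (the field space names and arguments are illustrative; the real task in Example15.rg may differ):

task AddStats(src : region(ispace(int2d), Fluid_columns),
              avg : region(ispace(int2d), Averages_columns),
              w : double)
where
  reads(src),
  -- was: reduces +(avg)
  reads writes(avg)
do
  -- accumulate weighted statistics from src into avg
end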

mariodirenzo commented 5 years ago

@streichler I get the following error:

Example15.rg:131: loop optimization failed: argument 2 interferes with itself
            AddStats(pr[c], rakes[int2d({0, rake})], 1.0/10)

elliottslaughter commented 5 years ago

@mariodirenzo Like @streichler said, it's possible it will cause the code to serialize. In this case, it would mean that you'd be unable to use an index launch. But if you're just interested in getting the code to work, you can comment out __demand(__parallel) and this error should go away.
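
An illustrative sketch, built around the failing call at Example15.rg:131 (the loop structure and the names tiles, pr, rakes, and rake are assumed from context):

-- __demand(__parallel)   -- commented out: the loop no longer has to be an index launch
for c in tiles do
  AddStats(pr[c], rakes[int2d({0, rake})], 1.0/10)
end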

mariodirenzo commented 5 years ago

The HDF5 output works after removing __demand(__parallel), though there is a noticeable slowdown when I use the code with a few hundred million points in the region to be sampled.

When do you think this issue will be fixed?

streichler commented 5 years ago

The ETA is unknown, as we'd really like to fix this and some related issues with a more general solution.

Can we try another workaround? Go back to the reduction privileges and the parallel loop, but add, after the doubly-nested loop that calls AddStats, a dummy task that read/writes the s region (or an index launch that read/writes the various ps[*] subregions) before the HDF5 dump task.

mariodirenzo commented 5 years ago

I've added a call to this task

__demand(__inline)
task DummyAverages(Averages : region(ispace(int2d), Averages_columns))
where
    reads writes(Averages)
do
    -- Intentionally empty: this task exists only to work around the bug
    -- with parallel reductions to HDF5 instances
end

before the call to the dump task, but it did not work. Could it be related to the __demand(__inline)? I would try it myself, but compiling the code takes hours.

streichler commented 5 years ago

Yes, the __demand(__inline) needs to go - it's allowing Regent to optimize this task out again.
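
In other words, the dummy task from the previous comment with the annotation removed, so it is launched as a real task:

task DummyAverages(Averages : region(ispace(int2d), Averages_columns))
where
  reads writes(Averages)
do
  -- intentionally empty: it only forces a read/write dependence on Averages
  -- before the HDF5 dump, avoiding the reduction-to-HDF5 path
end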

mariodirenzo commented 5 years ago

Now it works and performs much better than the read/write version. Should we keep this issue open until the new implementation is available? That way, I'll know when I can remove the dummy task.

Thanks for your help.

streichler commented 5 years ago

Yes, we'll keep this open, and I've edited the title so I'll remember what the actual issue is. :)