StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Problem with large regions in Pygion #1186

Open marsupialtail opened 2 years ago

marsupialtail commented 2 years ago

I am running into some problems using large regions in Pygion.

Here is my code:

from __future__ import print_function
import pygion
from pygion import R, task, Region, RW, WD, N
import numpy as np
import pandas as pd
import time

def region_to_df(region):
    # Copy each field of the region into a column of a new DataFrame.
    df = pd.DataFrame(columns=region.keys())
    for key in region.keys():
        df[key] = getattr(region, key)
    return df

def df_to_region(df):
    # Allocate a region with one float64 field per DataFrame column, then
    # copy each column into the corresponding region field.
    bump = Region([len(df)], {key: pygion.float64 for key in df.keys()})
    for key in df.columns:
        getattr(bump, key)[:] = df[key]
    return bump

@task()
def input_csv(name):
    # Read a CSV file and convert it into a region, timing the conversion.
    print(name, time.time())
    a = pd.read_csv(name)
    result = df_to_region(a)
    print(name, time.time())
    return result

@task(privileges=[R, R])
def join(table1, table2):
    # Convert both input regions back to DataFrames, join them on "key",
    # and return the result as a new region.
    df1 = region_to_df(table1)
    df2 = region_to_df(table2)

    result = df1.merge(df2, on='key', how='inner', suffixes=('_a', '_b'))
    return df_to_region(result)

@task()
def main():
    tables = []
    inputs = ["a-big.csv", "b-big.csv"]
    # Launch one input_csv task per file, then join the two resulting regions.
    for x in pygion.IndexLaunch([2]):
        tables.append(input_csv(inputs[x]))
    result = join(tables[0].get(), tables[1].get()).get()
    print(region_to_df(result))

if __name__ == '__main__':
    main()

The files referenced have been uploaded here: https://drive.google.com/drive/folders/1wdQ3PTlTmIEl91BITfTxb5eHvn1ATa5a?usp=sharing

The execution command and resulting stack trace:

mpirun -np 2 --bind-to none ./legion_python examples/join.py -ll:py 1
a-big.csv 1643839455.885158
b-big.csv 1643839455.9539144
[0 - 7f9d6a02a840]    3.587531 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 0 of operation Mapping (UID 36) in parent task __main__.input_csv (UID 34) is using uninitialized data for field(s) 1048578,1048580,1048582,1048584,1048586,1048588,1048590,1048592,1048594,1048596,1048598,1048600,1048602,1048604,1048606,1048608,1048610,1048612,1048614,1048616,1048618,1048620,1048622,1048624,1048626,1048628,1048630,1048632,1048634,1048636,1048638,1048640,1048642,1048644,1048646,1048648,1048650,1048652,1048654,1048656,1048658,1048660,1048662,1048664,1048666,1048668,1048670,1048672,1048674,1048676,1048678,1048680,1048682,1048684,1048686,1048688,1048690,1048692,1048694,1048696,1048698,1048700,1048702,1048704,1048706,1048708,1048710,1048712,1048714,1048716,1048718,1048720,1048722,1048724,1048726,1048728,1048730,1048732,1048734,1048736,1048738,1048740,1048742,1048744,1048746,1048748,1048750,1048752,1048754,1048756,1048758,1048760,1048762,1048764,1048766,1048768,1048770,1048772,1048774,1048776,1048778 of logical region (8,2,2) (from file /home/ubuntu/control-replication/legion/runtime/legion/legion_ops.cc:1203)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

a-big.csv 1643839456.923279
legion_python: /home/ubuntu/control-replication/legion/runtime/legion/runtime.cc:1578: void Legion::Internal::FutureImpl::finish_set_future(): Assertion `!future_size_set || (((canonical_instance == NULL) ? 0 : canonical_instance->size) <= future_size)' failed.
*** Caught a fatal signal (proc 0): SIGABRT(6)
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace. 
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

[ip-172-31-48-233:12900] *** Process received signal ***
[ip-172-31-48-233:12900] Signal: Segmentation fault (11)
[ip-172-31-48-233:12900] Signal code: Address not mapped (1)
[ip-172-31-48-233:12900] Failing at address: 0x30
[ip-172-31-48-233:12900] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f70da1fd040]
[ip-172-31-48-233:12900] [ 1] /opt/amazon/openmpi/lib/openmpi/mca_pmix_pmix3x.so(PMIx_Finalize+0x527)[0x7f70d313ccd7]
[ip-172-31-48-233:12900] [ 2] /opt/amazon/openmpi/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_client_finalize+0x3a2)[0x7f70d30de022]
[ip-172-31-48-233:12900] [ 3] /opt/amazon/openmpi/lib/openmpi/mca_ess_hnp.so(+0x38b1)[0x7f70d6d278b1]
[ip-172-31-48-233:12900] [ 4] /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.6(+0x1aff1)[0x7f70da5c9ff1]
[ip-172-31-48-233:12900] [ 5] /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.6(event_base_loop+0x53f)[0x7f70da5ca91f]
[ip-172-31-48-233:12900] [ 6] mpirun(+0x1072)[0x564278558072]
[ip-172-31-48-233:12900] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f70da1dfbf7]
[ip-172-31-48-233:12900] [ 8] mpirun(+0xdda)[0x564278557dda]
[ip-172-31-48-233:12900] *** End of error message ***
Segmentation fault (core dumped)
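
As an aside, warning 1071 above appears to be separate from the crash: the inline mapping in df_to_region touches the new region's fields before anything has been written to them, so the runtime reports uninitialized data. A minimal sketch of one way to silence it, assuming pygion.fill takes (region, field_name, value):

bump = Region([len(df)], {key: pygion.float64 for key in df.keys()})
for key in df.columns:
    # Hypothetical: zero-fill each field before its first mapping so the
    # runtime never observes uninitialized data (silences warning 1071).
    pygion.fill(bump, key, 0.0)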

Commit:

commit 04d880fb2fbdbe128d311277d6a81a0bbf9d7cc5 (HEAD -> control_replication, origin/control_replication)
Merge: a72f18ba2 22f0ffcc2
Author: Mike Bauer <mike@lightsighter.org>
Date:   Sat Jan 29 01:59:08 2022 -0800

    legion: merge master into control replication and resolve conflicts
lightsighter commented 2 years ago

That's not the right stack trace; note that it doesn't include this line, where the assertion is triggered:

/home/ubuntu/control-replication/legion/runtime/legion/runtime.cc:1578

Run with REALM_FREEZE_ON_ERROR=1 in your environment, attach gdb, and dump the right stack trace from the thread that actually failed. That line should appear somewhere in the stack trace.
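
For example, a minimal sketch of that workflow (the <pid> placeholder stands for whatever process ID the frozen process reports; the exact freeze message may vary):

# Freeze the failing process on error instead of aborting, so gdb can attach:
REALM_FREEZE_ON_ERROR=1 mpirun -np 2 --bind-to none ./legion_python examples/join.py -ll:py 1

# From another terminal, attach to the frozen process and dump all threads:
gdb -p <pid>
(gdb) thread apply all bt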

lightsighter commented 2 years ago

Even without the stack trace, I can say that this looks like your application (or Pygion) is creating a future that is much larger than the pre-reserved size for futures declared in legion_config.h. Note that you should be able to raise this upper bound when you register the task variant, by saying precisely how big you expect the future's output to be. I still want to see the proper stack trace, though, to understand how you're getting there without hitting the normal error message.
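
As an illustration, one way to sidestep oversized futures at the Pygion level is to stop returning freshly created regions from tasks. A minimal sketch, assuming the row count and column names can be determined before launching the task (fill_from_csv is a hypothetical name):

@task(privileges=[WD])
def fill_from_csv(region, name):
    # Write the CSV contents directly into a pre-allocated region, so the
    # task returns nothing and no large value travels through a future.
    df = pd.read_csv(name)
    for key in df.columns:
        getattr(region, key)[:] = df[key]

# In main(), assuming nrows and column_names are known up front:
#   r = Region([nrows], {name: pygion.float64 for name in column_names})
#   fill_from_csv(r, "a-big.csv")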