GeoscienceAustralia / eqrm

Automatically exported from code.google.com/p/eqrm

Crashes when using source file nat_zone_source_lvliiPGA.xml #13

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
David Burbidge is still having issues running his simulation. The application 
crashes when saving the event set:

Traceback (most recent call last):
  File "runhaz.py", line 40, in <module>
    run_model()
  File "runhaz.py", line 36, in run_model
    analysis.main(run,True,compress_output)
  File "/nas/gemd/ehp/georisk_earthquake/EQRM/sandpits/dburbidg/python_eqrm/trunk/eqrm_core/eqrm_code/analysis.py", line 167, in main
    (event_set, event_activity, source_model) = create_event_set(eqrm_flags, parallel)
  File "/nas/gemd/ehp/georisk_earthquake/EQRM/sandpits/dburbidg/python_eqrm/trunk/eqrm_core/eqrm_code/event_set.py", line 1531, in create_event_set
    generate_event_set(parallel, eqrm_flags)
  File "/nas/gemd/ehp/georisk_earthquake/EQRM/sandpits/dburbidg/python_eqrm/trunk/eqrm_core/eqrm_code/event_set.py", line 1390, in generate_event_set
    eqrm_flags.prob_number_of_events_in_zones)
  File "/nas/gemd/ehp/georisk_earthquake/EQRM/sandpits/dburbidg/python_eqrm/trunk/eqrm_core/eqrm_code/event_set.py", line 631, in generate_synthetic_events
    event.source_zone_id = asarray(source_zone_id)
  File "/nas/gemd/ehp/georisk_earthquake/EQRM/sandpits/dburbidg/python_eqrm/trunk/eqrm_core/eqrm_code/event_set.py", line 190, in <lambda>
    lambda self, value: self._set_file_array('source_zone_id', value))
  File "/nas/gemd/ehp/georisk_earthquake/EQRM/sandpits/dburbidg/python_eqrm/trunk/eqrm_core/eqrm_code/file_store.py", line 136, in _set_file_array
    self._set_numpy_binary_array(name, array)
  File "/nas/gemd/ehp/georisk_earthquake/EQRM/sandpits/dburbidg/python_eqrm/trunk/eqrm_core/eqrm_code/file_store.py", line 126, in _set_numpy_binary_array
    save(filename, array)
  File "/usr/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 408, in save
    format.write_array(fid, arr)
  File "/usr/local/lib/python2.5/site-packages/numpy/lib/format.py", line 409, in write_array
    array.tofile(fp)
ValueError: 6761464 requested and 6274038 written

nat_zone_source_lvliiPGA.xml is the source file used to generate the event set. 
This file and the setdata file are attached.

Original issue reported on code.google.com by b...@girorosso.com on 24 Feb 2012 at 2:57

GoogleCodeExporter commented 9 years ago
Could not reproduce on tornado running on a single node; the run got past 
this point. Unlikely to be a logic error. Now running an apples-to-apples 
comparison.

Original comment by b...@girorosso.com on 24 Feb 2012 at 3:54

GoogleCodeExporter commented 9 years ago
Ran on rhe-compute1 with 4 nodes without issue.

The exception originates from multiarray/convert.c in the numpy core library 
with this code:

n = fwrite((const void *)PyArray_DATA(self),
    (size_t) PyArray_DESCR(self)->elsize,
    (size_t) size, fp);

if (n < size) {
    PyErr_Format(PyExc_ValueError,
        "%ld requested and %ld written",
        (long) size, (long) n);
    return -1;
}

where the return value from fwrite is the number of elements (each of elsize 
bytes) successfully written, not the number of bytes.

This is an IO error, and the file system being written to in this case is an 
NFS mount, which is potentially unreliable.

Will look into something that can be done to prevent this from happening.
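
One possible guard is to check the free space on the target filesystem before calling numpy.save, so a full disk is reported clearly instead of surfacing as a short fwrite. This is only a minimal sketch, not EQRM code: the function name save_array_checked and the headroom_bytes margin are assumptions made for illustration, and it uses shutil.disk_usage, which needs a newer Python than the 2.5 shown in the tracebacks.

```python
import os
import shutil
import tempfile

import numpy as np


def save_array_checked(filename, array, headroom_bytes=1024 * 1024):
    """Save `array` with numpy.save, refusing up front if the target
    filesystem does not appear to have room for it.

    `headroom_bytes` is an arbitrary safety margin (an assumption, not a
    value taken from EQRM) covering the .npy header and other writers.
    """
    array = np.asarray(array)
    target_dir = os.path.dirname(os.path.abspath(filename))
    free = shutil.disk_usage(target_dir).free
    needed = array.nbytes + headroom_bytes
    if free < needed:
        raise IOError("only %d bytes free in %s, need about %d"
                      % (free, target_dir, needed))
    np.save(filename, array)


# Usage: write to a scratch directory and read the array back.
scratch = tempfile.mkdtemp()
path = os.path.join(scratch, "length.npy")
save_array_checked(path, np.arange(10.0))
restored = np.load(path)
```

The check is not atomic: with many nodes writing to the same filesystem at once, space can still run out between the check and the write, so it improves the error message rather than removing the failure.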

Original comment by b...@girorosso.com on 24 Feb 2012 at 4:45

GoogleCodeExporter commented 9 years ago
Reproduced the error on rhe-compute1 when running an adapted version of David's 
run script. A different event set attribute triggers the exception this time, 
but it is the same npyio operation:

Traceback (most recent call last):
  File "runhaz.py", line 40, in <module>
    run_model()
  File "runhaz.py", line 36, in run_model
    analysis.main(run,True,compress_output)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/analysis.py", line 167, in main
    (event_set, event_activity, source_model) = create_event_set(eqrm_flags, parallel)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/event_set.py", line 1531, in create_event_set
    generate_event_set(parallel, eqrm_flags)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/event_set.py", line 1390, in generate_event_set
    eqrm_flags.prob_number_of_events_in_zones)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/event_set.py", line 630, in generate_synthetic_events
    width=data.width)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/event_set.py", line 389, in create
    rupture_centroid_lon)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/event_set.py", line 132, in __init__
    self.length = length
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/event_set.py", line 181, in <lambda>
    lambda self, value: self._set_file_array('length', value))
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/file_store.py", line 136, in _set_file_array
    self._set_numpy_binary_array(name, array)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/file_store.py", line 126, in _set_numpy_binary_array
    save(filename, array)
  File "/usr/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 408, in save
    format.write_array(fid, arr)
  File "/usr/local/lib/python2.5/site-packages/numpy/lib/format.py", line 409, in write_array
    array.tofile(fp)
ValueError: 6761464 requested and 5376502 written

Original comment by b...@girorosso.com on 24 Feb 2012 at 5:46

GoogleCodeExporter commented 9 years ago
Revision 964 removes the temporary data store used during event set generation. 
This is no longer required to reduce memory usage, because the event set is now 
only ever generated on node 0.

Ran the same setdata file in save mode to generate the event set. This ran 
without issue.

Two tests in load mode:

1. 48 nodes (original number as used by David B):

Traceback (most recent call last):
  File "runhaz.py", line 40, in <module>
    run_model()
  File "runhaz.py", line 36, in run_model
    analysis.main(run,True,compress_output)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/analysis.py", line 358, in main
    sites = all_sites[i:i+1] # take site i
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/sites.py", line 118, in __getitem__
    return Sites(self.latitude[key], self.longitude[key], **attributes)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/sites.py", line 51, in __init__
    self.latitude = asarray(latitude)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/sites.py", line 66, in <lambda>
    lambda self, value: self._set_file_array('latitude', value))
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/file_store.py", line 145, in _set_file_array
    self._set_numpy_binary_array(name, array)
  File "/nas/gemd/georisk_models/earthquake/sandpits/ben/eqrm/trunk/eqrm_code/file_store.py", line 129, in _set_numpy_binary_array
    save(filename, array)
  File "/usr/local/lib/python2.5/site-packages/numpy/lib/npyio.py", line 411, in save
    fid.close()
IOError: [Errno 28] No space left on device

Observed behaviour:
- Slices of the memmap objects still seem to be written to /tmp (the 
originally loaded arrays are in data_dir==output_dir for this setdata file). 
For 48 nodes this appears to completely fill /tmp on rhe-compute1, so a second 
load test was attempted with 24 nodes.
- The simulation is IO bound until the first site SA calculations are completed.

2. 24 nodes:
Completed successfully in 3 hours 41 mins. See log snippet below (node 0 always 
finishes last).

On node 0, rhe-compute1.ga.gov.au clock (processor) time taken overall 
2:54:12.750000 hr:min:sec.
On node 0, rhe-compute1.ga.gov.au wall time taken overall 3:41:06.173783 
hr:min:sec.
wall_time_taken_overall_seconds = 13266.1737831

Original comment by b...@girorosso.com on 27 Feb 2012 at 10:39

GoogleCodeExporter commented 9 years ago
Tested successfully in 'generate' mode (i.e. generate the event set and 
continue to run the simulation) with 24 nodes on rhe-compute1:

On node 0, rhe-compute1.ga.gov.au clock (processor) time taken overall 
3:24:54.600000 hr:min:sec.
On node 0, rhe-compute1.ga.gov.au wall time taken overall 4:16:48.288327 
hr:min:sec.
wall_time_taken_overall_seconds = 15408.28832

The extra time taken is explained by the generation of the event set.
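
As a quick check of that difference, the two node 0 wall times quoted in the logs (3:41:06.173783 for the load run and 4:16:48.288327 for the generate run) can be subtracted directly. The helper name parse_hms is made up for this sketch.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'hr:min:sec' string from the log into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=float(seconds))


load_run = parse_hms("3:41:06.173783")      # load mode, 24 nodes
generate_run = parse_hms("4:16:48.288327")  # generate mode, 24 nodes

# Extra wall time attributed to generating the event set.
extra = generate_run - load_run
print(extra.total_seconds())
```

The difference comes to roughly 2142 seconds, a little under 36 minutes.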

Original comment by b...@girorosso.com on 28 Feb 2012 at 3:28

GoogleCodeExporter commented 9 years ago
Attempted with 40 nodes on rhe-compute1 and got the fwrite error when /tmp 
reached about 86% full:

ValueError: 245811 requested and 151542 written

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-lvtmp
                      2.0G  1.6G  282M  86% /tmp

Original comment by b...@girorosso.com on 28 Feb 2012 at 4:12

GoogleCodeExporter commented 9 years ago
As an update to the observed behaviour in comment 4: slicing Event_Set and 
Sites objects results in the slice's .npy file being put in /tmp.

This was due to the attribute _dir being lost in the slice. Both Event_Set and 
Sites implement __getitem__, which returns a new object with the sliced vector 
attributes. When the new object is instantiated, the attributes are assumed to 
have the same names as the arguments to the classmethod. In this case _dir is 
not passed as dir, so it defaults to None:

        args = {}
        for att in self.introspect_attributes():
            if getattr(self, att) is None:
                args[att] = None
            else:
                args[att] = getattr(self, att)[key]
        return Event_Set(**args) # FIXME relies on arg/attr name correspondence

Note the FIXME.

Resolution:
- Sites is no longer a File_Store object. The implementation did not cover all 
vectors, so for now the file arrays are no longer used.
- Event_Set __getitem__ now passes _dir as dir into the classmethod.
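
The second point can be illustrated with a small stand-in class. This is a hedged sketch only: SlicedStore is hypothetical and is not the Event_Set or File_Store code, and the fallback to the system temporary directory when dir is None is an assumption made to mirror the /tmp behaviour described above.

```python
import tempfile


class SlicedStore(object):
    """Hypothetical stand-in for a File_Store-backed class."""

    def __init__(self, length, dir=None):
        # Assumed behaviour: with no directory given, storage falls back
        # to the system temporary directory (e.g. /tmp).
        self._dir = dir if dir is not None else tempfile.gettempdir()
        self.length = length

    def introspect_attributes(self):
        # Names of the vector attributes that should be sliced.
        return ['length']

    def __getitem__(self, key):
        args = {}
        for att in self.introspect_attributes():
            value = getattr(self, att)
            args[att] = None if value is None else value[key]
        # Pass the storage directory explicitly; '_dir' is not returned by
        # introspect_attributes(), so without this line it would be lost.
        args['dir'] = self._dir
        return SlicedStore(**args)


# Usage: the slice keeps the parent's storage directory.
original = SlicedStore([1.0, 2.0, 3.0], dir="/data/output")
part = original[0:2]
```

Here part._dir stays "/data/output"; dropping the args['dir'] line would send the slice back to the default directory.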

Original comment by b...@girorosso.com on 29 Feb 2012 at 2:42

GoogleCodeExporter commented 9 years ago
The resolution in comment 7 is implemented in revision 972.

Original comment by b...@girorosso.com on 29 Feb 2012 at 2:56