gdtiti / alembic

Automatically exported from code.google.com/p/alembic

SimpleAbcViewer aborts with SEGV, hdf5 missing attribute exception, exception in free() #203

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
This problem is hard to reproduce, but we want to capture the details so I'm 
creating this ticket.

We have one test asset, where the alembic file is good, that will crash the 
SimpleAbcViewer after hitting play.

I think this bug is in the SimpleAbcViewer code, or at least it is specific to 
how it's calling the Alembic API. Other applications read this abc file without 
problems.

I get different bug behavior depending on how I invoke SimpleAbcViewer (full 
path to abc file, local path, full path to SimpleAbcViewer vs basename). Most 
of the time the asset doesn't crash. When it does crash though I can enter the 
identical command and it will reliably crash at the exact same point.

This feels like a stack related memory bug.

Running from within gdb doesn't crash. Running valgrind doesn't crash and no 
errors are reported.

I've seen 3 styles of aborts with this one asset:
- HDF5 exception raised that attribute can't be found
- HDF5 exception raised that internal link was invalid
- SEGV; the core dump reports:

Core was generated by 
`/var/tmp/doNotRemove/alembic_build/examples/bin/SimpleAbcViewer/SimpleAbcViewer'.
Program terminated with signal 11, Segmentation fault.
[New process 12817]
#0  0x0000003cade6f5c5 in malloc_usable_size () from /lib64/libc.so.6
(gdb) where
#0  0x0000003cade6f5c5 in malloc_usable_size () from /lib64/libc.so.6
#1  0x0000003cb8ca15af in ?? () from /usr/lib64/libGL.so.1
#2  0x0000003cb4e0089b in ?? () from /usr/lib64/tls/libnvidia-tls.so.256.53
#3  0x0000000000749ef2 in H5MM_xfree (mem=0x400000004) at H5MM.c:247
#4  0x000000000064bb71 in H5B2_leaf_free (leaf=0x9fb2cc0) at H5B2int.c:3056
#5  0x000000000063bd16 in H5B2_cache_leaf_load (f=0xf8dc90, dxpl_id=167772168,
    addr=42685253, _udata=0x7fff0d7ec8f0) at H5B2cache.c:922
#6  0x000000000065e87d in H5C_load_entry (f=0xf8dc90, dxpl_id=167772168,
    type=0xe03340, addr=42685253, udata=0x7fff0d7ec8f0) at H5C.c:7955
#7  0x0000000000653762 in H5C_protect (f=0xf8dc90, primary_dxpl_id=167772168,
    secondary_dxpl_id=167772168, type=0xe03340, addr=42685253,
    udata=0x7fff0d7ec8f0, flags=512) at H5C.c:3563
#8  0x0000000000630f34 in H5AC_protect (f=0xf8dc90, dxpl_id=167772168,
    type=0xe03340, addr=42685253, udata=0x7fff0d7ec8f0, rw=H5AC_READ)
    at H5AC.c:1312
#9  0x0000000000647284 in H5B2_protect_leaf (hdr=0x50c16e0, dxpl_id=167772168,
    addr=42685253, nrec=22, rw=H5AC_READ) at H5B2int.c:1820
#10 0x0000000000635536 in H5B2_find (bt2=0xb893640, dxpl_id=167772168,
    udata=0x7fff0d7eca30, op=0, op_data=0x0) at H5B2.c:503
#11 0x000000000095e888 in H5A_dense_exists (f=0xf8dc90, dxpl_id=167772168,
    ainfo=0x7fff0d7ecb10, name=0xb957928 "P.smp0.dims") at H5Adense.c:1754
#12 0x000000000076241f in H5O_attr_exists (loc=0x11aea28,
    name=0xb957928 "P.smp0.dims", dxpl_id=167772168) at H5Oattribute.c:1857
#13 0x000000000062a9d1 in H5Aexists (obj_id=33554574,
    attr_name=0xb957928 "P.smp0.dims") at H5A.c:2553
#14 0x00000000005b2383 in Alembic::AbcCoreHDF5::v1::ReadArray (iCache=
        {px = 0x7fff0d7ed160, pn = {pi_ = 0x7fff0d7ecc90}}, iParent=33554574,
    iName=@0x7fff0d7ed390, iDataType=@0x11b5f80, iFileType=50331702,
    iNativeType=50331690)
    at /home/bleair/src/bleair-helper/lib/Alembic/AbcCoreHDF5/ReadUtil.cpp:587
#15 0x00000000005ebe23 in Alembic::AbcCoreHDF5::v1::AprImpl::readSample (
    this=0x11b0380, iGroup=33554574, iSampleName=@0x7fff0d7ed390,
    iSampleIndex=0, oSamplePtr=@0x7fff0d7ed4d0)
    at /home/bleair/src/bleair-helper/lib/Alembic/AbcCoreHDF5/AprImpl.cpp:137
#16 0x00000000005edf65 in Alembic::AbcCoreHDF5::v1::SimplePrImpl<Alembic::AbcCoreAbstract::v1::ArrayPropertyReader, Alembic::AbcCoreHDF5::v1::AprImpl, boost::shared_ptr<Alembic::AbcCoreAbstract::v1::ArraySample>&>::getSample (
    this=0x11b0380, iSampleIndex=0, oSample=@0x7fff0d7ed4d0)
    at /home/bleair/src/bleair-helper/lib/Alembic/AbcCoreHDF5/SimplePrImpl.h:338
#17 0x000000000059d232 in Alembic::Abc::IArrayProperty::get (this=0x11b4560,
    oSamp=@0x7fff0d7ed4d0, iSS=@0x7fff0d7ed7d0)
    at /home/bleair/src/bleair-helper/lib/Alembic/Abc/IArrayProperty.cpp:110
#18 0x000000000054dc8f in Alembic::Abc::ITypedArrayProperty<Alembic::Abc::V3fTPTraits>::get (
    this=0x11b4560, iVal=@0x7fff0d7ed660, iSS=@0x7fff0d7ed7d0)
    at /home/bleair/src/bleair-helper/lib/Alembic/Abc/ITypedArrayProperty.h:136
#19 0x000000000058669e in Alembic::AbcGeom::ISubDSchema::get (this=0x11b4540,
    oSample=@0x7fff0d7ed660, iSS=@0x7fff0d7ed7d0)
    at /home/bleair/src/bleair-helper/lib/Alembic/AbcGeom/ISubD.cpp:102
#20 0x0000000000552033 in SimpleAbcViewer::ISubDDrw::setTime (this=0x11b44a0,
    iSeconds=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/ISubDDrw.cpp:109
#21 0x000000000053210f in SimpleAbcViewer::IObjectDrw::setTime (
    this=0x11a77d0, iTime=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IObjectDrw.cpp:176
#22 0x0000000000553c93 in SimpleAbcViewer::IXformDrw::setTime (this=0x11a77d0,
    iSeconds=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IXformDrw.cpp:94
#23 0x000000000053210f in SimpleAbcViewer::IObjectDrw::setTime (
    this=0x10614d0, iTime=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IObjectDrw.cpp:176
#24 0x0000000000553c93 in SimpleAbcViewer::IXformDrw::setTime (this=0x10614d0,
    iSeconds=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IXformDrw.cpp:94
#25 0x000000000053210f in SimpleAbcViewer::IObjectDrw::setTime (
    this=0x1023c60, iTime=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IObjectDrw.cpp:176
#26 0x0000000000553c93 in SimpleAbcViewer::IXformDrw::setTime (this=0x1023c60,
    iSeconds=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IXformDrw.cpp:94
#27 0x000000000053210f in SimpleAbcViewer::IObjectDrw::setTime (this=0xfd69f0,
    iTime=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IObjectDrw.cpp:176
#28 0x0000000000553c93 in SimpleAbcViewer::IXformDrw::setTime (this=0xfd69f0,
    iSeconds=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IXformDrw.cpp:94
#29 0x000000000053210f in SimpleAbcViewer::IObjectDrw::setTime (this=0xfcd140,
    iTime=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IObjectDrw.cpp:176
#30 0x0000000000553c93 in SimpleAbcViewer::IXformDrw::setTime (this=0xfcd140,
    iSeconds=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IXformDrw.cpp:94
#31 0x000000000053210f in SimpleAbcViewer::IObjectDrw::setTime (this=0xfb97f0,
    iTime=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IObjectDrw.cpp:176
#32 0x0000000000553c93 in SimpleAbcViewer::IXformDrw::setTime (this=0xfb97f0,
    iSeconds=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IXformDrw.cpp:94
#33 0x000000000053210f in SimpleAbcViewer::IObjectDrw::setTime (this=0xfaf220,
    iTime=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/IObjectDrw.cpp:176
#34 0x0000000000559d81 in SimpleAbcViewer::Scene::setTime (this=0xf75c60,
    iSeconds=0.041666666666666664)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/Scene.cpp:185
#35 0x0000000000567995 in SimpleAbcViewer::Transport::tickForward (
    this=0xf75c60)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/Transport.h:73
#36 0x000000000055d1da in TickForward ()
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/Viewer.cpp:136
#37 0x000000000055d32e in SimpleAbcViewer::playFwdIdle ()
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/Viewer.cpp:173
#38 0x00007f9c6b202007 in glutMainLoop () from /usr/lib64/libglut.so.3
#39 0x000000000055eaff in SimpleAbcViewer::SimpleViewScene (argc=2,
    argv=0x7fff0d7ef238)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/Viewer.cpp:573
#40 0x000000000056a3c7 in main (argc=2, argv=0x7fff0d7ef238)
    at /home/bleair/src/bleair-helper/examples/bin/SimpleAbcViewer/main.cpp:43

Original issue reported on code.google.com by ble...@gmail.com on 14 Jul 2011 at 4:44

GoogleCodeExporter commented 8 years ago
Folks, my fear is that this is actually an extremely HUGE issue; this simply 
should not be happening, and the fact that it's a Heisenbug is even more 
disturbing.

We may want to dogpile this really aggressively.

Original comment by ard...@gmail.com on 15 Jul 2011 at 10:20

GoogleCodeExporter commented 8 years ago
I can HAZ asset?
Or at least some statistics about the asset?

Original comment by miller.lucas on 15 Jul 2011 at 10:52

GoogleCodeExporter commented 8 years ago
I created a programmatic test asset:

http://code.google.com/r/ardent-embic/source/detail?r=4ea1c139a89fb1cd4200de5f7ac6f3d61b7c03fb&name=default

Here's how I can get the crash:

ferment bin/SimpleAbcViewer> pwd
/home/jardent/alembic_build/examples/bin/SimpleAbcViewer

ferment bin/SimpleAbcViewer> ./SimpleAbcViewer 
~/alembic_build/lib/Alembic/AbcGeom/Tests/transformingMesh1.abc
viewerpath: ./SimpleAbcViewer
renderscript: ./SimpleAbcViewerRenderit
Beginning to open archive: 
/home/jardent/alembic_build/lib/Alembic/AbcGeom/Tests/transformingMesh1.abc
Opened archive and top object, creating drawables.
Created drawables, getting time range.

Min Time: 0.25 seconds
Max Time: 4.375 seconds

Loading min time.
Done opening archive. Elapsed time: 5.22 seconds.
Bounds at min time: (-1.15883 -1 -1.40883) to (1.65883 1 1.40883)
HDF5-DIAG: Error detected in HDF5 (1.8.7) thread 140512367306528:
  #000: H5A.c line 550 in H5Aopen(): unable to load attribute info from object header
    major: Attribute
    minor: Unable to initialize object
  #001: H5Oattribute.c line 512 in H5O_attr_open_by_name(): can't open attribute
    major: Attribute
    minor: Can't open object
  #002: H5HF.c line 680 in H5HF_op(): can't operate on object from fractal heap
    major: Heap
    minor: Can't operate on object
  #003: H5HFman.c line 468 in H5HF_man_op(): unable to operate on heap object
    major: Heap
    minor: Can't operate on object
  #004: H5HFman.c line 296 in H5HF_man_op_real(): unable to protect fractal heap direct block
    major: Heap
    minor: Unable to protect metadata
  #005: H5HFdblock.c line 489 in H5HF_man_dblock_protect(): unable to protect fractal heap direct block
    major: Heap
    minor: Unable to protect metadata
  #006: H5AC.c line 1322 in H5AC_protect(): H5C_protect() failed.
    major: Object cache
    minor: Unable to protect metadata
  #007: H5C.c line 3567 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #008: H5C.c line 7957 in H5C_load_entry(): unable to load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #009: H5HFcache.c line 1307 in H5HF_cache_dblock_load(): can't read fractal heap direct block
    major: Heap
    minor: Read failed
  #010: H5Fio.c line 113 in H5F_block_read(): read through metadata accumulator failed
    major: Low-level I/O
    minor: Read failed
  #011: H5Faccum.c line 205 in H5F_accum_read(): driver read request failed
    major: Low-level I/O
    minor: Read failed
  #012: H5FDint.c line 142 in H5FD_read(): driver read request failed
    major: Virtual File Layer
    minor: Read failed
  #013: H5FDsec2.c line 719 in H5FD_sec2_read(): file read failed: time = Fri Jul 15 17:20:44 2011
, filename = 
'/home/jardent/alembic_build/lib/Alembic/AbcGeom/Tests/transformingMesh1.abc', 
file descriptor = 10, errno = 14, error message = 'Bad address', buf = 
0x659b000, size = 200, offset = 1083561
    major: Low-level I/O
    minor: Read failed
terminate called after throwing an instance of 'Alembic::Util::v1::Exception'
  what():  IXformSchema::get()
ERROR: EXCEPTION:
IScalarProperty::get()
ERROR: EXCEPTION:
Couldn't open attribute named: .inherits.smp0
Abort (core dumped)

ferment bin/SimpleAbcViewer>
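
The last frame of the HDF5 error stack above carries the useful detail: the errno, the destination buffer, and the read size. As a reading aid (not part of the original report), here is a minimal Python sketch that pulls those fields out of the `H5FD_sec2_read()` message and checks whether the buffer starts on a page boundary. The helper name `parse_failed_read` and the 4096-byte page size are assumptions made for illustration; the thread does not state the page size.

```python
import errno
import os
import re

# Fields copied from the H5FD_sec2_read() failure message in the log above.
LOG_LINE = ("errno = 14, error message = 'Bad address', "
            "buf = 0x659b000, size = 200, offset = 1083561")

PAGE_SIZE = 4096  # assumption: typical x86_64 page size, not stated in the thread


def parse_failed_read(line):
    """Hypothetical helper: extract errno, buffer address, size and offset."""
    m = re.search(
        r"errno = (\d+).*?buf = \s*(0x[0-9a-fA-F]+), size = (\d+), offset = (\d+)",
        line,
    )
    if m is None:
        raise ValueError("not an H5FD_sec2_read failure line")
    return {
        "errno": int(m.group(1)),
        "buf": int(m.group(2), 16),
        "size": int(m.group(3)),
        "offset": int(m.group(4)),
    }


info = parse_failed_read(LOG_LINE)
# Symbolic errno name and the C library's message for it.
print(errno.errorcode[info["errno"]], "-", os.strerror(info["errno"]))
# True when the destination buffer begins exactly at a page boundary.
print("buffer page-aligned:", info["buf"] % PAGE_SIZE == 0)
```

On Linux, errno 14 is `EFAULT`: the kernel could not write to the address it was handed, which points at the destination buffer rather than at the file contents.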

Original comment by ard...@gmail.com on 16 Jul 2011 at 12:22

GoogleCodeExporter commented 8 years ago
Ah, a couple other things to note:

1) the crash happens once you try to play the file (hitting the '.' key);

2) some unspecified level of complexity in the asset is required.  The smallest 
I found was to call recurseCreateXform() with children = 2, level = 10.

Original comment by ard...@gmail.com on 16 Jul 2011 at 12:24

GoogleCodeExporter commented 8 years ago
Well, I tried to reproduce at home, on my much-more-modern Ubuntu laptop, and 
failed to.  The Archive was created with 10 levels and 3 children, resulting in 
about half a million Objects (about 250k xforms, 250k meshes), and the 
resulting file size was about 3GB.  It took forever, but it played.

It's distinctly possible that this error is due to ILM's ancient 
compiler/toolchain.

Original comment by ard...@gmail.com on 18 Jul 2011 at 5:43

GoogleCodeExporter commented 8 years ago
I also wasn't able to reproduce this at Imageworks, so I currently suspect it 
might be some site specific difference at ILM.

Original comment by miller.lucas on 18 Jul 2011 at 5:48

GoogleCodeExporter commented 8 years ago
Speaking with the team here, this issue can be closed as invalid.

Feel free to comment if you have any questions.

Thanks, Scott

Original comment by scottmmo...@gmail.com on 20 Jul 2011 at 10:02

GoogleCodeExporter commented 8 years ago
After a bit of time with some deep debugging and malloc hooking I'm convinced 
that this bug is the result of our nvidia drivers (and _maybe_, as an outside 
chance, our glut/glew libs, but this seems less likely).

We have nvidia drivers 256.53:
NVRM version: NVIDIA UNIX x86_64 Kernel Module  256.53  Fri Aug 27 20:27:48 PDT 
2010
GCC version:  gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)

When this abort occurs in gdb I can see that H5FD_sec2_read() is calling 
HDread and trying to read into memory, and the system call reports bad address. 
The read buffer is valid and was correctly allocated (H5B2_hdr_init() 
allocates a page of 512 bytes). In the case I had, the SEGV is triggered while 
trying to read an attribute (H5B2_ATTR_DENSE_NAME_ID). The system call is 
reporting bad address and it's right at a memory page boundary, but after 
hooking free I know that the memory is still valid (and I can see BTLF in all 
the bytes - BtreeLeaf page). I then looked at /proc/<pid>/maps and I can see 
that every other memory page has had its permission bits flipped to read-only. 
I straced mprotect, which didn't actually show anything except for standard 
startup stuff, so I'm 80% convinced this is the nvidia driver doing bad stuff 
(via the kernel) to our proc's memory pages.
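
The failure mode described here, a read() into a still-allocated buffer whose page has become read-only failing with 'Bad address', can be reproduced in isolation. The sketch below is not from the original investigation. It is a minimal illustration, assuming Linux or another POSIX system, that uses Python's ctypes to call libc's mmap, mprotect and read directly. The helper name `read_into_readonly_page` is hypothetical, and a pipe stands in for the HDF5 file.

```python
import ctypes
import errno
import mmap
import os

# Call libc directly so read() targets a raw address, as HDF5's HDread does.
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mprotect.restype = ctypes.c_int
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.read.restype = ctypes.c_ssize_t
libc.read.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value


def read_into_readonly_page():
    """Hypothetical helper: return (read() result, errno) for a read-only buffer."""
    size = mmap.PAGESIZE
    # A valid, writable anonymous page: the "correctly allocated" buffer.
    addr = libc.mmap(None, size, mmap.PROT_READ | mmap.PROT_WRITE,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "mmap failed")
    rfd, wfd = os.pipe()
    try:
        os.write(wfd, b"BTLFBTLF")  # some bytes for read() to deliver
        # Flip the page to read-only; the mapping itself stays valid.
        if libc.mprotect(addr, size, mmap.PROT_READ) != 0:
            raise OSError(ctypes.get_errno(), "mprotect failed")
        ret = libc.read(rfd, addr, 8)
        return ret, ctypes.get_errno()
    finally:
        os.close(rfd)
        os.close(wfd)
        libc.munmap(addr, size)


ret, err = read_into_readonly_page()
print(ret, errno.errorcode.get(err))
```

On Linux this is expected to report a return value of -1 with errno `EFAULT`. The process is not killed: the kernel refuses the copy and hands back an error, which is consistent with HDF5 raising an exception in some runs instead of the process taking a SEGV.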

It's possible that SimpleAbcViewer has some calls to GL that are leaking 
resources, but even if SimpleAbcViewer is doing that, the nvidia drivers should 
not be claiming the HDF5 memory.
If someone wants to audit the GL calls in SimpleAbcViewer, cool, but otherwise 
I call this an nvidia driver bug and we can close this. 

Here's a portion of maps
0d2dd000-0d2f2000 r--p 00000000 00:00 0 
0d2f2000-0d2f4000 rw-p 00000000 00:00 0 
0d2f4000-0d307000 r--p 00000000 00:00 0 
0d307000-0d309000 rw-p 00000000 00:00 0 
0d309000-0d314000 r--p 00000000 00:00 0 
0d314000-0d316000 rw-p 00000000 00:00 0 
0d316000-0d31d000 r--p 00000000 00:00 0 
0d31d000-0d31f000 rw-p 00000000 00:00 0 
0d31f000-0d32f000 r--p 00000000 00:00 0 
0d32f000-0d330000 rw-p 00000000 00:00 0 
0d330000-0d332000 r--p 00000000 00:00 0 
0d332000-0d333000 rw-p 00000000 00:00 0 
0d333000-0d34c000 r--p 00000000 00:00 0 
0d34c000-0d34d000 rw-p 00000000 00:00 0 
0d34d000-0d352000 r--p 00000000 00:00 0 
0d352000-0d353000 rw-p 00000000 00:00 0 
...
3cb4e00000-3cb4e01000 r-xp 00000000 fd:00 3529469                        
/usr/lib64/tls/libnvidia-tls.so.256.53
3cb4e01000-3cb5001000 ---p 00001000 fd:00 3529469                        
/usr/lib64/tls/libnvidia-tls.so.256.53
3cb5001000-3cb5002000 rw-p 00001000 fd:00 3529469                        
/usr/lib64/tls/libnvidia-tls.so.256.53
3cb5600000-3cb567e000 r-xp 00000000 fd:00 3518069                        
/usr/lib64/libGLU.so.1.3.060501
3cb567e000-3cb587e000 ---p 0007e000 fd:00 3518069                        
/usr/lib64/libGLU.so.1.3.060501
3cb587e000-3cb5880000 rw-p 0007e000 fd:00 3518069                        
/usr/lib64/libGLU.so.1.3.060501
3cb5a00000-3cb5a08000 r-xp 00000000 fd:00 4916555                        
/usr/lib64/libXi.so.6.0.0
3cb5a08000-3cb5c07000 ---p 00008000 fd:00 4916555                        
/usr/lib64/libXi.so.6.0.0
3cb5c07000-3cb5c08000 rw-p 00007000 fd:00 4916555                        
/usr/lib64/libXi.so.6.0.0
3cb8c00000-3cb8cbd000 r-xp 00000000 fd:00 4718799                        
/usr/lib64/libGL.so.256.53
3cb8cbd000-3cb8ebc000 ---p 000bd000 fd:00 4718799                        
/usr/lib64/libGL.so.256.53
3cb8ebc000-3cb8ef2000 rwxp 000bc000 fd:00 4718799                        
/usr/lib64/libGL.so.256.53
3cb8ef2000-3cb8f08000 rwxp 00000000 00:00 0 
3cbf000000-3cc028e000 r-xp 00000000 fd:00 4718798                        
/usr/lib64/libnvidia-glcore.so.256.53
3cc028e000-3cc048e000 ---p 0128e000 fd:00 4718798                        
/usr/lib64/libnvidia-glcore.so.256.53
3cc048e000-3cc09d7000 rwxp 0128e000 fd:00 4718798                        
/usr/lib64/libnvidia-glcore.so.256.53
3cc09d7000-3cc09ed000 rwxp 00000000 00:00 0 
3cc1800000-3cc185b000 r-xp 00000000 fd:00 3514143                        
/usr/lib64/libXt.so.6.0.0
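
The alternating `r--p` / `rw-p` pattern in the excerpt above can be checked mechanically. The following Python sketch is an illustration rather than the tooling used in the original debugging session. It parses the first few lines of the excerpt and answers whether a buffer range is fully writable. The function names `parse_maps`, `perms_for` and `writable` are hypothetical.

```python
# First four anonymous regions copied from the maps excerpt above.
MAPS_EXCERPT = """\
0d2dd000-0d2f2000 r--p 00000000 00:00 0
0d2f2000-0d2f4000 rw-p 00000000 00:00 0
0d2f4000-0d307000 r--p 00000000 00:00 0
0d307000-0d309000 rw-p 00000000 00:00 0
"""


def parse_maps(text):
    """Return a list of (start, end, perms) tuples from /proc/<pid>/maps text."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or "-" not in fields[0]:
            continue  # skip wrapped path lines and blanks
        start, end = (int(x, 16) for x in fields[0].split("-"))
        regions.append((start, end, fields[1]))
    return regions


def perms_for(regions, addr):
    """Permission string of the region containing addr, or None if unmapped."""
    for start, end, perms in regions:
        if start <= addr < end:
            return perms
    return None


def writable(regions, addr, size):
    """True only if every byte of [addr, addr + size) is in a writable region."""
    pos = addr
    while pos < addr + size:
        hit = next(((s, e, p) for s, e, p in regions if s <= pos < e), None)
        if hit is None or "w" not in hit[2]:
            return False
        pos = hit[1]  # continue from the end of the region just checked
    return True


regions = parse_maps(MAPS_EXCERPT)
print(perms_for(regions, 0x0D2DD000))
print(writable(regions, 0x0D2F3F00, 512))
```

A 512-byte buffer starting at `0x0d2f3f00` begins in a `rw-p` region but crosses into the `r--p` region at `0x0d2f4000`, so a read() filling it would fail even though the allocation itself is intact.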

I am 99% sure that this isn't a compiler version problem.

Original comment by ble...@gmail.com on 21 Jul 2011 at 6:18

GoogleCodeExporter commented 8 years ago
I think you are correct. I was testing this on Fedora 13 and finally got a 
crash with the Siggraph asset in just SimpleAbcViewer.

In Maya it worked fine.

gcc version 4.4.5 20101112 (Red Hat 4.4.5-2) (GCC) 
NVIDIA UNIX x86_64 Kernel Module  275.09.07

Original comment by miller.lucas on 21 Jul 2011 at 6:39