Folks, my fear is that this is actually an extremely HUGE issue; this simply
should not be happening, and the fact that it's a Heisenbug is even more
disturbing.
We may want to dogpile this really aggressively.
Original comment by ard...@gmail.com
on 15 Jul 2011 at 10:20
I can HAZ asset?
Or at least some statistics about the asset?
Original comment by miller.lucas
on 15 Jul 2011 at 10:52
I created a programmatic test asset:
http://code.google.com/r/ardent-embic/source/detail?r=4ea1c139a89fb1cd4200de5f7ac6f3d61b7c03fb&name=default
Here's how I can get the crash:
ferment bin/SimpleAbcViewer> pwd
/home/jardent/alembic_build/examples/bin/SimpleAbcViewer
ferment bin/SimpleAbcViewer> ./SimpleAbcViewer ~/alembic_build/lib/Alembic/AbcGeom/Tests/transformingMesh1.abc
viewerpath: ./SimpleAbcViewer
renderscript: ./SimpleAbcViewerRenderit
Beginning to open archive:
/home/jardent/alembic_build/lib/Alembic/AbcGeom/Tests/transformingMesh1.abc
Opened archive and top object, creating drawables.
Created drawables, getting time range.
Min Time: 0.25 seconds
Max Time: 4.375 seconds
Loading min time.
Done opening archive. Elapsed time: 5.22 seconds.
Bounds at min time: (-1.15883 -1 -1.40883) to (1.65883 1 1.40883)
HDF5-DIAG: Error detected in HDF5 (1.8.7) thread 140512367306528:
#000: H5A.c line 550 in H5Aopen(): unable to load attribute info from object header
major: Attribute
minor: Unable to initialize object
#001: H5Oattribute.c line 512 in H5O_attr_open_by_name(): can't open attribute
major: Attribute
minor: Can't open object
#002: H5HF.c line 680 in H5HF_op(): can't operate on object from fractal heap
major: Heap
minor: Can't operate on object
#003: H5HFman.c line 468 in H5HF_man_op(): unable to operate on heap object
major: Heap
minor: Can't operate on object
#004: H5HFman.c line 296 in H5HF_man_op_real(): unable to protect fractal heap direct block
major: Heap
minor: Unable to protect metadata
#005: H5HFdblock.c line 489 in H5HF_man_dblock_protect(): unable to protect fractal heap direct block
major: Heap
minor: Unable to protect metadata
#006: H5AC.c line 1322 in H5AC_protect(): H5C_protect() failed.
major: Object cache
minor: Unable to protect metadata
#007: H5C.c line 3567 in H5C_protect(): can't load entry
major: Object cache
minor: Unable to load metadata into cache
#008: H5C.c line 7957 in H5C_load_entry(): unable to load entry
major: Object cache
minor: Unable to load metadata into cache
#009: H5HFcache.c line 1307 in H5HF_cache_dblock_load(): can't read fractal heap direct block
major: Heap
minor: Read failed
#010: H5Fio.c line 113 in H5F_block_read(): read through metadata accumulator failed
major: Low-level I/O
minor: Read failed
#011: H5Faccum.c line 205 in H5F_accum_read(): driver read request failed
major: Low-level I/O
minor: Read failed
#012: H5FDint.c line 142 in H5FD_read(): driver read request failed
major: Virtual File Layer
minor: Read failed
#013: H5FDsec2.c line 719 in H5FD_sec2_read(): file read failed: time = Fri Jul 15 17:20:44 2011
, filename =
'/home/jardent/alembic_build/lib/Alembic/AbcGeom/Tests/transformingMesh1.abc',
file descriptor = 10, errno = 14, error message = 'Bad address', buf =
0x659b000, size = 200, offset = 1083561
major: Low-level I/O
minor: Read failed
terminate called after throwing an instance of 'Alembic::Util::v1::Exception'
what(): IXformSchema::get()
ERROR: EXCEPTION:
IScalarProperty::get()
ERROR: EXCEPTION:
Couldn't open attribute named: .inherits.smp0
Abort (core dumped)
ferment bin/SimpleAbcViewer>
Original comment by ard...@gmail.com
on 16 Jul 2011 at 12:22
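For anyone reading the trace above, the final frame (#013) carries the useful numbers. The sketch below is only an illustration: it decodes the errno and checks whether the read buffer starts on a page boundary. The 4 KiB page size is an assumption (typical for x86_64 Linux), and `parse_diag` and `summarize` are hypothetical helper names, not part of HDF5 or Alembic.

```python
import errno
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, typical for x86_64 Linux

# Tail of frame #013 of the HDF5 diagnostic, copied from the trace above.
DIAG = ("file descriptor = 10, errno = 14, error message = 'Bad address', "
        "buf = 0x659b000, size = 200, offset = 1083561")


def parse_diag(text):
    """Extract the numeric fields of the failed read as integers."""
    fields = {}
    for key in ("errno", "buf", "size", "offset"):
        match = re.search(rf"\b{key} = (0x[0-9a-fA-F]+|\d+)", text)
        fields[key] = int(match.group(1), 0)  # base 0 accepts hex and decimal
    return fields


def summarize(fields):
    """Name the errno and report whether the buffer is page-aligned."""
    return {
        "errno_name": errno.errorcode.get(fields["errno"], "unknown"),
        "buf_page_aligned": fields["buf"] % PAGE_SIZE == 0,
        "crosses_page": (fields["buf"] % PAGE_SIZE) + fields["size"] > PAGE_SIZE,
    }


if __name__ == "__main__":
    print(summarize(parse_diag(DIAG)))
```

errno 14 is EFAULT: the kernel could not write to the destination buffer, which points at the memory side of the read rather than at the file on disk.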
Ah, a couple other things to note:
1) the crash happens once you try to play the file (hitting the '.' key);
2) some unspecified level of complexity in the asset is required. The smallest
I found was to call recurseCreateXform() with children = 2, level = 10.
Original comment by ard...@gmail.com
on 16 Jul 2011 at 12:24
Well, I tried to reproduce at home, on my much-more-modern Ubuntu laptop, and
failed to. The Archive was created with 10 levels and 3 children, resulting in
about half a million Objects (about 250k xforms, 250k meshes), and the
resulting file size was about 3GB. It took forever, but it played.
It's distinctly possible that this error is due to ILM's ancient
compiler/toolchain.
Original comment by ard...@gmail.com
on 18 Jul 2011 at 5:43
I also wasn't able to reproduce this at Imageworks, so I currently suspect it
might be some site-specific difference at ILM.
Original comment by miller.lucas
on 18 Jul 2011 at 5:48
Speaking with the team here, this issue can be closed as invalid.
Feel free to comment if you have any questions.
Thanks, Scott
Original comment by scottmmo...@gmail.com
on 20 Jul 2011 at 10:02
After a bit of time with some deep debugging and malloc hooking, I'm convinced
that this bug is the result of our nvidia drivers (with _maybe_ an outside
chance that it's our glut/glew libs, but this seems less likely).
We have nvidia drivers 256.53:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 256.53 Fri Aug 27 20:27:48 PDT 2010
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)
When this abort occurs in gdb, I can see that H5FD_sec2_read() is calling
HDread to read into memory, and the system call reports a bad address.
The read buffer is valid and was correctly allocated (H5B2_hdr_init()
allocates a page of 512 bytes). In the case I had, the SEGV is triggered while
trying to read an attribute (H5B2_ATTR_DENSE_NAME_ID). The system call is
reporting a bad address, and it's right at a memory page boundary, but after
hooking free I know that the memory is still valid (and I can see BTLF in all
the bytes, i.e. a B-tree leaf page). I then looked at /proc/<pid>/maps and I
can see that every other memory page has had its permission bits flipped to
read-only. I straced mprotect, which didn't actually show anything except for
standard startup stuff, so I'm 80% convinced this is the nvidia driver doing
bad stuff (via the kernel) to our process's memory pages.
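The failure mode described above (read() reporting "Bad address" for a buffer that is still mapped but no longer writable) can be reproduced in isolation. This is a minimal, Linux-only sketch under stated assumptions: it calls mprotect deliberately to simulate the permission flip, which is not how the flip happened in this report (strace showed no such calls), and `read_into_readonly_page` is a hypothetical helper name.

```python
import ctypes
import errno
import mmap
import tempfile


def read_into_readonly_page():
    """Return the errno name raised by read() into a read-only page (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    libc.mprotect.restype = ctypes.c_int

    page = mmap.mmap(-1, mmap.PAGESIZE)  # anonymous mapping, readable and writable
    # Take the mapping's address; the temporary ctypes view is released right away.
    addr = ctypes.addressof(ctypes.c_char.from_buffer(page))
    try:
        # Simulate the permission flip: the mapping stays valid but becomes read-only.
        if libc.mprotect(addr, mmap.PAGESIZE, mmap.PROT_READ) != 0:
            raise OSError(ctypes.get_errno(), "mprotect failed")
        with tempfile.TemporaryFile(buffering=0) as f:
            f.write(b"x" * 200)  # same size as the failed HDF5 read
            f.seek(0)
            try:
                f.readinto(page)  # read(2) straight into the protected page
            except OSError as exc:
                return errno.errorcode.get(exc.errno, str(exc.errno))
            return None
    finally:
        # Restore write access before releasing the mapping.
        libc.mprotect(addr, mmap.PAGESIZE, mmap.PROT_READ | mmap.PROT_WRITE)
        page.close()


if __name__ == "__main__":
    print(read_into_readonly_page())
```

On Linux the read is expected to fail with EFAULT, matching errno 14 in the HDF5 trace, even though the buffer was allocated correctly and never freed.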
It's possible that SimpleAbcViewer has some calls to GL that are leaking
resources, but even if SimpleAbcViewer is doing that, the nvidia drivers should
not be claiming the HDF5 memory.
If someone wants to audit the GL calls in SimpleAbcViewer, cool, but otherwise
I call this an nvidia driver bug and we can close this.
Here's a portion of maps
0d2dd000-0d2f2000 r--p 00000000 00:00 0
0d2f2000-0d2f4000 rw-p 00000000 00:00 0
0d2f4000-0d307000 r--p 00000000 00:00 0
0d307000-0d309000 rw-p 00000000 00:00 0
0d309000-0d314000 r--p 00000000 00:00 0
0d314000-0d316000 rw-p 00000000 00:00 0
0d316000-0d31d000 r--p 00000000 00:00 0
0d31d000-0d31f000 rw-p 00000000 00:00 0
0d31f000-0d32f000 r--p 00000000 00:00 0
0d32f000-0d330000 rw-p 00000000 00:00 0
0d330000-0d332000 r--p 00000000 00:00 0
0d332000-0d333000 rw-p 00000000 00:00 0
0d333000-0d34c000 r--p 00000000 00:00 0
0d34c000-0d34d000 rw-p 00000000 00:00 0
0d34d000-0d352000 r--p 00000000 00:00 0
0d352000-0d353000 rw-p 00000000 00:00 0
...
3cb4e00000-3cb4e01000 r-xp 00000000 fd:00 3529469 /usr/lib64/tls/libnvidia-tls.so.256.53
3cb4e01000-3cb5001000 ---p 00001000 fd:00 3529469 /usr/lib64/tls/libnvidia-tls.so.256.53
3cb5001000-3cb5002000 rw-p 00001000 fd:00 3529469 /usr/lib64/tls/libnvidia-tls.so.256.53
3cb5600000-3cb567e000 r-xp 00000000 fd:00 3518069 /usr/lib64/libGLU.so.1.3.060501
3cb567e000-3cb587e000 ---p 0007e000 fd:00 3518069 /usr/lib64/libGLU.so.1.3.060501
3cb587e000-3cb5880000 rw-p 0007e000 fd:00 3518069 /usr/lib64/libGLU.so.1.3.060501
3cb5a00000-3cb5a08000 r-xp 00000000 fd:00 4916555 /usr/lib64/libXi.so.6.0.0
3cb5a08000-3cb5c07000 ---p 00008000 fd:00 4916555 /usr/lib64/libXi.so.6.0.0
3cb5c07000-3cb5c08000 rw-p 00007000 fd:00 4916555 /usr/lib64/libXi.so.6.0.0
3cb8c00000-3cb8cbd000 r-xp 00000000 fd:00 4718799 /usr/lib64/libGL.so.256.53
3cb8cbd000-3cb8ebc000 ---p 000bd000 fd:00 4718799 /usr/lib64/libGL.so.256.53
3cb8ebc000-3cb8ef2000 rwxp 000bc000 fd:00 4718799 /usr/lib64/libGL.so.256.53
3cb8ef2000-3cb8f08000 rwxp 00000000 00:00 0
3cbf000000-3cc028e000 r-xp 00000000 fd:00 4718798 /usr/lib64/libnvidia-glcore.so.256.53
3cc028e000-3cc048e000 ---p 0128e000 fd:00 4718798 /usr/lib64/libnvidia-glcore.so.256.53
3cc048e000-3cc09d7000 rwxp 0128e000 fd:00 4718798 /usr/lib64/libnvidia-glcore.so.256.53
3cc09d7000-3cc09ed000 rwxp 00000000 00:00 0
3cc1800000-3cc185b000 r-xp 00000000 fd:00 3514143 /usr/lib64/libXt.so.6.0.0
I am 99% sure that this isn't a compiler version problem.
Original comment by ble...@gmail.com
on 21 Jul 2011 at 6:18
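For anyone who wants to check their own process for the same pattern, here is a rough sketch that parses maps text and lists the anonymous regions that are read-only. The sample is copied from the excerpt in the comment above; `Region`, `parse_maps`, `readonly_anonymous` and `region_containing` are hypothetical names for illustration, and the parser assumes the usual one-entry-per-line layout of /proc/<pid>/maps.

```python
from collections import namedtuple

Region = namedtuple("Region", "start end perms offset dev inode path")

# A few entries copied from the maps excerpt in the comment above.
SAMPLE = """\
0d2dd000-0d2f2000 r--p 00000000 00:00 0
0d2f2000-0d2f4000 rw-p 00000000 00:00 0
0d2f4000-0d307000 r--p 00000000 00:00 0
0d307000-0d309000 rw-p 00000000 00:00 0
3cb8c00000-3cb8cbd000 r-xp 00000000 fd:00 4718799 /usr/lib64/libGL.so.256.53
"""


def parse_maps(text):
    """Parse maps text into Region tuples, skipping lines that don't fit."""
    regions = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 5 or "-" not in parts[0]:
            continue  # e.g. an elided "..." line
        start, end = (int(x, 16) for x in parts[0].split("-"))
        path = parts[5].strip() if len(parts) > 5 else ""
        regions.append(Region(start, end, parts[1], int(parts[2], 16),
                              parts[3], int(parts[4]), path))
    return regions


def readonly_anonymous(regions):
    """Anonymous private mappings (heap-like memory) with no write permission."""
    return [r for r in regions
            if not r.path and r.inode == 0 and r.perms == "r--p"]


def region_containing(regions, addr):
    """Return the region that holds addr, or None if it is unmapped."""
    for r in regions:
        if r.start <= addr < r.end:
            return r
    return None


if __name__ == "__main__":
    for r in readonly_anonymous(parse_maps(SAMPLE)):
        print(f"{r.start:x}-{r.end:x} {r.perms} {r.end - r.start} bytes")
```

Feeding it the buffer address from a failing read via `region_containing` shows whether the buffer landed in one of the read-only regions.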
I think you are correct. I was testing this on Fedora 13 and finally got a
crash with the Siggraph asset in just SimpleAbcViewer.
In Maya it worked fine.
gcc version 4.4.5 20101112 (Red Hat 4.4.5-2) (GCC)
NVIDIA UNIX x86_64 Kernel Module 275.09.07
Original comment by miller.lucas
on 21 Jul 2011 at 6:39
Original issue reported on code.google.com by
ble...@gmail.com
on 14 Jul 2011 at 4:44