NCAR / ucomp-pipeline

Data processing pipeline for UCoMP

Fix launch process hang #176

Closed mgalloy closed 3 months ago

mgalloy commented 12 months ago

The Python process hangs at the end of a UCoMP reprocessing run.

The last nightly regression test in which 20220325 ran successfully was on 2022-08-31. The nightly test runs two dates: 20210810 continues to run successfully, but the second date and the final report have not run since then.

Tasks

Summary

The main-level examples at the end of the production code files that seem to be causing this issue have been moved to separate files in the examples/ directory.

[!NOTE] Time estimate: I am stuck on how to fix this since it doesn't seem to be an issue with the code, but with the build environment. I cannot give even a rough estimate of how long this will take to fix, but it must be done before the next reprocessing.

mgalloy commented 12 months ago

The commits for 2022-09-01 are:

I don't see why any of these commits would be problematic.

mgalloy commented 12 months ago

The last changes to ucomp.in were 8b6caefed50727fd5fa4aefd0e8ed3d8a2d1fe94 ("Printing bar graph of number of FITS files in directories for ucomp ls") and bd5f9ff1b797cd7562da1fe0cb35aad8172bb0cd ("Adding try/catch for keyboard interrupts for anything") on 2022-08-30.

mgalloy commented 12 months ago

The launch process still hangs at 3ed23ab59e10d929d673062286c314b71a4587d6.

mgalloy commented 12 months ago

The launch process still hangs at f725c864c485b17afe421b8fa56ca3182cd3b2ad — before all the recent changes to ucomp.in.

mgalloy commented 12 months ago

The released production version 0.5.0 does not hang.

mgalloy commented 12 months ago

I think this is something about the environment. When I check out the actual release version for 0.5.0 (either by commit hash or from the production branch) and configure/build/install it again in my software directory, it hangs.

What differences could it be picking up?

mgalloy commented 12 months ago

I realized that mahi had a bunch of hung processes because of the nightly regression tests. When I started killing them, they moved on to the next day and sent report/error emails. An error I got was:

Intel MKL FATAL ERROR: Cannot load /home/mgalloy/anaconda3/lib/python3.10/site-packages/mkl/../../../libmkl_rt.so.1.

mgalloy commented 12 months ago

Building on another machine does not help. That machine didn't have the mysql-dev package, so that is not the problem either.

mgalloy commented 11 months ago

It seems to be IDL hanging. When a recent process hung, this was the output from ps:

mgalloy  10260  0.0  0.0 400432 59004 pts/2    S+   12:23   0:00 /home/mgalloy/anaconda3/bin/python3.10 ./ucomp reprocess -f latest 20220729
mgalloy  10262  0.0  0.0 113292  1484 pts/2    S+   12:23   0:00 /bin/sh /home/mgalloy/software/ucomp-pipeline/bin/ucomp_script.sh ucomp_reprocess_wrapper /home/mgalloy/projects/ucomp-config/ucomp.latest.cfg 20220729
mgalloy  10272 98.8  0.0 358304 19876 pts/2    Rl+  12:23  22:34 /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/idl -quiet -IDL_QUIET 1 -IDL_STARTUP  -IDL_PATH +/home/mgalloy/software/ucomp-pipeline/src:/home/mgalloy/software/ucomp-pipeline/regression:/home/mgalloy/software/ucomp-pipeline/ssw:/home/mgalloy/software/ucomp-pipeline/gen:+/home/mgalloy/software/ucomp-pipeline/lib:<IDL_DEFAULT> -IDL_DLM_PATH +/home/mgalloy/software/ucomp-pipeline/lib:+/home/mgalloy/software/ucomp-pipeline/src:<IDL_DEFAULT> -e ucomp_reprocess_wrapper, '20220729', '/home/mgalloy/projects/ucomp-config/ucomp.latest.cfg'

The IDL process is also using a lot of CPU. For reference, the process state codes in the STAT column mean:

S    interruptible sleep (waiting for an event to complete)
R    running or runnable (on run queue)

l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
+    is in the foreground process group
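
As a quick aid for reading the STAT column, here is a minimal Python sketch that expands a code such as Rl+ into its descriptions. The function name decode_stat and the lookup tables are illustrative and not part of the pipeline; the descriptions follow the procps ps man page, and only the more common codes are covered.

```python
# Process state codes and modifier flags as documented in the procps ps man page.
STATES = {
    "D": "uninterruptible sleep (usually IO)",
    "R": "running or runnable (on run queue)",
    "S": "interruptible sleep (waiting for an event to complete)",
    "T": "stopped by job control signal",
    "Z": "defunct (zombie) process",
}
FLAGS = {
    "<": "high-priority (not nice to other users)",
    "N": "low-priority (nice to other users)",
    "L": "has pages locked into memory",
    "s": "is a session leader",
    "l": "is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)",
    "+": "is in the foreground process group",
}


def decode_stat(stat):
    """Expand a ps STAT string into a list of human-readable descriptions."""
    if not stat or stat[0] not in STATES:
        raise ValueError(f"unrecognized STAT value: {stat!r}")
    descriptions = [STATES[stat[0]]]
    for flag in stat[1:]:
        descriptions.append(FLAGS.get(flag, f"unknown flag {flag!r}"))
    return descriptions
```

For the IDL process above, decode_stat("Rl+") describes a running, multi-threaded process in the foreground process group.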

The Python and bash scripts are sleeping, waiting for the IDL process to return. The IDL process is running (and using a lot of CPU), but not making any progress.
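
One possible stopgap, so that a hung IDL process cannot block the next date or the final report, is a watchdog in the launcher. This is only a sketch under stated assumptions, not the actual ucomp launcher code: run_with_watchdog and the timeout value are hypothetical, and it assumes a POSIX system. The child is started in its own session so that the whole process group (the wrapper script and the IDL process it spawns) can be killed together.

```python
import os
import signal
import subprocess


def run_with_watchdog(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed on timeout."""
    # start_new_session=True puts the child in its own session and process
    # group, so the group id equals the child's pid (POSIX only).
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Kill the whole group: the wrapper script and anything it spawned.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

A call such as run_with_watchdog(["./ucomp", "reprocess", "-f", "latest", "20220729"], timeout_s=6 * 3600) would return None if the run exceeded six hours; the six-hour limit is an arbitrary placeholder. This only works around the hang and does not explain it.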

mgalloy commented 11 months ago

Launching the process directly from the command line, instead of through the Python and bash scripts, still hangs:

ucomp-pipeline$ /opt/share/idl8.7.3/idl87/bin/idl -quiet -IDL_QUIET 1 -IDL_STARTUP "" -IDL_PATH "+/home/mgalloy/software/ucomp-pipeline/src:/home/mgalloy/software/ucomp-pipeline/regression:/home/mgalloy/software/ucomp-pipeline/ssw:/home/mgalloy/software/ucomp-pipeline/gen:+/home/mgalloy/software/ucomp-pipeline/lib:<IDL_DEFAULT>" -IDL_DLM_PATH "+/home/mgalloy/software/ucomp-pipeline/lib:+/home/mgalloy/software/ucomp-pipeline/src:<IDL_DEFAULT>" -e "ucomp_reprocess_wrapper, '20220729', '/home/mgalloy/projects/ucomp-config/ucomp.latest.cfg'"
Licensed for use by: NCAR UCAR - 95183
License: 770656

So this is something to do with IDL hanging. Launching a simple command with the same command line arguments does not hang:

ucomp-pipeline$ /opt/share/idl8.7.3/idl87/bin/idl -quiet -IDL_QUIET 1 -IDL_STARTUP "" -IDL_PATH "+/home/mgalloy/software/ucomp-pipeline/src:/home/mgalloy/software/ucomp-pipeline/regression:/home/mgalloy/software/ucomp-pipeline/ssw:/home/mgalloy/software/ucomp-pipeline/gen:+/home/mgalloy/software/ucomp-pipeline/lib:<IDL_DEFAULT>" -IDL_DLM_PATH "+/home/mgalloy/software/ucomp-pipeline/lib:+/home/mgalloy/software/ucomp-pipeline/src:<IDL_DEFAULT>" -e "print, 'hello'"
Licensed for use by: NCAR UCAR - 95183
License: 770656
hello

mgalloy commented 11 months ago

Upgrading to IDL 8.9 did not help:

/opt/share/idl8.9/idl89/bin/idl -quiet -IDL_QUIET 1 -IDL_STARTUP "" -IDL_PATH "+/home/mgalloy/software/ucomp-pipeline/src:/home/mgalloy/software/ucomp-pipeline/regression:/home/mgalloy/software/ucomp-pipeline/ssw:/home/mgalloy/software/ucomp-pipeline/gen:+/home/mgalloy/software/ucomp-pipeline/lib:<IDL_DEFAULT>" -IDL_DLM_PATH "+/home/mgalloy/software/ucomp-pipeline/lib:+/home/mgalloy/software/ucomp-pipeline/src:<IDL_DEFAULT>" -e "ucomp_reprocess_wrapper, '20220729', '/home/mgalloy/projects/ucomp-config/ucomp.latest.cfg'"

mgalloy commented 11 months ago

Maybe some objects haven't been freed, and that causes an issue when the IDL process finishes?

mgalloy commented 11 months ago

I ran help, /heap_variables right before the final end of the reprocessing command in IDL, but got:

Heap Variables:
    # Pointer: 0
    # Object : 0

mgalloy commented 11 months ago

Using a node-locked license hangs as well.

mgalloy commented 11 months ago

The KCor pipeline does not hang, even on new builds of the pipeline.

mgalloy commented 11 months ago

Other things to check:

mgalloy commented 11 months ago

Launching IDL with an empty DLM path still causes a hang:

$ /opt/share/idl8.7.3/idl87/bin/idl -quiet -IDL_QUIET 1 -IDL_STARTUP "" -IDL_PATH "+/home/mgalloy/software/ucomp-pipeline/src:/home/mgalloy/software/ucomp-pipeline/regression:/home/mgalloy/software/ucomp-pipeline/ssw:/home/mgalloy/software/ucomp-pipeline/gen:+/home/mgalloy/software/ucomp-pipeline/lib:<IDL_DEFAULT>" -IDL_DLM_PATH "" -e "ucomp_reprocess_wrapper, '20220729', '/home/mgalloy/projects/ucomp-config/ucomp.latest.cfg'"

This crashes because there are routines in DLMs that it needs, but it also hangs after the crash.

mgalloy commented 11 months ago

It looks like the libraries are indeed the same versions:

ucomp-pipeline$ ldd ./lib/mysql/mg_mysql.linux.x86_64.so
    linux-vdso.so.1 =>  (0x00007ffdf75c3000)
    libidl.so.8.7 => not found
    libmysqlclient.so.18 => /usr/lib64/mysql/libmysqlclient.so.18 (0x00007f2f5eac3000)
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007f2f5e6f5000)
    libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f2f5e4d9000)
    libz.so.1 => /usr/lib64/libz.so.1 (0x00007f2f5e2c3000)
    libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007f2f5e051000)
    libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007f2f5dbee000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f2f5d9ea000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f2f5d6e2000)
    libm.so.6 => /usr/lib64/libm.so.6 (0x00007f2f5d3e0000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f2f5f1c8000)
    libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x00007f2f5d193000)
    libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00007f2f5ceaa000)
    libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007f2f5cca6000)
    libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x00007f2f5ca73000)
    libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f2f5c85d000)
    libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x00007f2f5c64d000)
    libkeyutils.so.1 => /usr/lib64/libkeyutils.so.1 (0x00007f2f5c449000)
    libresolv.so.2 => /usr/lib64/libresolv.so.2 (0x00007f2f5c22f000)
    libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f2f5c008000)
    libpcre.so.1 => /usr/lib64/libpcre.so.1 (0x00007f2f5bda6000)
ucomp-pipeline$ ldd ./lib/mg_dist_tools.linux.x86_64.so
    linux-vdso.so.1 =>  (0x00007ffd431a6000)
    libidl.so.8.7 => not found
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007f8f85c7f000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f8f8624f000)
ucomp-pipeline$ ldd ./src/level1/ucomp_level1.linux.x86_64.so
    linux-vdso.so.1 =>  (0x00007ffcfddef000)
    libidl.so.8.7 => not found
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007f81a4c15000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f81a51e5000)

mgalloy commented 11 months ago

I think the problem is mg_loadct.pro in d1a080a635267ad5fc907d2523b17b0b4ec39179. If I take the current code and just delete mg_loadct.pro, it works. (Make sure to delete it in the installation as well; not deleting it there led me to think that old code was not working.)

mgalloy commented 11 months ago

I think this solves it. It works for quick days called from the command line now. I am still not sure why just removing mg_loadct.pro solved the problem. It was the presence of mg_loadct.pro that caused the problem — even if nothing called it, the pipeline would hang.

mgalloy commented 11 months ago

I ran pstack on a process after it finished, while hung:

~$ /usr/bin/pstack 9027
Thread 18 (Thread 0x7f61bee6a700 (LWP 9032)):
#0  0x00007f61c1dd6de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61bf9145be in Poco::EventImpl::waitImpl(long) () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libalan.so
#2  0x00007f61bf8d29e0 in Poco::Timer::run() () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libalan.so
#3  0x00007f61bf8d0a9f in Poco::PooledThread::run() () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libalan.so
#4  0x00007f61bf8cec0b in Poco::ThreadImpl::runnableEntry(void*) () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libalan.so
#5  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 17 (Thread 0x7f61be669700 (LWP 9033)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61bf914443 in Poco::EventImpl::waitImpl() () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libalan.so
#2  0x00007f61bf8d0b33 in Poco::PooledThread::run() () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libalan.so
#3  0x00007f61bf8cec0b in Poco::ThreadImpl::runnableEntry(void*) () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libalan.so
#4  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 16 (Thread 0x7f61bd82f700 (LWP 9040)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 15 (Thread 0x7f61bd02e700 (LWP 9041)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 14 (Thread 0x7f61bc82d700 (LWP 9042)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 13 (Thread 0x7f61bc02c700 (LWP 9043)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7f61bb82b700 (LWP 9044)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f61bb02a700 (LWP 9045)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f61ba829700 (LWP 9046)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f61ba028700 (LWP 9047)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f61b9827700 (LWP 9048)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f61b9026700 (LWP 9049)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f61b8825700 (LWP 9050)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f61b8024700 (LWP 9051)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f61b7823700 (LWP 9052)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f61b7022700 (LWP 9053)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f61b6821700 (LWP 9054)):
#0  0x00007f61c1dd6a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2906d31 in thread_pool_func () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00007f61c1dd2ea5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f61c18e5b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f61c39297c0 (LWP 9027)):
#0  0x00007f61c2a7adb9 in IDL_UProMoveMainOverflowToFrame () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#1  0x00007f61c2876106 in IDL_Executive () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#2  0x00007f61c2c627bb in IDL_ExecuteCommandFromCmdLine () from /opt/share/idl8.7.3/idl87/bin/bin.linux.x86_64/libidl.so.8.7
#3  0x00000000004015ea in main ()
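
With 18 threads, the trace is easier to read when the threads are grouped. The following Python sketch counts threads by the first frame that is not a pthread primitive. The helper names parse_pstack and summarize_pstack are hypothetical, and the regular expressions assume the pstack text layout shown above.

```python
import re
from collections import Counter

# "Thread 18 (Thread 0x7f61bee6a700 (LWP 9032)):"
THREAD_RE = re.compile(r"Thread (\d+) \(")
# "#1  0x00007f61c29062ee in IDL_ThreadBEventWait () from ..."
FRAME_RE = re.compile(r"#\d+\s+0x[0-9a-fA-F]+ in (\S+)")


def parse_pstack(text):
    """Return {thread number: [function names, innermost frame first]}."""
    threads = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        thread_match = THREAD_RE.match(line)
        if thread_match:
            current = int(thread_match.group(1))
            threads[current] = []
            continue
        frame_match = FRAME_RE.match(line)
        if frame_match and current is not None:
            threads[current].append(frame_match.group(1))
    return threads


def summarize_pstack(text):
    """Count threads by their first frame that is not a pthread_* call."""
    counts = Counter()
    for frames in parse_pstack(text).values():
        signature = next(
            (name for name in frames if not name.startswith("pthread_")),
            "<no frames>",
        )
        counts[signature] += 1
    return counts
```

Applied to the trace above, this groups threads 2 through 16 under IDL_ThreadBEventWait (15 idle thread-pool workers), the two Poco threads under their waitImpl frames, and leaves the main thread alone in IDL_UProMoveMainOverflowToFrame.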

Following up with IDL people.

mgalloy commented 10 months ago

This is happening again. Maybe the issue is not mg_loadct.pro, but we are at some threshold and any new file causes the problem?

mgalloy commented 10 months ago

When I added ucomp_write_l2_images.pro and ucomp_l2_file.pro in aade8d55da5441033b1db2f1395896e5e9e882e3, this error popped up again.

When I deleted ucomp_l2_dynamics.pro and ucomp_l2_polarization.pro in 31855dfc3f34a759f25140b2b1fd6376e7d03e6e, the hang didn't go away. But when I then also deleted ucomp_l2_quick_invert.pro, ucomp_write_dynamics_image.pro, ucomp_write_quick_invert_image, and ucomp_write_polarization_image.pro, it did go away. Is there some complexity or lines-of-code limit that we are near?

mgalloy commented 10 months ago

The repo is now at 555 .pro files in the !path, with about 1034 routines when running the processing, and everything seems OK.

mgalloy commented 9 months ago

This has not reappeared despite adding several files in the last week.

mgalloy commented 9 months ago

Let's hope this doesn't come back, because I don't really understand it. Removing .pro files (make sure to remove them from the installation, not just the source) seems to fix it, but I don't see any particular limit: we now have 567 .pro files in the !path and everything seems OK.

mgalloy commented 3 months ago

We are currently at 574 .pro files in the gen, lib, src, and ssw directory hierarchies. There are 1562 .pro files in the IDL 8.7.3 library.
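
To track this count over time, a small Python helper can do the same job as a find/wc pipeline. This is a sketch only: count_pro_files is a hypothetical name, and the default directory list is taken from the comment above.

```python
from pathlib import Path


def count_pro_files(root, subdirs=("gen", "lib", "src", "ssw")):
    """Return {subdir: number of .pro files found recursively under it}."""
    root = Path(root)
    counts = {}
    for subdir in subdirs:
        directory = root / subdir
        if directory.is_dir():
            counts[subdir] = sum(
                1 for path in directory.rglob("*.pro") if path.is_file()
            )
        else:
            # A missing hierarchy simply contributes no files.
            counts[subdir] = 0
    return counts
```

Summing the values, sum(count_pro_files(".").values()), gives the total when run from the top of the repository. Running it on both the source tree and the installation would also show whether a removed file is still installed.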

mgalloy commented 3 months ago

It looks like commit 69ca48ec0cfbb6741a6f6ac8eff6bb2f6a126cf2 caused the hang this time. At least that's the commit at which the nightly regression tests stopped completing on time. The one odd thing about this commit is that a lengthy main-level example program was added to a file.

mgalloy commented 3 months ago

It's interesting that KCor doesn't have this problem:

kcor-pipeline$ find {gen,lib,src,ssw} -name '*.pro' | wc -l
730

mgalloy commented 3 months ago

The initial culprit mg_loadct.pro had a somewhat lengthy main-level program at the end of it.

mgalloy commented 3 months ago

Profiling compiles all the .pro files in the repo, so I should try a run with profiling off.

mgalloy commented 3 months ago

Removing the main-level example programs seems to have fixed this. Should report to Harris.
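
To catch main-level programs before they get back into the production files, a rough check can be scripted. The Python sketch below is a heuristic, not a real IDL parser: main_level_lines is a hypothetical name, it assumes routines start with pro or function at the beginning of a line and end with a bare end statement, that nested blocks are closed with endif, endfor, and so on rather than a bare end, and that no string literal contains a semicolon.

```python
import re

# A routine declaration: "pro name" or "function name" at the start of a line.
ROUTINE_START = re.compile(r"^(pro|function)\s+\S", re.IGNORECASE)


def main_level_lines(source):
    """Return the code lines of an IDL source text that lie outside any routine."""
    in_routine = False
    main_level = []
    for raw_line in source.splitlines():
        # Drop comments (naive: assumes ';' never appears inside a string).
        code = raw_line.split(";", 1)[0].strip()
        if not code:
            continue
        if in_routine:
            # Heuristic: a bare "end" closes the routine.
            if code.lower() == "end":
                in_routine = False
            continue
        if ROUTINE_START.match(code):
            in_routine = True
            continue
        main_level.append(code)
    return main_level
```

A file with no main-level program yields an empty list, so a check over the source tree could flag any .pro file under src for which the result is non-empty.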

jburkepile commented 3 months ago

This is great news Mike. Nice work!

mgalloy commented 3 months ago

I sent an email off to Harris about this. I will probably try to put together a reproducible test case that triggers this error so they can track it down.

mgalloy commented 3 months ago

Just to document some other things to try if this comes back: