easybuilders / easybuild-framework

EasyBuild is a software installation framework in Python that allows you to install software in a structured and robust way.
https://easybuild.io
GNU General Public License v2.0
147 stars 200 forks source link

Meaningful error output (and actual location of the log file on the line about log files) #3915

Open surak opened 2 years ago

surak commented 2 years ago

The worst part of Easybuild, when it fails, is tracking down where things went wrong.

While rebuilding a LOONG toolchain, one ends up with something like this:

== configuring...
== ... (took 5 secs)
== building...
== ... (took 31 mins 50 secs)
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/swmanage/GCCcore/11.2.0/system-system): build failed (first 300 chars): cmd " make  -j 20  bootstrap " exited with exit code 2 and output:
echo stage3 > stage_final
make[1]: Entering directory `/dev/shm/swmanage/GCCcore/11.2.0/system-system/gcc-11.2.0/stage3_obj'
make[2]: Entering directory `/dev/shm/swmanage/GCCcore/11.2.0/system-system/gcc-11.2.0/stage3_obj'
make[2]:  (took 32 mins 11 secs)
== Results of the build can be found in the log file(s) 
ERROR: Build of /p/fastdata/zam/swmanage/EasyBuild/2022/Golden_Repo/g/GCCcore/GCCcore-11.2.0.eb failed (err: 'build failed (first 300 chars): cmd " make  -j 20  bootstrap " exited with exit code 2 and output:\necho stage3 > stage_final\nmake[1]: Entering directory `/dev/shm/swmanage/GCCcore/11.2.0/system-system/gcc-11.2.0/stage3_obj\'\nmake[2]: Entering directory `/dev/shm/swmanage/GCCcore/11.2.0/system-system/gcc-11.2.0/stage3_obj\'\nmake[2]: ')

No idea. The output itself is useless.

No idea - it was written at the beginning, but it's not written here at the end. The buffer scrolled so much that there's no location of the logfile here anymore. Time to fish /tmp again... Plus, the file at the beginning gets deleted, although the directory is kept.

Suggestions:

boegel commented 2 years ago

@surak I fully agree that we should do a lot better here. We should make an honest attempt to fish out the actual error message from the log, rather than only showing the first 300 characters, since that's indeed usually useless.

I didn't actually realize that the location of the log file isn't printed at the end at all, that's definitely an easy change to make.

For me it's not a problem because I have the use of eb --last-log deeply engraved into my muscle memory (see also https://easybuilders.github.io/easybuild-tutorial/2021-isc21/troubleshooting/#last-log).

I have an ebl alias for this to save more keystrokes:

alias ebl='vim $(eb --last-log | grep -v ">>")'

The grep -v part is only there because I always have $EB_VERBOSE set.

boegel commented 2 years ago

The Results ... line should actually include the path to the application build log, see https://github.com/easybuilders/easybuild-framework/blob/0c1a642848681a71709c2dffefdc305a86cb789b/easybuild/framework/easyblock.py#L4186-L4187

For some reason it's not for you though? So the glob that defines the logs variable comes back with an empty list somehow?

Can you check the temporary directory created by EasyBuild, are the logs actually still there?

Also, can you share the output of eb --show-config?

smoors commented 2 years ago

for me the log files shown at the beginning and at the end are not the same, which is a bit confusing. at the beginning:

== Temporary log file in case of crash /dev/shm/vsc10009/eb-e_w78y/easybuild-6J_zLt.log

at the end (after a crash):

== Results of the build can be found in the log file(s) /dev/shm/vsc10009/eb-e_w78y/easybuild-GROMACS-2021.4-20211222.085224.wZvkK.log
boegel commented 2 years ago

@smoors That's because there are two types of log files: a "session" log file, and a separate log file per installation being performed (which gets copied to the installation directory on completion); see also https://docs.easybuild.io/en/latest/Logfiles.html

boegel commented 2 years ago

The Results ... line should actually include the path to the application build log, see

https://github.com/easybuilders/easybuild-framework/blob/0c1a642848681a71709c2dffefdc305a86cb789b/easybuild/framework/easyblock.py#L4186-L4187

For some reason it's not for you though? So the glob that defines the logs variable comes back with an empty list somehow?

Can you check the temporary directory created by EasyBuild, are the logs actually still there?

Also, can you share the output of eb --show-config?

@surak Can you clarify?

surak commented 2 years ago

I think that the issue description is clear enough, no? There's no pointer at the actual error, neither where to find it.

What I have been doing on the supercomputers for years is to run everything with -l - as most errors tend to show close to the end, with a bit of a scroll at least.

On other environments, -l seems to print much more than before, or so was my impression. When updating the timer, does it do some ascii commands on the timer line or does it print everything again?

In any case, if something fails, the last thing I want to see on the screen is either:

Currently, I have neither.

surak commented 2 years ago
== installing extension rgl 0.107.14 (386/1038)...
  >> defining build environment for gcccoremkl/11.2.0-2021.4.0 toolchain
  >> running command:
        [started at: 2022-01-24 22:18:12]
        [working dir: /dev/shm/strube1/juwelsbooster/R/4.1.2/gcccoremkl-11.2.0-2021.4.0/rgl]
        [output logged in /tmp/eb-uw4uv1em/easybuild-run_cmd-ytni5pln.log]
        tar xzf /p/project/ccstao/cstao05/easybuild/juwelsbooster/sources/r/R/extensions/rgl_0.107.14.tar.gz
  >> command completed: exit 0, ran in < 1s
  >> running command:
        [started at: 2022-01-24 22:18:12]
        [working dir: /dev/shm/strube1/juwelsbooster/R/4.1.2/gcccoremkl-11.2.0-2021.4.0/rgl/rgl]
        [output logged in /tmp/eb-uw4uv1em/easybuild-run_cmd-eh_rwk__.log]
        R CMD INSTALL /dev/shm/strube1/juwelsbooster/R/4.1.2/gcccoremkl-11.2.0-2021.4.0/rgl/rgl   
--library=/p/project/ccstao/cstao05/easybuild/juwelsbooster/software/R/4.1.2-gcccoremkl-11.2.0-2021.4.0/lib64/R/library --no-clean-on-error
  >> command completed: exit 1, ran in 00h01m25s
==      ... (took 1 min 26 secs)
== ... (took 3 hours 8 mins 51 secs)
== FAILED: Installation ended unsuccessfully (build directory: /dev/shm/strube1/juwelsbooster/R/4.1.2/gcccoremkl-11.2.0-2021.4.0): build failed (first 300 
chars): cmd " R CMD INSTALL /dev/shm/strube1/juwelsbooster/R/4.1.2/gcccoremkl-11.2.0-2021.4.0/rgl/rgl   
--library=/p/project/ccstao/cstao05/easybuild/juwelsbooster/software/R/4.1.2-gcccoremkl-11.2.0-2021.4.0/lib64/R/library --no-clean-on-error " exited with exit 
code 1 and output:
* installing *source* pack (took 3 hours 36 mins 9 secs)
== Results of the build can be found in the log file(s) /tmp/eb-uw4uv1em/easybuild-R-4.1.2-20220124.184330.qfPqj.log
Installing 'rgl' extension (386/1038) ━━━━━━━━━━━━━━╸━━━━━━━━━━━━━━━━━━━━━━━━━ 3:08:15
ERROR: Build of /p/project/cjsc/a.strube/easybuild-repository/Golden_Repo/r/R/R-4.1.2-gcccoremkl-11.2.0-2021.4.0.eb failed (err: 'build failed (first 300 chars): cmd " R CMD INSTALL /dev/shm/strube1/juwelsbooster/R/4.1.2/gcccoremkl-11.2.0-2021.4.0/rgl/rgl   --library=/p/project/ccstao/cstao05/easybuild/juwelsbooster/software/R/4.1.2-gcccoremkl-11.2.0-2021.4.0/lib64/R/library --no-clean-on-error " exited with exit code 1 and output:\n* installing *source* pack')

It's like... what happened? I don't know. And the error log is not the last thing I see: Rather, it's completely useless info. I have to fish for the log file on the letter soup here.