darshan-hpc / darshan

Darshan I/O characterization tool

darshan-job-summary hangs with darshan_partial files. #679

Open puneet336 opened 2 years ago

puneet336 commented 2 years ago

Hi, I am using darshan v3.3.1 to analyze an MPI application.

For small codes like the example at http://pramodkumbhar.com/2020/03/i-o-performance-analysis-with-darshan/, I got a trace file with a .darshan extension (~4K in size) and was able to generate reports in PDF format.

But with a larger code base that does a lot of I/O, the Darshan tool was unable to write the trace file in the working directory, so I had to point DARSHAN_LOGPATH at a different filesystem: DARSHAN_LOGPATH=ufs:/scratch/DARSHAN_LOGS1/. Afterwards, I got a .darshan_partial file of ~1.3MB at the end of the simulation, and I renamed its extension to .darshan. I have attempted multiple runs, but I still get .darshan_partial files even though the simulation terminated without error messages.
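
For reference, a minimal sketch of how the log path was redirected in the job script (the launch line and process count below are placeholders, not my exact command):

# point Darshan's log output at the scratch filesystem, using the ufs: prefix
export DARSHAN_LOGPATH=ufs:/scratch/DARSHAN_LOGS1/
# launch the MPI application as usual (placeholder launch line)
mpirun -np 64 ./mpi_application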

Afterwards, I tried the following command: darshan-job-summary.pl file.darshan

The process has been running for over an hour without producing any output.

I noticed the pdflatex process is in a sleep state:

2874399 puneet 20 0 124728 43824 1804 S 0.0 0.0 0:00.27 pdflatex

with the following content in the temp directory:

-rw-r--r-- 1 user1 internal 273933 Mar  4 04:56 file-access-read.dat
-rw-r--r-- 1 user1 internal    152 Mar  4 04:56 file-access-write-sh.dat
-rw-r--r-- 1 user1 internal   7690 Mar  4 04:56 file-access-write.dat
-rw-r--r-- 1 user1 internal    174 Mar  4 04:56 file-access-read-sh.dat
-rw-r--r-- 1 user1 internal   1370 Mar  4 04:56 file-access-eps.gplt
-rw-r--r-- 1 user1 internal    759 Mar  4 04:56 access-hist-eps.gplt
-rw-r--r-- 1 user1 internal   3057 Mar  4 04:56 variance-table.tex
-rw-r--r-- 1 user1 internal    279 Mar  4 04:56 title.tex
-rw-r--r-- 1 user1 internal   7029 Mar  4 04:56 time-summary.pdf
-rw-r--r-- 1 user1 internal    901 Mar  4 04:56 time-summary-eps.gplt
-rw-r--r-- 1 user1 internal  26723 Mar  4 04:56 time-summary.eps
-rw-r--r-- 1 user1 internal    178 Mar  4 04:56 time-summary.dat
-rw-r--r-- 1 user1 internal   2563 Mar  4 04:56 summary.tex
-rw-r--r-- 1 user1 internal     95 Mar  4 04:56 stdio-op-counts.dat
-rw-r--r-- 1 user1 internal    112 Mar  4 04:56 posix-op-counts.dat
-rw-r--r-- 1 user1 internal    223 Mar  4 04:56 posix-access-hist.dat
-rw-r--r-- 1 user1 internal    819 Mar  4 04:56 pattern-eps.gplt
-rw-r--r-- 1 user1 internal     95 Mar  4 04:56 pattern.dat
-rw-r--r-- 1 user1 internal    792 Mar  4 04:56 op-counts-eps.gplt
-rw-r--r-- 1 user1 internal  27636 Mar  4 04:56 op-counts.eps
-rw-r--r-- 1 user1 internal    212 Mar  4 04:56 job-table.tex
-rw-r--r-- 1 user1 internal    413 Mar  4 04:56 fs-data-table.tex
-rw-r--r-- 1 user1 internal    428 Mar  4 04:56 file-count-table.tex
-rw-r--r-- 1 user1 internal    697 Mar  4 04:56 file-access-table.tex
-rw-r--r-- 1 user1 internal    445 Mar  4 04:56 access-table.tex
-rw-r--r-- 1 user1 internal   7572 Mar  4 04:56 posix-access-hist.pdf
-rw-r--r-- 1 user1 internal  28907 Mar  4 04:56 posix-access-hist.eps
-rw-r--r-- 1 user1 internal   7505 Mar  4 04:56 op-counts.pdf
-rw-r--r-- 1 user1 internal  33986 Mar  4 04:56 file-access-write.eps
-rw-r--r-- 1 user1 internal  26953 Mar  4 04:56 file-access-shared.eps
-rw-r--r-- 1 user1 internal  38817 Mar  4 04:56 file-access-read.pdf
-rw-r--r-- 1 user1 internal 144600 Mar  4 04:56 file-access-read.eps
-rw-r--r-- 1 user1 internal  26920 Mar  4 04:56 pattern.eps
-rw-r--r-- 1 user1 internal   9091 Mar  4 04:56 file-access-write.pdf
-rw-r--r-- 1 user1 internal   7123 Mar  4 04:56 file-access-shared.pdf
-rw-r--r-- 1 user1 internal      0 Mar  4 04:56 summary.pdf
-rw-r--r-- 1 user1 internal      0 Mar  4 04:56 summary.aux
-rw-r--r-- 1 user1 internal   7165 Mar  4 04:56 pattern.pdf
-rw-r--r-- 1 user1 internal   8192 Mar  4 04:57 summary.log
-rw-r--r-- 1 user1 internal   4223 Mar  4 04:57 latex.output

Could you please advise a fix for this issue? Is there any way to validate the trace files? Here is the file with which I get issues: https://github.com/puneet336/darshan_issues/blob/main/puneet_global_fcst_id322003-322003_3-3-84486-17473022816075744451.darshan

shanedsnyder commented 2 years ago

Hi,

The problem is that files with the .darshan_partial extension are incomplete and can confuse tools like darshan-job-summary -- these partial logs are created when some part of the Darshan shutdown procedure fails. Unfortunately, we have to assume partial logs like this are broken, since an aborted shutdown can leave the log in a corrupted state.

As a first step, that means we need to figure out why you cannot get the darshan-runtime library to shut down completely. Are you sure you don't see any warnings from Darshan on stderr when running the application? Is there any reason in particular you are using the ufs: prefix for DARSHAN_LOGPATH? We have sometimes recommended that when we know there is a faulty MPI-IO file system driver, but what happens if you don't use that prefix? You could also try setting the environment variable DARSHAN_LOGHINTS="" to see if that helps. Otherwise, we'll need to better diagnose how things are failing at runtime in Darshan -- in my experience, when .darshan_partial logs are generated there has always been some clue to what failed in a Darshan warning on stderr, but it's possible we are missing something.
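
To make those two suggestions concrete, something along these lines in the job script might be worth trying (the paths and launch line are placeholders; adjust to your setup):

# drop the ufs: prefix and pass empty MPI-IO hints when Darshan writes the log
export DARSHAN_LOGPATH=/scratch/DARSHAN_LOGS1/
export DARSHAN_LOGHINTS=""
mpirun -np 64 ./mpi_application

As for validating a log after the fact, one quick check is whether darshan-parser can read it at all, e.g.:

darshan-parser /scratch/DARSHAN_LOGS1/your_log_file.darshan > /dev/null && echo "log parses OK"

(your_log_file.darshan is a placeholder for the actual log name; a corrupted or partial log will typically make darshan-parser error out.)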

puneet336 commented 2 years ago

If I don't use ufs:, then I get a 0-byte log file along with the following message:

grep -i warning slurm-12342.out | grep -v "OMP:"
darshan_library_warning: unable to create log file /scratch/DARSHAN_LOGS1/puneet_global_fcst_id373796-373796_3-4-76302-14906523716168228325.darshan_partial

The same thing happens with DARSHAN_LOGPATH without ufs: combined with the DARSHAN_LOGHINTS suggestion.

shanedsnyder commented 2 years ago

OK, so without ufs: Darshan fails to even create the log file, but with ufs: the log file is created fine; it just never gets properly finalized (meaning Darshan is somehow failing at some other shutdown step).

I'm really not sure what to suggest as a workaround; it seems there is at least some issue with the MPI-IO driver being used, as MPI_File_open won't even successfully create a file on that file system.
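
If you want to test that theory independently of Darshan, one option (not something the Darshan tools provide, just a generic suggestion) is to run a small MPI-IO benchmark such as IOR against the same scratch directory and see whether it can create a file through the MPIIO backend; the process count and sizes below are arbitrary:

# hedged example: exercise MPI-IO file creation on the scratch filesystem via IOR's MPIIO backend
mpirun -np 4 ior -a MPIIO -w -b 1m -t 1m -o /scratch/DARSHAN_LOGS1/ior_mpiio_test

If that also fails to create the file, the problem is likely in the MPI-IO driver or file system rather than in Darshan itself.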

Could you provide more details on the system you're using? What MPI implementation/version? Is this a Lustre scratch file system where you're trying to store the logs? These sorts of issues are a real struggle for our team to debug, as they typically depend on specific MPI or Lustre versions that we don't have access to for testing. Ideally, we could just get access to the system to debug ourselves, but that's obviously not usually practical.