Open puneet336 opened 2 years ago
Hi,
The problem is that files with a .darshan_partial extension are incomplete and can confuse tools like darshan-job-summary. These partial logs are created when some part of the Darshan shutdown procedure fails, leaving the log in an invalid state. Unfortunately, we have to assume partial logs like this are broken, since an aborted shutdown procedure can leave the log corrupted.
That means the first step is to figure out why you cannot get the darshan-runtime library to shut down completely. Are you sure you don't see any warnings from Darshan on stderr when running the application? Any reason in particular you are using the ufs: prefix for DARSHAN_LOGPATH? We have sometimes recommended that when we know there is a faulty MPI-IO file system driver, but what happens if you don't use that prefix? You could also try setting the environment variable DARSHAN_LOGHINTS="" to see if that helps. Otherwise, we'll need to better diagnose how things are failing at runtime in Darshan -- in my experience, when .darshan_partial logs are generated there has always been some clue to what failed in a Darshan warning on stderr, but it's possible we are missing something.
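A minimal sketch of the two experiments suggested above, as they might appear in a batch script before the application launch. The log path and the commented-out mpirun line are placeholders, not taken from the original job:

```shell
# Experiment 1: drop the ufs: prefix from the log path (placeholder path).
export DARSHAN_LOGPATH=/scratch/DARSHAN_LOGS1

# Experiment 2: clear any MPI-IO hints Darshan would otherwise pass
# when opening its log file.
export DARSHAN_LOGHINTS=""

# Confirm what the application will actually see.
echo "DARSHAN_LOGPATH=$DARSHAN_LOGPATH"
echo "DARSHAN_LOGHINTS=$DARSHAN_LOGHINTS"

# mpirun -n 64 ./my_app   # hypothetical launch; then check stderr for warnings
```

Running the experiments separately (only one change at a time) makes it easier to tell which setting actually affects the failure.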
If I don't use ufs:, then I get a 0-byte log file along with the following message -

$ grep -i warning slurm-12342.out | grep -v "OMP:"
darshan_library_warning: unable to create log file /scratch/DARSHAN_LOGS1/puneet_global_fcst_id373796-373796_3-4-76302-14906523716168228325.darshan_partial.

Same result with DARSHAN_LOGPATH without ufs: plus DARSHAN_LOGHINTS set.
OK, so without the ufs: prefix Darshan fails to even create the log file, but with ufs: the log file is created fine -- it just never gets properly finalized (meaning Darshan is failing at some other shutdown step).
I'm really not sure what to suggest as a workaround. It seems there is at least some issue with the MPI-IO driver being used, as MPI_File_open won't even successfully create a file on that file system.
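To narrow this down independently of Darshan, one option is a tiny standalone MPI-IO reproducer that just tries MPI_File_open on the scratch path. The sketch below writes such a test program and prints the build/run commands; the scratch path, the mpicc wrapper, and the rank count are assumptions, not details from the original job:

```shell
# Generate a minimal MPI-IO reproducer (path below is a placeholder).
cat > mpiio_test.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_File fh;
    MPI_Init(&argc, &argv);
    /* Try to create a file the same way Darshan's shutdown would. */
    int rc = MPI_File_open(MPI_COMM_WORLD,
                           "/scratch/DARSHAN_LOGS1/mpiio_test.out",
                           MPI_MODE_CREATE | MPI_MODE_WRONLY,
                           MPI_INFO_NULL, &fh);
    if (rc == MPI_SUCCESS) {
        MPI_File_close(&fh);
        printf("MPI_File_open succeeded\n");
    } else {
        printf("MPI_File_open failed with error code %d\n", rc);
    }
    MPI_Finalize();
    return 0;
}
EOF

echo "compile with: mpicc mpiio_test.c -o mpiio_test"
echo "run with:     mpirun -n 2 ./mpiio_test"
```

If this reproducer also fails to create the file, the problem is in the MPI-IO driver or file system configuration rather than in Darshan itself.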
Could you provide more details on the system you're using? What MPI implementation/version? Is this a Lustre scratch file system you're trying to store the logs on? These sorts of issues are a real struggle for our team to debug, as they typically depend on specific MPI or Lustre versions that we don't have access to test against. Ideally, we could just get access to the system to debug ourselves, but that's obviously not usually practical.
Hi, I am using Darshan v3.3.1 to analyze an MPI application.
For small codes like http://pramodkumbhar.com/2020/03/i-o-performance-analysis-with-darshan/ I got a trace file with a .darshan extension of size ~4K, and I was able to generate reports (PDF format).
But with a larger code base, which does a lot of I/O, the Darshan tool was unable to write the trace file in the working directory, so I had to set DARSHAN_LOGPATH to a different file system as DARSHAN_LOGPATH=ufs:/scratch/DARSHAN_LOGS1/. Afterwards, I got a
.darshan_partial file of size ~1.3MB at the end of the simulation. I renamed the file extension to .darshan. I have attempted multiple runs, but I still get .darshan_partial files even though the simulation terminated without error messages. Afterwards, I tried the following command -
darshan-job-summary.pl file.darshan
The process has been running for the last hour without any output.
I noticed the pdflatex process is in a sleep state -
2874399 puneet 20 0 124728 43824 1804 S 0.0 0.0 0:00.27 pdflatex
with the following content in the temp directory - could you please advise a fix for this issue? Is there any process to validate the trace files? Here is the file with which I get issues - https://github.com/puneet336/darshan_issues/blob/main/puneet_global_fcst_id322003-322003_3-3-84486-17473022816075744451.darshan
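On validating trace files: one way to sanity-check a log before handing it to darshan-job-summary.pl is to run it through darshan-parser, which fails on truncated or corrupted logs. A minimal sketch, where "file.darshan" is a placeholder for the renamed log:

```shell
# Sketch: validate a Darshan log before generating the PDF summary.
log="file.darshan"   # placeholder name for the renamed log

if ! command -v darshan-parser >/dev/null 2>&1; then
    result="darshan-parser not found on PATH"
elif darshan-parser "$log" >/dev/null 2>&1; then
    result="log parses cleanly"
else
    result="log is corrupted or truncated"
fi
echo "$result"
```

A renamed .darshan_partial will typically fail this check, which would also explain darshan-job-summary.pl (and its pdflatex step) hanging on the file.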