karanveersingh5623 opened this issue 2 years ago
Hi Team, I installed the package below and tried again. The errors above are gone, but no file is generated.
dnf -y install texlive-collection-fontsrecommended
[root@c8-i16 ~]# darshan-job-summary.pl /darshan-logs/2022/8/17/root_python3_id549158-549158_8-17-52109-8519339879199717763_1.darshan --output pm1733_read.pdf
LaTeX generation (phase1) failed [256], aborting summary creation.
error log:
n.pdf>] (./summary.aux) ){/usr/share/texlive/texmf-dist/fonts/enc/dvips/base/8r
.enc}</usr/share/texlive/texmf-dist/fonts/type1/bitstrea/charter/bchb8a.pfb></u
sr/share/texlive/texmf-dist/fonts/type1/bitstrea/charter/bchr8a.pfb></usr/share
/texlive/texmf-dist/fonts/type1/bitstrea/charter/bchri8a.pfb></usr/share/texliv
e/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi10.pfb></usr/share/texlive/texm
f-dist/fonts/type1/public/amsfonts/cm/cmr10.pfb></usr/share/texlive/texmf-dist/
fonts/type1/public/amsfonts/cm/cmr12.pfb></usr/share/texlive/texmf-dist/fonts/t
ype1/public/amsfonts/cm/cmr8.pfb>
Output written on summary.pdf (5 pages, 109676 bytes).
Transcript written on summary.log.
Hi,
The error message in your update isn't super helpful (not your fault, just that this tool can be a little fragile with dependencies), but maybe the following package from apt-get could help: texlive-latex-extra?
I see you also have a PyDarshan-related question and wanted to mention that we are currently re-implementing darshan-job-summary.pl using PyDarshan. If you wanted to test that out, you would just run: python -m darshan summary <log_file_path>. It generates an HTML file that provides a summary report very similar to the darshan-job-summary.pl script.
@shanedsnyder , thanks for coming back.
I tried python -m darshan summary <log_file_path> and it works like a charm :) Just the error below pops up, which can be fixed by installing importlib_resources; you can check the trace below.
ModuleNotFoundError: No module named 'importlib_resources'
[root@c8-i16 darshan-util]# python3 -m darshan summary /darshan-logs/2022/8/17/root_python3_id568393-568393_8-17-59452-15910884328909540566_1.darshan
Traceback (most recent call last):
File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib64/python3.6/site-packages/darshan/__main__.py", line 3, in <module>
main()
File "/usr/local/lib64/python3.6/site-packages/darshan/cli/__init__.py", line 145, in main
mod = importlib.import_module('darshan.cli.{0}'.format(subcmd))
File "/usr/lib64/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/lib64/python3.6/site-packages/darshan/cli/summary.py", line 11, in <module>
import importlib_resources
ModuleNotFoundError: No module named 'importlib_resources'
[root@c8-i16 darshan-util]# pip install importlib-resources
Collecting importlib-resources
Downloading importlib_resources-5.4.0-py3-none-any.whl (28 kB)
Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.6/site-packages (from importlib-resources) (3.6.0)
Installing collected packages: importlib-resources
Successfully installed importlib-resources-5.4.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[root@c8-i16 darshan-util]# python3 -m darshan summary /darshan-logs/2022/8/17/root_python3_id568393-568393_8-17-59452-15910884328909540566_1.darshan
Report generated successfully.
Saving report at location: /root/darshan-3.4.0/darshan-util/root_python3_id568393-568393_8-17-59452-15910884328909540566_1_report.html
The need for importlib_resources is normal on older versions of Python. Python 3.6 is end-of-life at this point as well; not even security fixes are provided anymore.
@shanedsnyder , I am new to this tool, so I am trying to understand the trace outputs and will be sending some queries. Is there a Slack channel or mailing list where I can reach you? I am trying to benchmark NVMe PCIe Gen5 devices in an HPC environment using DLIO benchmarks (synthetic) and MLPerf HPC CosmoFlow datasets (real workload).
The need for importlib_resources is normal on older versions of Python. Python 3.6 is end-of-life at this point as well; not even security fixes are provided anymore.
Got it, thanks @tylerjereddy . So, as you know, I raised another issue about pydarshan running on Windows. The thing is, I am collecting traces from my Linux machines and trying to parse darshan outputs on my desktop, so how can I get the libdarshan-util.so dependency on my desktop running Anaconda Jupyter?
Not sure on Windows support yet, we can look into it a bit perhaps, but that will probably be a few days at least.
Have you used docker containers before? That's another option for getting access to Linux on Windows, though it has its own challenges/drawback of course.
Even easier for you might be to use gitpod, which gives you access to a containerized environment right in your browser. I know for other OSS projects we sometimes recommend this for getting up and running quickly if the local machine isn't ideal.
For example, if you go to: gitpod.io/#https://github.com/darshan-hpc/darshan
You should be able to build the project as if you are on a Linux machine and even use Jupyter notebooks right in the browser. Here is what I see if I go there and start the build process (an example build process from our CI is here: https://github.com/darshan-hpc/darshan/blob/main/.github/workflows/main_ci.yml#L32):
If you're open to trying that, we could probably help you get up and running much easier since it is basically just Linux support at that point.
@tylerjereddy , this is cool!! How can I run this on gitpod? --> https://github.com/darshan-hpc/darshan/blob/main/.github/workflows/main_ci.yml -- I see all the build steps mentioned in the yml file. I have installed the Jupyter notebook extension and will install pydarshan and check it once you tell me how I can use main_ci.yml in gitpod.
@shanedsnyder , I am new to this tool, so I am trying to understand the trace outputs and will be sending some queries. Is there a Slack channel or mailing list where I can reach you? I am trying to benchmark NVMe PCIe Gen5 devices in an HPC environment using DLIO benchmarks (synthetic) and MLPerf HPC CosmoFlow datasets (real workload).
Cool, thanks for the background on what you're up to! Please feel free to post issues/bugs here, but we also have a mailing list (https://lists.mcs.anl.gov/mailman/listinfo/darshan-users) that might be better for more general discussions. Please keep us posted if you have any issues or have feedback on our tools based on your experience.
@karanveersingh5623 You don't even need to manually install if you don't want to follow those steps -- you can just point gitpod at a repo that has your log files in it (assuming they are not confidential). For example, let's say I wanted to quickly demonstrate the darshan Python interface on a log file. I could do this:
1. Go to gitpod.io/#https://github.com/darshan-hpc/darshan-logs (or wherever the log files of interest are -- this is just an example location where we store some standard log files)
2. pip install darshan
3. python -m darshan summary darshan_logs/imbalanced_io/imbalanced-io.darshan
4. Download the generated report from the gitpod interface and inspect it locally in my browser
Hopefully that makes sense? If you need interactive IPython/Jupyter that should work well enough as well. Just save the results/download them so you don't lose them when you're done with the ephemeral gitpod instance. So it is effectively a temporary Linux-on-demand type thing.
That said, we should probably help you get up and running locally somehow as well. There's no substitute for the convenience of local work long-term.
Apparently it even keeps track of your changes so you can come back to the instance later after shutting it down:
That said, I'd probably advise downloading results and/or pushing them to a repo or something just in case.
@tylerjereddy , it helps a lot :) let me try.
@tylerjereddy , can we get the map/graph shared below from the DLIO benchmark paper? It shows each process's I/O and compute times.
they developed a holistic profiler that combines logs from deep learning frameworks, such as Tensorflow, together with I/O and storage systems logs to better understand the application behavior.
@karanveersingh5623 -- have you had a chance to generate the HTML summary reports I mentioned using PyDarshan? For Darshan logs that have DXT tracing data, or for logs generated by newer versions of Darshan (3.4.0+), those reports include a heatmap plot indicating read/write intensity across all processes over time. They summarize this activity currently for POSIX, MPI-IO, and STDIO interfaces. It seems like that type of plot is comparable to the types of plots you shared from that paper, but if you have specific feedback on things you think are missing, we could consider adding them to the report.
@shanedsnyder , below are the heat maps from running the DLIO benchmark (read workload with POSIX). I understood the heat map, time bins, and top edge bar, but I am confused about the right edge bar and what the colour contrast signifies -- is the I/O distribution across ranks different?
As a request: can you trace the timeline of I/O and compute, which would show that I/O and compute do not overlap in the application? (The highlighted text in the screenshot.)
@shanedsnyder , I enabled checkpointing in the DLIO benchmark to check different behaviour, but I am not able to generate data correctly (highlighted in the screenshot as well). The total runtime was 27 secs and data seems to be missing:
Module data incomplete due to runtime memory or record count limits
@tylerjereddy @shanedsnyder , how can we run the darshan tool in Kubernetes? Suppose I have my MPI-supported benchmark in one pod and the Darshan tool in another pod, or just spin up one pod with two containers, benchmark & darshan. Can you share a yml file where this has been done before?
@tylerjereddy @shanedsnyder , any update on above queries ?
@tylerjereddy @shanedsnyder , how can we run the darshan tool in Kubernetes? Suppose I have my MPI-supported benchmark in one pod and the Darshan tool in another pod, or just spin up one pod with two containers, benchmark & darshan. Can you share a yml file where this has been done before?
I don't think anyone on the team has Kubernetes+Darshan experience, so not sure how much help we can provide. Generally speaking, it's probably going to be easiest just to somehow ensure you can LD_PRELOAD the Darshan library when running the MPI benchmark container. Probably also simplifies things if your benchmark and Darshan libraries are built in the same environment/container.
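In the absence of a ready-made example, a minimal single-pod sketch along the lines of that advice might look like the following. This is hypothetical and untested: the image name, in-image library path, and hostPath are all assumptions, and the Darshan library is presumed to be baked into the benchmark image as discussed later in the thread.

```yaml
# Hypothetical sketch, not a known-good configuration.
apiVersion: v1
kind: Pod
metadata:
  name: darshan-benchmark
spec:
  restartPolicy: Never
  containers:
  - name: benchmark
    image: registry.example.com/dlio-benchmark:0.2   # assumed image name
    command: ["mpirun", "-n", "1", "python3", "src/dlio_benchmark.py"]
    env:
    - name: LD_PRELOAD                   # preload Darshan into the app
      value: /home/lib/libdarshan.so     # assumed in-image path
    - name: DARSHAN_ENABLE_NONMPI
      value: "1"
    volumeMounts:
    - name: darshan-logs
      mountPath: /darshan-logs           # where Darshan writes its logs
  volumes:
  - name: darshan-logs
    hostPath:
      path: /darshan-logs                # logs survive on the node
```

The key design point is the same as the LD_PRELOAD advice above: the library and the benchmark live in one container, so there is no cross-pod interception to arrange.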
For your question related to Darshan reporting incomplete module data, you'll need to provide a config file to Darshan at runtime telling it allocate more records for the POSIX module. There's notes on that here: https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html#_darshan_library_config_settings
The specific setting you want is MAX_RECORDS, so in your config file you could try something like this:
# allocate 5000 file records for POSIX and MPI-IO modules
# (darshan only allocates 1024 per-module by default)
MAX_RECORDS 5000 POSIX,MPI-IO
You might also have to bump Darshan's max memory, too, e.g.:
# bump up Darshan's default memory usage to 8 MiB
MODMEM 8
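Taken together, the two settings above can live in one config file that the Darshan library is pointed at via the DARSHAN_CONFIG_PATH environment variable (the variable used elsewhere in this thread); a minimal sketch:

```shell
# Write a combined Darshan runtime config file with both settings.
cat > darshan.conf <<'EOF'
# allocate 5000 file records for POSIX and MPI-IO modules
# (darshan only allocates 1024 per-module by default)
MAX_RECORDS 5000 POSIX,MPI-IO
# bump up Darshan's default memory usage to 8 MiB
MODMEM 8
EOF

# Point the Darshan library at it before launching the application.
export DARSHAN_CONFIG_PATH="$PWD/darshan.conf"
```

The file name darshan.conf is arbitrary; only the path in DARSHAN_CONFIG_PATH matters.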
@shanedsnyder , below are the heat maps from running the DLIO benchmark (read workload with POSIX). I understood the heat map, time bins, and top edge bar, but I am confused about the right edge bar and what the colour contrast signifies -- is the I/O distribution across ranks different?
The right bar graphs show you aggregate I/O over time for rank(s) -- at higher process counts you'd see more ranks aggregated together into single bars, but since you have only 4 processes, it is showing a bar per-process. In your examples, I/O balance across ranks looks basically identical.
The color bar at the far right just gives you the range of I/O volume intensities reflected in the cells of the heatmap image. So brighter red indicates higher intensity I/O, like you see at the very beginning of the second example you shared (and which is confirmed by the bars at the top of the heatmap summarizing I/O activity over all ranks over time).
As a request: can you trace the timeline of I/O and compute, which would show that I/O and compute do not overlap in the application? (The highlighted text in the screenshot.)
I'm not exactly sure of the details of what the figure you shared is showing, but the heatmap should indicate periods where no I/O activity is occurring (presumably, this is compute time). The heatmap is more detailed per-rank data than what you shared, but you should still be able to discern compute and I/O phases, like in this example (not the greatest, but you get the idea):
Darshan really can only detect when an application is performing I/O, it can't really provide much insight into the compute behavior of applications, so I'm not sure we can really mimic the figure you shared.
For your question related to Darshan reporting incomplete module data, you'll need to provide a config file to Darshan at runtime telling it allocate more records for the POSIX module. There's notes on that here: https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-runtime.html#_darshan_library_config_settings
The specific setting you want is MAX_RECORDS, so in your config file you could try something like this:
# allocate 5000 file records for POSIX and MPI-IO modules
# (darshan only allocates 1024 per-module by default)
MAX_RECORDS 5000 POSIX,MPI-IO
You might also have to bump Darshan's max memory, too, e.g.:
# bump up Darshan's default memory usage to 8 MiB
MODMEM 8
@shanedsnyder , I tried setting the variables at runtime and in the config file, but I am still facing the issue. Details below:
[root@c8-i16 ~]# export DARSHAN_MODMEM=8000
[root@c8-i16 ~]# mpirun -n 1 python3 /root/dlio_benchmark/dlio_benchmark.py -f tfrecord -fa multi -nf 1024 -sf 512 -rl 131072 -tc 64 -bs 8 -ts 2097152 -tr 8 -tc 8 -df /mnt/dlio/test1 -gd 1 -go 1 -k 1
2022-08-28 09:47:08.323060: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-08-28 09:47:08.323085: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-08-28 09:47:10.146676: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-08-28 09:47:10.146791: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-08-28 09:47:10.146801: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-08-28 09:47:10.146817: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (c8-i16): /proc/driver/nvidia/version does not exist
2022-08-28 09:47:10.147039: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-28 09:47:10.147156: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Generation done
Finalizing for rank 0
[root@c8-i16 ~]#
[root@c8-i16 ~]#
[root@c8-i16 ~]# echo $DARSHAN_CONFIG_PATH
/root/dlio_benchmark/darshan_config.cfg
[root@c8-i16 ~]#
[root@c8-i16 ~]#
[root@c8-i16 ~]# cat /root/dlio_benchmark/darshan_config.cfg
# allocate 5000 file records for POSIX and MPI-IO modules
# (darshan only allocates 1024 per-module by default)
MAX_RECORDS 5000 POSIX,MPI-IO,STDIO
# bump up Darshan's default memory usage to 8 MiB
MODMEM 8000
Below is the screenshot from the generated trace:
Hi @karanveersingh5623 , a few things I noticed from what you shared:
1. DARSHAN_MODMEM is in terms of MiB, so you are asking Darshan to pre-allocate 8 GiB of memory, which will probably be problematic. Maybe just try a value of 8 for starters (that's 8 total MiB for Darshan to use). Also, there's no need to set it both via env variable and via config file, so maybe just use one of those (and drop from 8000 to 8).
2. You can probably drop the MAX_RECORDS settings for the traditional modules (e.g., MAX_RECORDS 5000 POSIX) -- it doesn't appear they are hitting any limits and this will prevent them from wasting additional memory.
3. The incomplete data appears to be coming from the DXT modules, which have their own record limits, so try something like: MAX_RECORDS 2048 DXT_POSIX,DXT_MPIIO
If that's still not enough for DXT, you can keep bumping that value up -- you may need to bump MODMEM more, too, depending on how much memory DXT is really needing. Sorry, I wish it wasn't so trial and error, but generally speaking we have to keep Darshan's runtime memory footprint low since it's used in production so much, so by default we really don't try to use much.
@tylerjereddy @shanedsnyder , how can we run the darshan tool in Kubernetes? Suppose I have my MPI-supported benchmark in one pod and the Darshan tool in another pod, or just spin up one pod with two containers, benchmark & darshan. Can you share a yml file where this has been done before?
I don't think anyone on the team has Kubernetes+Darshan experience, so not sure how much help we can provide. Generally speaking, it's probably going to be easiest just to somehow ensure you can LD_PRELOAD the Darshan library when running the MPI benchmark container. Probably also simplifies things if your benchmark and Darshan libraries are built in the same environment/container.
I checked this: exporting LD_PRELOAD on the host machine and running the docker container, but no logs are getting collected.
@tylerjereddy @shanedsnyder , how can I run darshan with non-MPI applications like docker and start the darshan trace? The docker container will be running the MPI application within, and we need to capture those stats. Example below, which is not working. Please let me know if you need more information.
export DARSHAN_ENABLE_NONMPI=1 && export LD_PRELOAD=/usr/local/lib/libdarshan.so && docker run -d -v /mnt/dlio:/mnt --privileged 192.168.61.4:5000/dlio-benchmark:0.2 mpirun -n 1 python3 src/dlio_benchmark.py -f tfrecord -fa multi -nf 1024 -sf 512 -rl 128000 -tc 64 -bs 8 -ts 67108864 -df /mnt/test-pm9a3 -gd 1 -go 1 -k 1
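One likely reason the command above records nothing is that variables exported in the host shell are not inherited by processes inside the container. An untested sketch that forwards them explicitly with docker run -e instead, assuming libdarshan.so has been copied into the image (as in the Dockerfile shown later in this thread) and that the log directory is bind-mounted so the resulting .darshan file survives the container:

```
# Untested sketch: pass the Darshan variables into the container with -e
# rather than exporting them on the host shell.
docker run --rm \
  -e DARSHAN_ENABLE_NONMPI=1 \
  -e LD_PRELOAD=/home/libdarshan.so \
  -v /darshan-logs:/darshan-logs \
  192.168.61.4:5000/dlio-benchmark:0.2 \
  mpirun -n 1 python3 src/dlio_benchmark.py -f tfrecord -df /mnt/test-pm9a3 -gd 1 -go 1 -k 1
```

The in-image library path /home/libdarshan.so mirrors the Dockerfile below; adjust it to wherever the library actually lives in your image.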
@tylerjereddy @shanedsnyder Below is my dockerfile for the application
FROM centos:8
LABEL maintainer="karanv.singh@samsung.com"
RUN sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-Linux-*
RUN sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-Linux-*
RUN dnf update -y
RUN dnf install epel-release wget perl autoconf gcc-c++ git libtool make readline-devel python3 python3-pip python3-devel -y
RUN wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz
RUN gunzip -c openmpi-4.1.4.tar.gz | tar xf -
RUN cd openmpi-4.1.4/ \
&& ./configure --prefix=/usr/local \
&& make all install
ENV OMPI_ALLOW_RUN_AS_ROOT=1
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
RUN mpirun --version
ENV HOROVOD_WITH_MPI=1
RUN git clone https://github.com/hariharan-devarajan/dlio_benchmark.git
RUN python3 --version
RUN cd dlio_benchmark/ \
&& pip3 install --upgrade pip \
&& pip3 install -r requirements.txt
ENV PYTHONPATH=$PWD/src:$PYTHONPATH
ADD ./libdarshan.so /home/libdarshan.so
ENV DARSHAN_ENABLE_NONMPI=1
ENV LD_PRELOAD=/home/libdarshan.so
WORKDIR /dlio_benchmark
When running this container, the trace below pops up. We need the compiled libdarshan.so binary to use a user-supplied log location when any MPI or non-MPI process starts -- is that possible?
[root@c8-i16 ~]# docker run -it -v /mnt/dlio:/mnt --privileged 192.168.61.4:5000/dlio-benchmark:0.2 bash
darshan_library_warning: unable to create log file /darshan-logs/2022/10/4/0_sh_id7-7_10-4-22929-15734917999010224131.darshan_partial.
darshan_library_warning: unable to create log file /darshan-logs/2022/10/4/0_sh_id9-9_10-4-22929-6917508148529346260.darshan_partial.
darshan_library_warning: unable to create log file /darshan-logs/2022/10/4/0_sh_id11-11_10-4-22929-11479446199797349966.darshan_partial.
darshan_library_warning: unable to create log file /darshan-logs/2022/10/4/0_bash_id1-13_10-4-22929-9805030444188143473.darshan_partial.
darshan_library_warning: unable to create log file /darshan-logs/2022/10/4/0_bash_id1-15_10-4-22929-702491304219886891.darshan_partial.
@tylerjereddy @shanedsnyder
Added an ENV variable in the Dockerfile --> DARSHAN_LOGFILE:
ENV LD_PRELOAD=/home/libdarshan.so
ENV DARSHAN_LOGFILE=/home/test.darshan <--
WORKDIR /dlio_benchmark
Check the trace below; it's not able to create a log file. Do we need to run or add a binary like darshan-mk-log-dirs.pl so that libdarshan.so can create log files in the user-defined path?
[root@c8-i16 ~]# docker run -it -v /mnt/dlio:/mnt --privileged 192.168.61.4:5000/dlio-benchmark:0.2 bash
darshan_library_warning: unable to create log file /home/test.darshan.
darshan_library_warning: unable to create log file /home/test.darshan.
darshan_library_warning: unable to create log file /home/test.darshan.
darshan_library_warning: unable to create log file /home/test.darshan.
darshan_library_warning: unable to create log file /home/test.darshan.
@karanveersingh5623 I'm not a Docker user, but Darshan is trying to write to /home. The docker run probably has to explicitly bind /home in the image?
so docker run -v /home:/home (which assumes you did a mkdir /home in the container definition)
Otherwise, I think you have things correct. You could alternatively set DARSHAN_LOGFILE to land in /mnt.
It looks like you have built Darshan outside of the docker container and are copying the library inside of the container to instrument something running inside of it? I'm honestly not sure if/how that will work; it's not something I've ever done.
If Kevin's suggestion doesn't work, you could also maybe try setting DARSHAN_LOGPATH=cwd/test.darshan; then the Darshan library should generate it right where the app runs -- maybe that works and you can find a way to copy the file back out of the container?
Generally, I've encouraged folks to just build Darshan inside the container environment and then copy generated logs back out for analysis, it seems to be the easiest way to ensure you can get it working.
@shanedsnyder @kevin-harms , you guys are correct. I compiled darshan with a fixed log file location
./configure --with-log-path=/darshan-logs
and then used darshan-mk-log-dirs.pl to initialize the directory structure.
I should have used the --with-log-path-by-env option when compiling darshan; then the ENV DARSHAN_LOGFILE=/home/test.darshan would work in the dockerfile.
I tested without setting ENV DARSHAN_LOGFILE= in the dockerfile and it works like a charm; I just need to mount the directory structure created by darshan-mk-log-dirs.pl into docker. Below are the relevant lines:
# Copy libdarshan compiled binaries to docker container path
ADD ./lib /home/lib
ENV LD_PRELOAD=/home/lib/libdarshan.so
RUN mkdir -p /darshan-logs
#ENV DARSHAN_LOGFILE=/home/test.darshan
WORKDIR /dlio_benchmark
docker run -it -v /mnt/dlio:/mnt -v /darshan-logs:/darshan-logs --privileged 192.168.61.4:5000/dlio-benchmark:0.2 bash
Anyway, this is not what I wanted, but I will compile darshan again and share the outcome with you guys. You darshan guys are amazing; I like the way you share and care :)
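For reference, the rebuild described here might look roughly like the following Dockerfile fragment. This is a sketch only: the release URL and configure flags come from the darshan-runtime documentation, but the version, prefix, and the environment-variable name handed to --with-log-path-by-env (which is user-chosen) are assumptions, and build prerequisites such as zlib headers are presumed present in the image.

```
# Sketch: build darshan-runtime inside the image so the log location
# can be chosen at run time via an environment variable.
RUN wget https://ftp.mcs.anl.gov/pub/darshan/releases/darshan-3.4.0.tar.gz \
 && tar xzf darshan-3.4.0.tar.gz \
 && cd darshan-3.4.0/darshan-runtime \
 && ./configure --prefix=/usr/local \
                --with-log-path-by-env=DARSHAN_LOG_DIR_PATH \
                --with-jobid-env=NONE \
 && make && make install \
 && mkdir -p /darshan-logs
ENV DARSHAN_ENABLE_NONMPI=1
ENV LD_PRELOAD=/usr/local/lib/libdarshan.so
ENV DARSHAN_LOG_DIR_PATH=/darshan-logs
```

Building inside the image also sidesteps the glibc/toolchain mismatch risk of copying a host-built libdarshan.so into the container.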
@shanedsnyder @kevin-harms ..... I want to run darshan for just a specific Linux process, for example docker ls or watch -n. How can I achieve that?
By setting the LD_PRELOAD for the whole environment, you are catching everything. You could set LD_PRELOAD only when running the specific binary like:
LD_PRELOAD=/home/lib/libdarshan.so watch -n
If you're having trouble, maybe you can post the exact example you want to test.
@kevin-harms Below is what I tested after your suggestion; no trace was captured:
[root@node001 ~]# LD_PRELOAD=/usr/local/lib/libdarshan.so docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
192.168.61.4:5000/cosmoflow-nvidia 0.4 3c7b8389abe7 2 months ago 13.2GB
[root@node001 ~]# ll /darshan-logs/2022/11/24/
total 0
@karanveersingh5623 you probably need to additionally specify the DARSHAN_ENABLE_NONMPI=1 environment variable if you aren't already doing so in the last example you shared? It's not set explicitly on the command line, but perhaps you already set it in your environment?
Kevin's suggestion for setting LD_PRELOAD for specific executables you want to run might be enough to do what you want, but if not I wanted to mention Darshan config files to you. You should see details on a DARSHAN_APP_EXCLUDE environment variable in those docs you can set that contains regular expressions describing application names to exclude instrumentation for. So, for your use case, maybe something like this would work:
export DARSHAN_APP_EXCLUDE="^ls,^watch"
That would ignore any app name starting with ls or watch.
@shanedsnyder , I tried your options but still can't generate a trace.
[root@k8s-worker86 ~]# export LD_PRELOAD=/usr/local/lib/libdarshan.so
[root@k8s-worker86 ~]# ll /darshan-logs/2022/12/1/
total 0
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]# echo $DARSHAN_ENABLE_NONMPI
[root@k8s-worker86 ~]# export DARSHAN_ENABLE_NONMPI=1
[root@k8s-worker86 ~]# export DARSHAN_APP_EXCLUDE="^ls,^watch"
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]# ls /darshan-logs/2022/12/1/
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
192.168.61.4:5000/nvidia_dlrm_spark 0.1 509d81ee0c37 2 days ago 3.5GB
nvidia_dlrm_spark latest 509d81ee0c37 2 days ago 3.5GB
nvcr.io/nvidia/cuda 10.2-cudnn8-runtime-ubuntu18.04 e415c5458f31 4 weeks ago 1.88GB
192.168.61.4:5000/dlio-benchmark 0.2 f919d016343d 8 weeks ago 5.62GB
192.168.61.4:5000/nvidia_dlrm_tf 0.1 b1209aba2ca1 3 months ago 12.2GB
nvidia_dlrm_tf latest b1209aba2ca1 3 months ago 12.2GB
192.168.61.4:5000/dlio-benchmark 0.1 d4fea6a0ac73 3 months ago 5.55GB
192.168.61.4:5000/dxt-explorer 0.2 da42f110ce2c 3 months ago 2.34GB
192.168.61.4:5000/dxt-explorer 0.1 39b8d8b929f4 6 months ago 1.44GB
centos 8 5d0da3dc9764 14 months ago 231MB
[root@k8s-worker86 ~]# ls /darshan-logs/2022/12/1/
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]#
[root@k8s-worker86 ~]# LD_PRELOAD=/usr/local/lib/libdarshan.so docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
192.168.61.4:5000/nvidia_dlrm_spark 0.1 509d81ee0c37 2 days ago 3.5GB
nvidia_dlrm_spark latest 509d81ee0c37 2 days ago 3.5GB
nvcr.io/nvidia/cuda 10.2-cudnn8-runtime-ubuntu18.04 e415c5458f31 4 weeks ago 1.88GB
192.168.61.4:5000/dlio-benchmark 0.2 f919d016343d 8 weeks ago 5.62GB
192.168.61.4:5000/nvidia_dlrm_tf 0.1 b1209aba2ca1 3 months ago 12.2GB
nvidia_dlrm_tf latest b1209aba2ca1 3 months ago 12.2GB
192.168.61.4:5000/dlio-benchmark 0.1 d4fea6a0ac73 3 months ago 5.55GB
192.168.61.4:5000/dxt-explorer 0.2 da42f110ce2c 3 months ago 2.34GB
192.168.61.4:5000/dxt-explorer 0.1 39b8d8b929f4 6 months ago 1.44GB
centos 8 5d0da3dc9764 14 months ago 231MB
[root@k8s-worker86 ~]# ls /darshan-logs/2022/12/1/
I just confirmed the same locally when using non-MPI mode on docker images. I'm not sure what could be happening to cause no log being output. I'll try to have a closer look when I have some spare cycles.
@shanedsnyder , thanks for the update. The real need is that my Slurm master node does not have darshan installed; only the compute nodes have the darshan binaries. I am using enroot + pyxis with Slurm to run workloads from docker images on the compute nodes. How can I profile my workload using sbatch when the workloads run in enroot containers on the compute nodes? I have already tested docker with darshan and it works fine. How will sbatch/srun fired from the master node initiate a darshan trace in the enroot containers running as Slurm jobs on the compute nodes?
I just confirmed the same locally when using non-MPI mode on docker images. I'm not sure what could be happening to cause no log being output. I'll try to have a closer look when I have some spare cycles.
@shanedsnyder , anything I can get on this query? I want to profile an application that runs in a container inside a Slurm cluster.
Hi @karanveersingh5623, still not sure what the issue is but I was able to dig a little more.
In non-MPI mode, Darshan uses GNU constructor/destructor attributes (https://www.geeksforgeeks.org/__attribute__constructor-__attribute__destructor-syntaxes-c/) as a means to initialize and shut down the Darshan library, respectively. For whatever reason, whenever running Docker commands, I see that the destructor is never called.
I wrote a simple shared library that only defines a constructor/destructor function and then tried to LD_PRELOAD it while running docker commands. It never runs any of the code in the destructor.
#include <assert.h>
#include <stdio.h>

__attribute__((constructor)) void serial_init(void)
{
    fprintf(stderr, "constructor called\n");
    fflush(stderr);
    return;
}

__attribute__((destructor)) void serial_finalize(void)
{
    fprintf(stderr, "destructor called\n");
    fflush(stderr);
    assert(0);
    return;
}
If you LD_PRELOAD that library and run basically anything else (e.g., ls), it crashes at the end due to the assert, but not for docker commands.
I don't really know how to debug further off the top of my head, it might be easier to find an alternative way to use Darshan.
Is there a way you could build Darshan outside of the container, then bind mount it into the containers when they are launched, and then have the containers enable Darshan from there or something like that?
Hi Team
I am not able to generate a PDF file from the darshan trace log; please refer to the error trace.