Open morrone opened 3 weeks ago
It looks to me in like int he top-level CMakefile of the RDC code we have this:
# When cmake -DBUILD_RVS=off, it will not build the librdc_rvs.so
# which requires the RocmValidationSuite
option(BUILD_RVS "Build targets for librdc_rvs.so" OFF)
So the RDC code really shouldn't be trying to load a library when it should know that the library doesn't exist. And if the library wasn't build, the most we should expect to see is warning that the "rvs" support does not exist. But is the "RocmValidationSuite" something that we would care about in normal production use of AMD GPUs?
Would we miss out on any of the monitoring fields in enum rdc_field_t when rvs support is missing?
You are correct, librdc_rvs.so
should not be loaded when the default is set to not build it. This is fixed in the amd-staging branch with https://github.com/ROCm/rdc/commit/660c5afaf49630781c1059ba6d30bae21743c32f.
For the error related to derived_counters.xml, it looks like the ROCM_PATH
environment variable is used to locate your ROCm installation. /opt/rocm
is the default if the variable is not set (ref). Could you please try setting the variable and check if the error is resolved?
For the error related to derived_counters.xml, it looks like the
ROCM_PATH
environment variable is used to locate your ROCm installation./opt/rocm
is the default if the variable is not set (ref). Could you please try setting the variable and check if the error is resolved?
That default needs to be fixed. The default should be derived from the configured install target path at build time.
Could you provide more information on the environment you're running on as well as the output of stat /proc/self/maps
? /opt/rocm
is only defaulted to if ROCM_PATH
isn't set and /proc/self/maps
cannot be opened or librocprofiler64.so is not found in /proc/self/maps
.
Could you provide more information on the environment you're running on as well as the output of
stat /proc/self/maps
?/opt/rocm
is only defaulted to ifROCM_PATH
isn't set and/proc/self/maps
cannot be opened or librocprofiler64.so is not found in/proc/self/maps
.
/proc/self/maps can be opened, and it does not contain librocprofiler64.so. But there is no reason to believe that my shell's "maps" file would contain librocprofiler64.so. That doesn't really tell us anything.
You might be overthinking things. There is no additional system information needed to understand the problem.
We have ROCm version 6.2.1 installed in:
/opt/rocm-6.2.1
Therefore, /opt/rocm is the incorrect default path for when ROCM_PATH is not set. For our choosen base installation path, the correct default is /opt/rocm-6.2.1
when ROCM_PATH is not set.
The code is making an incorrect assumption that everyone installs ROCm in exactly the same place.
Hi @morrone, does the error occur when you specify ROCM_PATH
to your ROCm installation directory? If you have multiple versions of ROCm installed, you need to specify which ROCm version to use with ROCM_PATH
. The default /opt/rocm
path symlinks to the latest ROCm installed, you can view and change your default ROCm version with sudo update-alternatives --config rocm
(https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html).
Hi @morrone, does the error occur when you specify
ROCM_PATH
to your ROCm installation directory? If you have multiple versions of ROCm installed, you need to specify which ROCm version to use withROCM_PATH
. The default/opt/rocm
path symlinks to the latest ROCm installed, you can view and change your default ROCm version withsudo update-alternatives --config rocm
(https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html).
We don't have an /opt/rocm symlink.
Settting ROCM_PATH would probably work, but setting a ROCM_PATH environment variable is somewhat challenging for something that runs as a daemon out of systemd. It means we need to synchronize the runtime configuration with knowledge of how things were compiled. It seems like effort that is kicked down the road from ROCm not handing that.
You can add environment variables to the environment file at /opt/rocm-6.2.1/etc/rdc_options
. For example
# Append 'rdc' daemon parameters here
RDC_OPTS="-u -d"
ROCM_PATH=/opt/rocm-6.2.1
#RDC_OPTS="-p 50051 -u -d"
You can add environment variables to the environment file at
/opt/rocm-6.2.1/etc/rdc_options
. For example
The rdc library will know how to find /opt/rocm-6.2.1/etc/rdc_options? But it can't find /opt/rocm-6.2.1 in other places in the code?
You should have a service file in your ROCm directory at /opt/rocm-6.2.1/libexec/rdc/rdc.service
, which sets the environment file. By default, it should point to /opt/rocm-6.2.1/etc/rdc_options
, see this line
https://github.com/ROCm/rdc/blob/a0f72904286950a9060d51404decc04b042cb85d/server/rdc.service.in#L16 and docs here: https://github.com/ROCm/rdc?tab=readme-ov-file#start-rdcd-using-systemd.
That is not really relevant here. I'm not talking about the rdc service or rdc.service file. I am talking about using the rdc libraries from a service that is not part of ROCm. Just using the rdc API. In particular, we are using the ldmsd monitoring daemon which uses librdc_bootstrap in embedded mode.
On Wed, Nov 20, 2024 at 12:07 PM zichguan-amd @.***> wrote:
You should have a service file in your ROCm directory at /opt/rocm-6.2.1/libexec/rdc/rdc.service, which sets the environment file. By default, it should point to /opt/rocm-6.2.1/etc/rdc_options, see this line
https://github.com/ROCm/rdc/blob/a0f72904286950a9060d51404decc04b042cb85d/server/rdc.service.in#L16 and docs here: https://github.com/ROCm/rdc?tab=readme-ov-file#start-rdcd-using-systemd.
— Reply to this email directly, view it on GitHub https://github.com/ROCm/rdc/issues/35, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABLRYA7DPDYIJLDBTFTZQ32BTTY3AVCNFSM6AAAAABRJXQRR6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZGQ2DSMJSGY . You are receiving this because you were mentioned.Message ID: @.***>
Then you need to figure out how to set environment variable for ldmsd. Judging from ldms documentation, it already relies on environment variables: https://ovis-hpc-personal.readthedocs.io/projects/ldms/en/latest/ldms-quickstart.html#basic-configuration-and-running. If it is started with systemd
then there would've been a ldms.service
file that you can configure similar to how it's done with rdc.service
.
Then you need to figure out how to set environment variable for ldmsd.
Pretty much the only sane way to handle this is to dynamically generate the service file at compile time with the fixed path in it. And embedding the correct path at compile time is what ROCm (at this time) seems to be refusing to do.
So basically ROCm is forcing me to do the same thing that ROCm refuses to do.
whoops didn't see this issue until now. Hopefully we can figure something out by EOY.
@morrone I think you're right and setting it to install path is the path forward. I'll address in next couple weeks due to PTO conflicts.
and yes, trying to load modules when compiled without the support for them is naive: https://github.com/ROCm/rdc/blob/d8fec06bab31cff09f30aaebb0355a07a8bf8cc8/rdc_libs/rdc/src/RdcModuleMgrImpl.cc#L91
It would be much better to write something less headache inducing that uses compile definitions.
I've not been great at supporting non-standard build configurations (read: whatever we have in our CI build scripts).
If you have suggestions on that front - I'm all ears!
@morrone we can change the default ROCm path to be the path set by the cmake flag -DROCM_DIR
at build time, would that resolve your concerns?
@morrone we can change the default ROCm path to be the path set by the cmake flag
-DROCM_DIR
at build time, would that resolve your concerns?
Assuming that "-DROCM_DIR" is what the ROCm folks use when they build the standard rpms, then yes, that sounds like the right thing to me.
Problem Description
Note that we are installing ROCm 6.2.1 in the path:
/op/rocm-6.2.1/
The path includes the version to allow multiple versions of ROCm to be installed at the same time.
I have an application that uses RDC, and the RDC libraries are throwing the following ERRORs (which appear to be more like warnings, because things continue to run):
When I search the /opt/rocm-6.2.1 tree for "rvs", the only libraries that I find with that string are the following:
I have no idea if librvslib is a different name for librdc_rvs.so, or something else entirely.
In any event, there does not seem to be a "librdc_rvs.so" anywhere in the tree.
Next, the derived_counters.xml file is found in our install tree here:
Does RDC have the path improperly hard coded?
Operating System
RHEL8
CPU
Irrelevant
GPU
Irrelevant
ROCm Version
ROCm 6.2.1
ROCm Component
rdc
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response