ROCm / rdc

RDC
https://rocm.docs.amd.com/projects/rdc/en/latest/
MIT License
23 stars 11 forks source link

[Issue]: RDC libraries try to dlopen missing library, and read xml file from wrong path #35

Open morrone opened 3 weeks ago

morrone commented 3 weeks ago

Problem Description

Note that we are installing ROCm 6.2.1 in the path:

/op/rocm-6.2.1/

The path includes the version to allow multiple versions of ROCm to be installed at the same time.

I have an application that uses RDC, and the RDC libraries are throwing the following ERRORs (which appear to be more like warnings, because things continue to run):

1730927059.242 ERROR RdcModuleMgrImpl.cc(65): Failed to insert module: N3amd3rdc9RdcRVSLibE
Fail to open librdc_rvs.so: librdc_rvs.so: cannot open shared object file: No such file or directory
1730927059.295 ERROR RdcRocpLib.cc(208): failed to open ROCP_METRICS: /opt/rocm/libexec/rocprofiler/counters/derived_counters.xml
1730927059.295 ERROR RdcRocpLib.cc(46): Rocp related function will not work.
1730927059.295 ERROR RdcModuleMgrImpl.cc(65): Failed to insert module: N3amd3rdc10RdcRocpLibE
rocprofiler path could not be set

When I search the /opt/rocm-6.2.1 tree for "rvs", the only libraries that I find with that string are the following:

/opt/rocm-6.2.1/lib/librvslib.so
/opt/rocm-6.2.1/lib/librvslib.so.0
/opt/rocm-6.2.1/lib/librvslib.so.0.0.60201

I have no idea if librvslib is a different name for librdc_rvs.so, or something else entirely.

In any event, there does not seem to be a "librdc_rvs.so" anywhere in the tree.

Next, the derived_counters.xml file is found in our install tree here:

/opt/rocm-6.2.1/libexec/rocprofiler/counters/derived_counters.xml

Does RDC have the path improperly hard coded?

Operating System

RHEL8

CPU

Irrelevant

GPU

Irrelevant

ROCm Version

ROCm 6.2.1

ROCm Component

rdc

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

morrone commented 3 weeks ago

It looks to me in like int he top-level CMakefile of the RDC code we have this:

# When cmake -DBUILD_RVS=off, it will not build the librdc_rvs.so
# which requires the RocmValidationSuite
option(BUILD_RVS "Build targets for librdc_rvs.so" OFF)

So the RDC code really shouldn't be trying to load a library when it should know that the library doesn't exist. And if the library wasn't build, the most we should expect to see is warning that the "rvs" support does not exist. But is the "RocmValidationSuite" something that we would care about in normal production use of AMD GPUs?

Would we miss out on any of the monitoring fields in enum rdc_field_t when rvs support is missing?

harkgill-amd commented 3 weeks ago

You are correct, librdc_rvs.so should not be loaded when the default is set to not build it. This is fixed in the amd-staging branch with https://github.com/ROCm/rdc/commit/660c5afaf49630781c1059ba6d30bae21743c32f.

For the error related to derived_counters.xml, it looks like the ROCM_PATH environment variable is used to locate your ROCm installation. /opt/rocm is the default if the variable is not set (ref). Could you please try setting the variable and check if the error is resolved?

morrone commented 3 weeks ago

For the error related to derived_counters.xml, it looks like the ROCM_PATH environment variable is used to locate your ROCm installation. /opt/rocm is the default if the variable is not set (ref). Could you please try setting the variable and check if the error is resolved?

That default needs to be fixed. The default should be derived from the configured install target path at build time.

harkgill-amd commented 2 weeks ago

Could you provide more information on the environment you're running on as well as the output of stat /proc/self/maps? /opt/rocm is only defaulted to if ROCM_PATH isn't set and /proc/self/maps cannot be opened or librocprofiler64.so is not found in /proc/self/maps.

morrone commented 2 weeks ago

Could you provide more information on the environment you're running on as well as the output of stat /proc/self/maps? /opt/rocm is only defaulted to if ROCM_PATH isn't set and /proc/self/maps cannot be opened or librocprofiler64.so is not found in /proc/self/maps.

/proc/self/maps can be opened, and it does not contain librocprofiler64.so. But there is no reason to believe that my shell's "maps" file would contain librocprofiler64.so. That doesn't really tell us anything.

You might be overthinking things. There is no additional system information needed to understand the problem.

We have ROCm version 6.2.1 installed in:

/opt/rocm-6.2.1

Therefore, /opt/rocm is the incorrect default path for when ROCM_PATH is not set. For our choosen base installation path, the correct default is /opt/rocm-6.2.1 when ROCM_PATH is not set.

The code is making an incorrect assumption that everyone installs ROCm in exactly the same place.

zichguan-amd commented 2 weeks ago

Hi @morrone, does the error occur when you specify ROCM_PATH to your ROCm installation directory? If you have multiple versions of ROCm installed, you need to specify which ROCm version to use with ROCM_PATH. The default /opt/rocm path symlinks to the latest ROCm installed, you can view and change your default ROCm version with sudo update-alternatives --config rocm (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html).

morrone commented 2 weeks ago

Hi @morrone, does the error occur when you specify ROCM_PATH to your ROCm installation directory? If you have multiple versions of ROCm installed, you need to specify which ROCm version to use with ROCM_PATH. The default /opt/rocm path symlinks to the latest ROCm installed, you can view and change your default ROCm version with sudo update-alternatives --config rocm (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/post-install.html).

We don't have an /opt/rocm symlink.

Settting ROCM_PATH would probably work, but setting a ROCM_PATH environment variable is somewhat challenging for something that runs as a daemon out of systemd. It means we need to synchronize the runtime configuration with knowledge of how things were compiled. It seems like effort that is kicked down the road from ROCm not handing that.

zichguan-amd commented 1 week ago

You can add environment variables to the environment file at /opt/rocm-6.2.1/etc/rdc_options. For example

# Append 'rdc' daemon parameters here
RDC_OPTS="-u -d"
ROCM_PATH=/opt/rocm-6.2.1
#RDC_OPTS="-p 50051 -u -d"
morrone commented 1 week ago

You can add environment variables to the environment file at /opt/rocm-6.2.1/etc/rdc_options. For example

The rdc library will know how to find /opt/rocm-6.2.1/etc/rdc_options? But it can't find /opt/rocm-6.2.1 in other places in the code?

zichguan-amd commented 1 week ago

You should have a service file in your ROCm directory at /opt/rocm-6.2.1/libexec/rdc/rdc.service, which sets the environment file. By default, it should point to /opt/rocm-6.2.1/etc/rdc_options, see this line https://github.com/ROCm/rdc/blob/a0f72904286950a9060d51404decc04b042cb85d/server/rdc.service.in#L16 and docs here: https://github.com/ROCm/rdc?tab=readme-ov-file#start-rdcd-using-systemd.

morrone commented 1 week ago

That is not really relevant here. I'm not talking about the rdc service or rdc.service file. I am talking about using the rdc libraries from a service that is not part of ROCm. Just using the rdc API. In particular, we are using the ldmsd monitoring daemon which uses librdc_bootstrap in embedded mode.

On Wed, Nov 20, 2024 at 12:07 PM zichguan-amd @.***> wrote:

You should have a service file in your ROCm directory at /opt/rocm-6.2.1/libexec/rdc/rdc.service, which sets the environment file. By default, it should point to /opt/rocm-6.2.1/etc/rdc_options, see this line

https://github.com/ROCm/rdc/blob/a0f72904286950a9060d51404decc04b042cb85d/server/rdc.service.in#L16 and docs here: https://github.com/ROCm/rdc?tab=readme-ov-file#start-rdcd-using-systemd.

— Reply to this email directly, view it on GitHub https://github.com/ROCm/rdc/issues/35, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABLRYA7DPDYIJLDBTFTZQ32BTTY3AVCNFSM6AAAAABRJXQRR6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZGQ2DSMJSGY . You are receiving this because you were mentioned.Message ID: @.***>

zichguan-amd commented 1 week ago

Then you need to figure out how to set environment variable for ldmsd. Judging from ldms documentation, it already relies on environment variables: https://ovis-hpc-personal.readthedocs.io/projects/ldms/en/latest/ldms-quickstart.html#basic-configuration-and-running. If it is started with systemd then there would've been a ldms.service file that you can configure similar to how it's done with rdc.service.

morrone commented 1 week ago

Then you need to figure out how to set environment variable for ldmsd.

Pretty much the only sane way to handle this is to dynamically generate the service file at compile time with the fixed path in it. And embedding the correct path at compile time is what ROCm (at this time) seems to be refusing to do.

So basically ROCm is forcing me to do the same thing that ROCm refuses to do.

dmitrii-galantsev commented 1 week ago

whoops didn't see this issue until now. Hopefully we can figure something out by EOY.

dmitrii-galantsev commented 1 week ago

@morrone I think you're right and setting it to install path is the path forward. I'll address in next couple weeks due to PTO conflicts.

dmitrii-galantsev commented 1 week ago

and yes, trying to load modules when compiled without the support for them is naive: https://github.com/ROCm/rdc/blob/d8fec06bab31cff09f30aaebb0355a07a8bf8cc8/rdc_libs/rdc/src/RdcModuleMgrImpl.cc#L91

It would be much better to write something less headache inducing that uses compile definitions.
I've not been great at supporting non-standard build configurations (read: whatever we have in our CI build scripts).

If you have suggestions on that front - I'm all ears!

zichguan-amd commented 6 days ago

@morrone we can change the default ROCm path to be the path set by the cmake flag -DROCM_DIR at build time, would that resolve your concerns?

morrone commented 1 day ago

@morrone we can change the default ROCm path to be the path set by the cmake flag -DROCM_DIR at build time, would that resolve your concerns?

Assuming that "-DROCM_DIR" is what the ROCm folks use when they build the standard rpms, then yes, that sounds like the right thing to me.