GPUOpen-Tools / gpu_performance_api

GPU Performance API for AMD GPUs
MIT License
250 stars 46 forks source link

Treat Samsung GPU same as AMD GPU. #65

Closed jcortell68 closed 2 years ago

jcortell68 commented 2 years ago

The Samsung GPU (Xclipse) is an AMD RDNA2 derivative but has a Samsung PCI vendor ID. As such, the GPA code needs to check both for AMD and Samsung. With this change and similar vendor ID checks in RenderDoc, RenderDoc is able to capture AMD perf counters when targeting an Xclipse GPU (requires bundling libGPUPerfAPIVK.so in the replayer server APK.)

jcortell68 commented 2 years ago

Sounds good. I made the changes you requested.

jcortell68 commented 2 years ago
  1. Today, Samsung only has one GPU--the Xclipse GPU. So, for now, the answer is no,.
  2. I don't know. I only saw the GPA code check for vendor ID. I'll take a closer look and see if I can answer this question.
  3. No. It is generally working for us as is, but I can't of course rule out additional changes if we discover a need for them.
jcortell68 commented 2 years ago

So., I've confirmed that device ID for the Samsung GPU is the same as one of the device IDs used by AMD, and that explains why I'm able to see AMD perf counters with my change. The question remains as to whether that ID is appropriate--whether it gives us counters that are all in fact relevant to the Samsung GPU. That will take some deeper digging. I'll report back

PLohrmannAMD commented 2 years ago

Okay, Thanks for confirming. With regards to whether the Device ID is appropriate and to verify whether the results being returned are expected, there are two things to confirm:

1) The Samsung driver is exposing the VK_AMD_gpa_interface extension according to the header available here, particularly the order of the VkGpaPerfBlockAMD enum: https://github.amd.com/Developer-Solutions/gpu_performance_api/blob/amd-main/source/third_party/AmdVkExt/vk_amd_gpa_interface.h

2) Within those blocks, the counter names at these locations match with the driver / hardware specs: https://github.com/GPUOpen-Tools/gpu_performance_api/blob/master/source/auto_generated/gpu_perf_api_counter_generator/gpa_hw_exposed_counters_gfx103.cc

I assume that the first item is already done, since you say that it seems to be working. The second is probably already done, as Xclipse is based on RDNA2 hardware. So I think you should be all set, but if you start to notice unexpected results, this might be an area to investigate.

jcortell68 commented 2 years ago
  1. For sure the mainline Samsung driver code exposes VK_AMD_gpa_interface. It didn't initially, but I've since made the required change to the driver. What this means is that GPA won't work for Xclipse GPUs currently in the field, lest the device gets updated to a firmware build that has the change.
  2. Thanks. I'll keep that in mind as I dig further into this device ID matter.
jcortell68 commented 1 year ago

@PLohrmannAMD It took a while, but I finally completed my investigation. Looks like I'll have to back up and undo these changes and replace them with changes that treats the Samsung GPU not as an AMD GPU but as its own thing with its own sets of perf counter metadata. For what it's worth, some of the perf counter definitions for AMD's device ID 0x73a0 are relevant/correct for Samsung's GPU (which also has device ID 0x73a0). However, a good number of them aren't and so the GPA library is currently giving out wrong information for all those. No need to back out this change immediately--it doesn't break any existing functionality, just provide wrong new functionality. I'll be working on a new submission and will hopefully have something soon.

PLohrmannAMD commented 1 year ago

Okay, Thank you for the update.

jcortell68 commented 1 year ago

@PLohrmannAMD The gpa_hw_exposed_counters_gfx103.cc you referenced is auto-generated, as told by the following comment in the file:

This file is autogenerated by the ConvertHWEnums project.

I can't find anything about ConvertHWEnums. Can you please point me in the right direction?

PLohrmannAMD commented 1 year ago

You'll have to mimic the format of this file and some other auto-generated ones manually. ConvertHwEnums is an internal tool we have to convert between different internal and public formats.

Other related files are:

I acknowledge (and have experienced) the enormous headache involved in manually replicating the files above. This should become easier in the future...

You'll likely need to update the PublicCounterDefinitions as well (maybe use a new section tag for [VKSX] (ie; Vulkan Samsung Xclipse):

And Update the PubliCounterCompiler to know about "SX" as a new GPUFamily to parse for:

You'll have to update some previously auto-generated .cmake files to recognize the new files you've created, then run the PublicCounterCompiler to generate some additional files. Manually include these where you see of reference the existing ones, and you should be all reconnected.

PLohrmannAMD commented 1 year ago

@PLohrmannAMD It took a while, but I finally completed my investigation. Looks like I'll have to back up and undo these changes and replace them with changes that treats the Samsung GPU not as an AMD GPU but as its own thing with its own sets of perf counter metadata. For what it's worth, some of the perf counter definitions for AMD's device ID 0x73a0 are relevant/correct for Samsung's GPU (which also has device ID 0x73a0). However, a good number of them aren't and so the GPA library is currently giving out wrong information for all those. No need to back out this change immediately--it doesn't break any existing functionality, just provide wrong new functionality. I'll be working on a new submission and will hopefully have something soon.

I was thinking we may want to put these recent changes into a separate branch on this repo and revert main back to the previous public release. We can merge that branch back in once you've confirmed everything is working.

If you don't get it resolved before we do our next public release, I'll do the updates as part of our release process.

jcortell68 commented 1 year ago

Thanks for the details. I'll start digging in.

As for putting the changes on a branch, I would just go ahead and revert the commit, period. There is nothing magical that needs to be kept there. Most of that has to be replaced with new logic

jcortell68 commented 1 year ago

Peter, the Build.md talks about running the PublicCounterCompiler tool, but doesn't mention how to build the tool. I need the tool to generate the C++ files for the Samsung GPU I'm adding to GPA. Can you provide some insight here? Presumably the tool has to be built on Windows. I tried going into the source\public_counter_compiler directory and running cmake from there, but I end up with errors. Not sure I'm going about this correctly.

PLohrmannAMD commented 1 year ago

Unfortunately it is currently written as a C# project, so it does need to be compiled and run in Windows. We are in the process of rewriting it in python so that it is available to all platforms and doesn't need to be compiled at all. This won't be completed in time for our next release, but hopefully soon after.

jcortell68 commented 1 year ago

so it does need to be compiled and run in Windows

In the meantime, can you provide guidance on that? This is effectively what I was asking for.

PLohrmannAMD commented 1 year ago

Ah. If you run 'python build\pre_build.py' on a Windows system it will generate a Visual Studio solution file (in cmake_bld\GPUPerfAPI.sln) which includes the PublicCounterCompiler project. Through Visual Studio you can build and run it.

I do realize this means you'll need to have Visual Studio, and the C# components installed as part of it. If this is not available to you, feel free to push up a branch and I can generate the necessary files for you.

Also Thanks to your inquiry on this, we'll bump up priority of re-writing this project :-)

jcortell68 commented 1 year ago

Great. I actually have Visual Studio installed (maybe not the C# components but I can install those). I'll give it a shot, thanks.

jcortell68 commented 1 year ago

Is the order of block instances listed in kHwVkGroupsGfxNnn arbitrary, or does it need to honor an ordering elsewhere (e.g., hardware spec). I get that other fields, e.g., kHwVkExposedCountersByGroupGfxNnn are dependent on the ordering of kHwVkGroupsGfxNnn, but is kHwVkGroupsGfxNnn 's ordering itself dependent on something else?

PLohrmannAMD commented 1 year ago

We try to make the order of that array match the order that the driver exposes them so that we can more easily validate that the driver is exposing what GPA needs. The order from the driver may match hardware specs, or may be arbitrary.

jcortell68 commented 1 year ago

Sorry; one more follow-up. Can you point to where in the AMD public driver code it's exposing the groups? I don't work in the driver and fishing around may lead me down the wrong path.

PLohrmannAMD commented 1 year ago

For Vulkan, our driver exposes the blocks according to the "VK_AMD_gpa_interface" extension. We have a copy of the headers included within this very repo. See here: https://github.com/GPUOpen-Tools/gpu_performance_api/blob/master/source/third_party/AmdVkExt/vk_amd_gpa_interface.h#L60

jcortell68 commented 1 year ago

Peter, sorry but more questions. I'm finding that creating the files for a generation is anything but intuitive. Making good progress but need more insight.

In gpa_hw_counter_vk_gfxnnn.cc, there's a set of const ints at the bottom of the file that seem to have some magical numbers, starting with kHwVkGpuTimeBottomToBottomDurationIndexGfxnnn . Is there any documentation on how these should be defined. Looks like the six first fields are increasing in number but how to assign the first one? How about kHwVkTimestampBlockIdsGfxnnn and kHwVkExposedCountersGroupCountGfxnnn?

PLohrmannAMD commented 1 year ago

I very much agree.

GPA essentially has a giant array of all the possible hardware events that could be profiled. After all these hardware events, GPA tacks on these additional "GpuTime" counters that can also be referenced in the derived counter equations. They are handled within GPA code as special timestamp queries that are populated by the driver (also referred to as a Timing Pass within GPA).

You could simply choose to not support these, and just set them to -1 (or MAX_UINT).

But, let's simplify things and imagine that you only wanted to support 6 different hardware events. These would be assigned 0-based indices, so the first would have index 0 and the 6th would have index 5, and therefore the timing indexes that you're asking about would start at index 6.

As far as other numbers, off the top of my head: kHwVkTimestampBlockIdsGfxnnn would be +1 after the number of internally supported hardware blocks kHwVkExposedCountersGroupCountGfx103 would be +1 after the number of exposed hardware blocks

ie: looking at the files for GFX103 - we have 535 hardware blocks that could be supported, so these are blockIds 0..534, and therefore the TimestampBlockId gets block Id 535. And of those 535 possible blocks, we're actually referencing counters from 407 of them (including the Timestamp Block).

Happy to help!

BTW, I'll be reverting your previously merged changes soon, as we both agreed is the right thing to do.

jcortell68 commented 1 year ago

Thanks for that explanation. The problem is that there are so many arrays of things in the gpa_*files that it's not clear which array exactly you are referring to. Can you spell out what the giant array is ultimately composed of in terms of fields declared in gpa_hw_counter_vk_gfxNNN.h?

The kHwVkTimestampBlockIdsGfxNNN and kHwVkExposedCountersGroupCountGfxNNN explanations make sense to me. We're good there.

jcortell68 commented 1 year ago

I think I might have answered my own question. This "giant array" you speak of is probably driven by kHwVkGroupsGfxNNN, with the field GpaCounterGroupDesc::num_counters dictating how many counters (events) are in each such block instance. In other words, if kHwVkGroupsGfxNNN had five elements (block instances) not including GPUTime, and each one had num_counters set to 10, then kHwVkGpuTimeBottomToBottomDurationIndexGfxnnn would be 50.

And in fact, the values of both kHwVkGpuTimeBottomToBottomDurationIndexGfxNNN and kHwVkTimestampBlockIdsGfxNNN can both be found in the last entry of kHwVkExposedCountersByGroupGfxNNN, which is the GPUTime entry. I think this latter definition/description is the most accurate.

Finally, one more place where this "giant array" is clearly visible is in public_counter_names_vk_gfxNNN.txt.

PLohrmannAMD commented 1 year ago

FYI - I just reverted the two previous merges that you had submitted, effectively restoring 'master' back to the 3.11.1 version of the code. Since they are reverted merge commits, you'll need to rebase any branches that you have back onto 'master'. I apologize in advance for any added work that may be involved with this.

jcortell68 commented 1 year ago

No problem at all. With some luck, I'll have a pull request before long that will implement the Samsung GPU support correctly.

PLohrmannAMD commented 1 year ago

New update! We just pushed our latest changes to 'master' which directly changes some of the code that you're working in. It should simplify some of it though. In particular, some of those arrays got turned into std::vector's so they automatically have a size() without needing to keep track of the count of elements. We also removed all the placeholder counter names between the ones that we actually use in our public counter definitions (note that associated counter indices still remain the same).

Let me know if these new changes raise any additional questions.

jcortell68 commented 1 year ago

Oh, cool. I'll update and adjust my gpa file generator tool. Thanks for the heads up

jcortell68 commented 1 year ago

Something unexpected in the newly generated files. By example, kSdma0CountersGfx103 went from being an array with 128 entries to a std::vector with no entries. I get the switch from array to std::vector, but that there's now no entries where they used to be before...can you explain that?

PLohrmannAMD commented 1 year ago

That is correct. These variables were also renamed from kCountersGfx103 -> kExposedCountersGfx103. We now only include the counters that are actually being referenced by the counter definitions. Since we don't actually reference any counters from the SDMA block, there is no need to expose any of those counters. They previously had placeholder names anyway, and would never be referenced publicly, so there really was no need for them to be included at all.

Depending on how you wrote your file generator, you may be able to simply write a new public counter definition which uses something from SDMA (or other blocks), regenerate the code, and have the referenced counters appear in the list. We try to go through a pretty thorough level of validation on the low level hardware counters before we reference them publicly, although issues do still pop up, so care should be taken about what you actually decide to make available as part of your pull request.

jcortell68 commented 1 year ago

I'm not sure I'm following on this statement

These variables were also renamed from kCountersGfx103 -> kExposedCountersGfx103

From what I can see, there are still two compilation units to differentiate exposed vs non-exposed counters gpa_hw_counter_gfx103.cc|h gpa_hw_exposed_counters_gfx103.cc|h The only difference is that now, the former has empty std:vectors for any blocks that doesn't expose counters, whereas before it had placeholders. I guess I assumed the placeholders were critical to making all that magic work. With the current setup, I'm not sure what the point of having both compilation units. They seem somewhat redundant. gpa_hw_exposed_counters_gfx103 could just have empty vectors for any blocks that don't expose counters. But alas, that's neither here nor there. You pointed out my observation is by design, and I can easily adapt my gpa file generator to adhere to it.

jcortell68 commented 1 year ago

BTW, I just now saw what you meant earlier about the variable renaming. You meant that in the file

    gpa_hw_counter_vk_gfxNNN.h|cc

there used to be

   extern GpaHardwareCounterDesc*           kVkCounterGroupArrayGfxNNN[];
   extern GpaHardwareCounterDesc*           kVkExposedCountersGroupArrayGfxNNN[];

but now there's just the latter (and it's a std::vector)

PLohrmannAMD commented 1 year ago

I'm not sure I'm following on this statement

These variables were also renamed from kCountersGfx103 -> kExposedCountersGfx103

From what I can see, there are still two compilation units to differentiate exposed vs non-exposed counters gpa_hw_counter_gfx103.cc|h gpa_hw_exposed_counters_gfx103.cc|h The only difference is that now, the former has empty std:vectors for any blocks that doesn't expose counters, whereas before it had placeholders. I guess I assumed the placeholders were critical to making all that magic work. With the current setup, I'm not sure what the point of having both compilation units. They seem somewhat redundant. gpa_hw_exposed_counters_gfx103 could just have empty vectors for any blocks that don't expose counters. But alas, that's neither here nor there. You pointed out my observation is by design, and I can easily adapt my gpa file generator to adhere to it.

We hadn't realized it earlier, but noticed the redundancy as well during our push up to GitHub. I've already created a task for us to revisit this and look into cleaning it up further. Yes, previously there was a "global" array of all the counters - and it was an array, and we would index into it, thus they were "critical to making all the magic work". Now the vectors get put into a Map with the global counter index being the key, so we can look them up rather than index into it, and therefore no longer need all the placeholders.