Native ARM Mac GPU usage (metal performance shaders)

tylerjereddy commented 1 year ago

Since Cirrus CI offers some native ARM Mac (M-series chip) services, I was wondering if there might be some documentation/examples/options for using the GPU component (i.e., the Metal Performance Shaders) when testing with, e.g., torch, which has an mps backend: https://pytorch.org/docs/stable/notes/mps.html

I did a little experiment here: https://github.com/tylerjereddy/scipy/pull/71

And found that there may be some restrictions that prevent practical usage in the open source tier: RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 1.70 GB). Tried to allocate 0 bytes on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
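A test suite could guard against this failure with something like the sketch below. It assumes that `torch.backends.mps.is_available()` alone may not be a sufficient check in a VM, so it also attempts a tiny allocation and falls back to the CPU if that raises the `RuntimeError` quoted above. The helper name `pick_device` is made up for illustration, and this has not been verified on Cirrus CI.

```python
import importlib.util


def pick_device() -> str:
    """Return "mps" only if a tiny MPS allocation succeeds, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch is not installed, so there is nothing to probe
    import torch

    mps = getattr(torch.backends, "mps", None)
    if mps is None or not mps.is_available():
        return "cpu"
    try:
        # Assumption: the out-of-memory RuntimeError surfaces on first allocation.
        torch.ones(1, device="mps")
    except RuntimeError:
        return "cpu"
    return "mps"


print(pick_device())
```

Tests could then create tensors with `device=pick_device()` and skip the GPU-only cases when it returns "cpu".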

Do you have any experience/guidance here? Is this expected? Is this disabled and you don't want us trying it? It would be very cool to be able to flush through GPUs in CI like that!

fkorotkov commented 1 year ago

I ran a PyTorch example inside a Tart VM, and indeed MPS does not seem to be supported yet by the underlying Virtualization.framework. Hopefully there will be some news at WWDC in two weeks. 🤞

tylerjereddy commented 1 year ago

Thanks, this would be pretty cool!

fkorotkov commented 1 year ago

After a little more investigation, it seems that Virtualization.framework should support Metal. It's mentioned in last year's WWDC video at 10:53. There is even ParavirtualizedGraphics.framework, which predates Virtualization.framework and which allegedly should use it.

But in my testing I don't see any graphics devices inside the VM:

Compared to what I see on an M1 Mac mini:

@edigaryev I know you dug into the private APIs of Virtualization.framework. Have you maybe seen any mentions of Metal?

edigaryev commented 1 year ago

@fkorotkov the paravirtualization actually seems to be used:

You can also check this by running ioreg inside of a VM:

% ioreg -n AppleParavirtGPU -r
+-o AppleParavirtGPU  <class AppleParavirtGPU, id 0x100000191, registered, matched, active, busy 0 (1 ms), retain 13>
  | {
  |   "IOClass" = "AppleParavirtGPU"
  |   "KDebugVersion" = 4294967296
  |   "IOPersonalityPublisher" = "com.apple.driver.AppleParavirtGPUIOGPUFamily"
  |   "IOMatchedAtBoot" = Yes
  |   "IOReportLegendPublic" = Yes
  |   "AGCInfo" = {"fLastSubmissionPID"=134,"fSubmissionsSinceLastCheck"=0,"fBusyCount"=0}
  |   "IOProviderClass" = "AppleARMIODevice"
  |   "MetalPluginName" = "AppleParavirtGPUMetalIOGPUFamily"
  |   "IOProbeScore" = 0
  |   "SurfaceList" = ()
  |   "IONameMatch" = "paravirtualizedgraphics,gpu"
  |   "MetalPluginClassName" = "AppleParavirtDevice"
  |   "SchedulerState" = {"Stamps"=(),"BusyWorkQueues"=()}
  |   "CFBundleIdentifierKernel" = "com.apple.driver.AppleParavirtGPUIOGPUFamily"
  |   "IOMatchCategory" = "IOAcceleratorES"
  |   "CFBundleIdentifier" = "com.apple.driver.AppleParavirtGPUIOGPUFamily"
  |   "IONameMatched" = "paravirtualizedgraphics,gpu"
  |   "PerformanceStatistics" = {"recoveryCount"=0,"In use system memory"=108962304,"Alloc system memory"=52527104}
  |   "IOGeneralInterest" = "IOCommand is not serializable"
  |   "IOReportLegend" = ({"IOReportChannels"=((1,6442450945,"Alloc system memory"),(2,6442450945,"In use system memory"),(3,6442450945,"GPU Restart Count")),"IOReportGroupName"="Internal Statistics","IOReportChan$
  |   "DisplayPortCount" = 1
  | }
  | 
  +-o AppleParavirtDisplay  <class AppleParavirtDisplay, id 0x1000001df, registered, matched, active, busy 0 (0 ms), retain 9>
  | +-o IOMobileFramebufferUserClient  <class IOMobileFramebufferUserClient, id 0x100000285, !registered, !matched, active, busy 0, retain 5>
  | +-o IOMobileFramebufferUserClient  <class IOMobileFramebufferUserClient, id 0x100000286, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x100000294, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x100000353, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x10000035a, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x10000035d, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x10000036a, !registered, !matched, active, busy 0, retain 5>
  +-o AppleParavirtDeviceUserClient  <class AppleParavirtDeviceUserClient, id 0x1000003fa, !registered, !matched, active, busy 0, retain 5>

I'm not sure as to why Apple’s Metal Performance Shaders don't work, though.
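The ioreg check above can also be scripted, for example to log at the start of a CI job whether the paravirtualized GPU is present. This is a rough sketch: the helper names are made up for illustration, and the parsing assumes the `"MetalPluginName" = "..."` line format shown in the output above.

```python
import shutil
import subprocess
from typing import Optional


def metal_plugin_name(ioreg_output: str) -> Optional[str]:
    """Extract the "MetalPluginName" value from ioreg text output, if present."""
    for line in ioreg_output.splitlines():
        if '"MetalPluginName"' in line and "=" in line:
            return line.split("=", 1)[1].strip().strip('"')
    return None


def query_paravirt_gpu() -> Optional[str]:
    """Run the same ioreg query as above; None if ioreg or the device is missing."""
    if shutil.which("ioreg") is None:
        return None  # not on macOS
    result = subprocess.run(
        ["ioreg", "-n", "AppleParavirtGPU", "-r"],
        capture_output=True,
        text=True,
        check=False,
    )
    return metal_plugin_name(result.stdout)


sample = '  |   "MetalPluginName" = "AppleParavirtGPUMetalIOGPUFamily"'
print(metal_plugin_name(sample))
print(query_paravirt_gpu())
```

A non-None result only shows that the paravirtualized device is registered; as noted above, it does not mean Metal Performance Shaders workloads will run.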

tylerjereddy commented 1 year ago

Perhaps @Developer-Ecosystem-Engineering might be able to (informally) point us in the right direction? I know they've been quite helpful with NumPy low-level development on M-series chips.

gluefox commented 6 months ago

I am running into the same issue as well.

Developer-Ecosystem-Engineering commented 6 months ago

Running these types of workloads under Virtualization.framework is currently not supported.

We understand the request!

cirruslabs / tart

Native ARM Mac GPU usage (metal performance shaders) #501