apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

Puzzling performance CPU/GPU #88

Open gedwald opened 3 years ago

gedwald commented 3 years ago

I am running an ML notebook via Jupyter with Apple's TF 2.4 alpha v0.1 and standard NumPy. The model is a simple Sequential TF-Keras model with around 7M parameters.

with "device_name='CPU'" the epoch times are around 90 seconds, with "device_name='GPU'" the epoch times are around 600 seconds. Everything else in the notebook is fixed. Both runs from zero, that is, load data, start with nothing.

Interestingly, the GPU run gives a validation accuracy of ~57%, while the CPU run gives only ~40%; that is more of a difference than I observe across multiple separate CPU-only runs. In the GPU case, the GPU history monitor shows around 85% utilization the whole time (~15 hours). The M1's case stays at room temperature, and no fan can be heard.

anna-tikhonova commented 3 years ago

Thank you very much for reporting this! Could you please provide a reproducible test case, so we can investigate locally?

gedwald commented 3 years ago

Hello, yes I can do that, happy to.

I have a simple notebook with a model (that doesn’t do anything useful) and input data files (numpy binaries) that are probably unnecessarily large.

How would you want me to provide this?

The files are too large for my mail service, over 160 MB as-is. I could try to reduce the data size, but I don’t know if the problem is size-dependent.

regards t

gedwald commented 3 years ago

Dear TF friends

Attached is a notebook (as PDF) demonstrating my GPU/CPU problem.

I think that for your purposes, it may be faster to generate your own input data than to figure out how to get mine. However, the original notebook and the data files are available if you want them. The data is not meaningful and isn’t covered by any non-disclosure clauses.

The files are:

trnx - 10,000 rows of 2048 numbers, originally integers, read from CSV and dumped as float32 .npy
trny - 10,000-element 1/0 labels, shape [10000,]
tstx - 5,000 rows of 2048 numbers, originally integers, read from CSV, dumped as float32 .npy
tsty - 5,000-element 1/0 labels

For the purposes of this CPU/GPU test, I don’t think the values are important; you can assign random integers to trnx and tstx in the range [0, 10000], and assign random 0/1 labels to the -y files/arrays.
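
In case it helps, here is a minimal sketch of synthetic stand-in data matching the shapes described above (the array names mirror the file names; the values are random, per the note above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the original .npy files, matching the described shapes.
trnx = rng.integers(0, 10_001, size=(10_000, 2048)).astype(np.float32)
trny = rng.integers(0, 2, size=(10_000,)).astype(np.float32)
tstx = rng.integers(0, 10_001, size=(5_000, 2048)).astype(np.float32)
tsty = rng.integers(0, 2, size=(5_000,)).astype(np.float32)

np.save("trnx.npy", trnx)  # and likewise for the others, if files are needed
```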

In all my tests I get CPU:GPU time ratios between 1:7 and 1:8.
With the original data volume (around 140K rows) I get: CPU 81 s avg; GPU 629 s avg. With 10K rows, I get: CPU 6 s; GPU 47 s.

Intermediate data volumes (e.g. 20K rows) show similar ratios.

In all runs, the GPU utilisation rate of process “python” is close to 88%.
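
Since the original notebook isn't attached inline, here is a hypothetical stand-in for the benchmark, using the synthetic arrays from the sketch above: a plain Dense stack sized to land in the ballpark of the ~7M parameters mentioned earlier, with a rough per-epoch timer (the real model may differ):

```python
import time
import tensorflow as tf

# Hypothetical model: a Dense stack over the 2048-wide inputs,
# roughly matching the reported parameter count.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2048, activation="relu", input_shape=(2048,)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

epochs = 3
start = time.time()
model.fit(trnx, trny, validation_data=(tstx, tsty), epochs=epochs, batch_size=32)
print(f"avg epoch time: {(time.time() - start) / epochs:.1f} s")
```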

Please let me know if I can provide anything more, would love to assist.

Regards and Happy holidays

Tryggvi Edwald

mflis commented 3 years ago

Hi, I can observe a similar problem. In my case the GPU is roughly 2 times slower than the CPU. Interestingly, the standard version of TensorFlow on CPU is a bit faster than the version provided by this repo (but that could be explained by the warnings about layers that could not be processed by AutoGraph). Here's a link to my executable example: https://github.com/mflis/gpu_experiments. Please let me know if you would like more information about my setup, or if I should perform more experiments.

gedwald commented 3 years ago

(in reference to my original post "Puzzling Performance ..."):

I should have mentioned that while the GPU version is running, swap space grows greatly. During a CPU run, swap stays constant at a couple hundred MB, but in the GPU run (exactly the same setup as the CPU run) it increases to 2.4 GB (for that particular model). Is it possible the GPU version doesn't understand the unified memory of the M1 and re-copies the data, wasting both time and swap space?

Best regards and Happy New Year! T. Edwald
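
One way to quantify this from inside the notebook: a small hypothetical Keras callback using the third-party psutil package to log swap usage per epoch, so CPU and GPU runs can be compared directly:

```python
import psutil  # third-party: pip install psutil
import tensorflow as tf

class SwapLogger(tf.keras.callbacks.Callback):
    """Print swap usage after every epoch."""
    def on_epoch_end(self, epoch, logs=None):
        swap = psutil.swap_memory()
        print(f"epoch {epoch}: swap used = {swap.used / 2**30:.2f} GiB")

# Usage: model.fit(..., callbacks=[SwapLogger()])
```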

rkraghu88 commented 3 years ago

Hi folks, happy new year! I am facing some peculiar issues on my MBA M1 16GB with TensorFlow and Dense networks (I'm running a deep RL algorithm). I have listed a few below:

1. Whenever I set mlcompute to 'cpu', I get large training errors. GPU occupancy is also at 90%; don't know why!
2. When I shift it to 'gpu', it runs a bit slower.
3. When I run two instances of the same program in parallel (in 'gpu' mode), the execution time of each drops by about 75% (what?! 😳). GPU usage goes to 100%. Also, when I do this, I see 4 high-efficiency cores active (instead of one when I run a single instance), so I guess the CPU resources are also being shared during training.

Also, I have never seen my Python TF programs use any of the high-performance cores (whatever the temperature). Any input from the TF developers?

cigoic commented 3 years ago

Somehow, my M1's Neural Engine stays calm while using TensorFlow...😂

How can I get the most out of the M1 with TensorFlow? Are there any official guides? Thanks.

beraekoklu commented 3 years ago

Yes, it is really slow. I would expect Apple's GPU and ML engine to perform much better than Kaggle's systems, with or without Kaggle's GPU.

gedwald commented 3 years ago

Rc0 alpha.1 works the same :-(

stephanboner commented 3 years ago

For me it's exactly the same as for @mflis, sadly, using a 2014 MBP with an i7 and a GT 750M.

beraekoklu commented 3 years ago

I don't really mind the difference between GPU and CPU, but I think this isn't very well optimized for M1 Macs, as the hardware stays cool and doesn't seem to be doing much work.

cigoic commented 3 years ago

Could the performance issue be related to the lower-level ML framework?

I just followed a tutorial using CreateML and ended up with poor performance as well (only 1/4 CPU and 1/2 GPU utilization).

CreateML: 121,444-image Microsoft COCO object detection dataset

(Screenshot: CreateML training run, 2021-01-08)

anna-tikhonova commented 3 years ago

@gedwald Thank you very much for providing a reproducible test case. We will investigate locally and report back.