QuantEcon / project.lecture-mojo

Some Real World Examples using Mojo (Modular)
https://quantecon.github.io/project.lecture-mojo/

Add Example on Inventory Dynamics, Matrix Multiplication, Transpose, and Random Matrix Generation #7

Closed HumphreyYang closed 9 months ago

HumphreyYang commented 11 months ago

Hi @kp992,

I did a quick refactoring of the code related to matrix operations and added a few more operations so that other examples can reuse them.

I tested the exact code on my own Linux machine with an i7-13700HX, but the throughput (GFLOP/s) is lower than what is reported in their documentation:

matrix A:
[[0.1315377950668335, 0.458650141954422],
[0.21895918250083923, 0.67886471748352051]]

matrix A.T:
[[0.1315377950668335, 0.21895918250083923],
[0.458650141954422, 0.67886471748352051]]

matrix A @ A.T:
[[0.22766214609146118, 0.34016281366348267],
[0.34016281366348267, 0.5088004469871521]]

Completed naive matmul in  550.57744700000001 ms
0.098078844845964061 GFLOP/s
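
For reference, the GFLOP/s figure is just total floating-point operations divided by runtime; a minimal Python sketch (the function name is mine, and the numbers above are consistent with a 300x300 benchmark):

# Sketch: how a GFLOP/s figure like the one above is derived.
# An n x n matmul performs 2 * n**3 floating-point operations
# (n multiplies plus n adds for each of the n**2 output entries).
def gflops(n, seconds):
    return 2 * n**3 / seconds / 1e9

print(gflops(300, 550.577447 / 1000))  # ~0.098 GFLOP/s, matching the run above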

Could you please try this on your hardware when you have time?

Many thanks in advance.

kp992 commented 11 months ago

Hi @HumphreyYang, thanks for refactoring. Here are the results from my machine:

matrix A:
[[0.1315377950668335, 0.458650141954422],
[0.21895918250083923, 0.67886471748352051]]

matrix A.T:
[[0.1315377950668335, 0.21895918250083923],
[0.458650141954422, 0.67886471748352051]]

matrix A @ A.T:
[[0.22766214609146118, 0.34016281366348267],
[0.34016281366348267, 0.5088004469871521]]

Completed naive matmul in  1939.5247790000001 ms
0.027841871671184251 GFLOP/s

Machine info:

model name      : Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
stepping        : 13
microcode       : 0xf4
cpu MHz         : 2400.000

HumphreyYang commented 11 months ago

Many thanks @kp992 for helping.

Hi @mmcky, could you please merge this PR? It includes an example of matrix multiplication and an example of random normal matrix generation, which will be helpful for converting other lectures.

Many thanks in advance.

mmcky commented 11 months ago

thanks @HumphreyYang -- how does it compare with numpy etc?

HumphreyYang commented 11 months ago

Hi @mmcky,

Sorry for the delay. My server connection went down after my last push for some reason :(

I will copy over the results once I get home and restart the server.

HumphreyYang commented 11 months ago

Hi @mmcky,

Here are the results for matrix multiplication of X and X.T with size 512x512:

JAX on Nvidia T4:

JAX takes 0.677983599999834 ms
395.93207859314845 GFLOP/s

Numpy on 4 cores of AMD EPYC 7763 64-Core Processor:

Numpy takes 4.855720940000197 ms
55.28230705943104 GFLOP/s

Mojo's average performance on 4 cores of AMD EPYC 7763 64-Core Processor, using both parallelization and vectorization:

Mojo takes 8.0141819999999999 ms
33.573925860702865 GFLOP/s

The discrepancies across experiments are very large: occasionally the performance goes up to around 160 GFLOP/s. I would be inclined to conclude that Mojo is faster than NumPy but less efficient than JAX, which is very strong in matrix computation. The best performance in their notebook is 180.14895626059092 GFLOP/s, which still does not beat JAX.
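
Something like the following sketch reproduces the NumPy and JAX measurements above (not the exact benchmark script; the warm-up and best-of-N choices are my assumptions):

# Sketch of a NumPy/JAX timing comparison for X @ X.T with n = 512.
import timeit
import numpy as np
import jax
import jax.numpy as jnp

n = 512
x = np.random.randn(n, n).astype(np.float32)

# NumPy: take the best of several repeated runs
t_np = min(timeit.repeat(lambda: x @ x.T, number=10, repeat=5)) / 10
print(f"Numpy takes {t_np * 1e3:.4f} ms, {2 * n**3 / t_np / 1e9:.2f} GFLOP/s")

# JAX: compile once, then time with block_until_ready so the
# asynchronous dispatch does not hide the actual computation
f = jax.jit(lambda a: a @ a.T)
xj = jnp.asarray(x)
f(xj).block_until_ready()  # warm-up / compilation
t_jx = min(timeit.repeat(lambda: f(xj).block_until_ready(), number=10, repeat=5)) / 10
print(f"JAX takes {t_jx * 1e3:.4f} ms, {2 * n**3 / t_jx / 1e9:.2f} GFLOP/s")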

Should we merge this refactored code as an example?

mmcky commented 11 months ago

Thanks @HumphreyYang, I am keen to merge this example. Just trying to understand the results.

The discrepancies across experiments are very large.

I wonder if this is due to the size of the problem... do the time discrepancies persist as the problem gets bigger?
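
e.g. a quick sweep over sizes, something like this sketch (NumPy side only), would tell us:

# Sketch: sweep the problem size to see whether the gap persists as n grows.
import timeit
import numpy as np

for n in (256, 512, 1024, 2048):
    x = np.random.randn(n, n)
    t = min(timeit.repeat(lambda: x @ x.T, number=5, repeat=3)) / 5
    print(f"n={n:5d}: {t * 1e3:9.2f} ms, {2 * n**3 / t / 1e9:7.2f} GFLOP/s")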

mmcky commented 11 months ago

@HumphreyYang have you seen this

https://docs.modular.com/mojo/notebooks/Matmul.html

Might be some useful info in this doc

HumphreyYang commented 11 months ago

Hi @mmcky

@HumphreyYang have you seen this

https://docs.modular.com/mojo/notebooks/Matmul.html

Might be some useful info in this doc

The matrix multiplication uses the algorithm listed in that notebook, with parallelization and vectorization (link). Therefore, I am also surprised that the result is not close to what they report. I think it might be related to the underlying hardware.
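
For intuition, here is a rough Python analogue of that strategy (illustration only, not the Mojo code: parallelize over the rows of C, and keep the innermost loop vectorized):

# Conceptual analogue of the notebook's parallelize + vectorize strategy.
# Python threads are limited by the GIL, whereas Mojo runs true parallel
# SIMD loops, so this only illustrates the structure of the algorithm.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def matmul_row(C, A, B, i):
    # "Vectorized" inner loop: update a whole row of C at once.
    for k in range(A.shape[1]):
        C[i, :] += A[i, k] * B[k, :]

def matmul_parallel(A, B, workers=4):
    C = np.zeros((A.shape[0], B.shape[1]))
    # Parallelize over the rows of C, as the Mojo example does.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(lambda i: matmul_row(C, A, B, i), range(A.shape[0])))
    return C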

HumphreyYang commented 11 months ago

Hi @mmcky,

Another reason why NumPy and JAX are faster at matrix multiplication is their use of well-optimized routines, such as BLAS. Mojo can be faster than interpreted Python at the language level, but it is not yet capable (or at least not publicly capable) of calling these well-optimized routines.
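
(You can check which BLAS implementation a NumPy build is linked against:)

# Show the BLAS/LAPACK libraries this NumPy build uses, e.g. OpenBLAS or MKL.
import numpy as np
np.show_config()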

I think we should leave this example as it is (given it is the most optimized version Mojo provides) and use it to support implementations of other examples.

mmcky commented 11 months ago

Therefore, I am also surprised that the result is not close to what they report

@HumphreyYang let's catch up and chat about this. It doesn't make sense to me that our results differ from the ones on the website in terms of improvements.

HumphreyYang commented 11 months ago

@HumphreyYang let's catch up and chat about this. It doesn't make sense to me that our results differ from the ones on the website in terms of improvements.

Looking forward to our meeting! One reason is that they are comparing against a naive pure-Python implementation (i.e., a nested for-loop implementation), whereas we are comparing against NumPy and JAX implementations.
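
Their baseline is roughly the following pure-Python triple loop (a sketch; their exact code may differ slightly), which is orders of magnitude slower than BLAS-backed NumPy:

# Naive pure-Python matmul of the kind the Modular notebook benchmarks against.
def matmul_python(C, A, B):
    for m in range(len(C)):
        for k in range(len(A[0])):
            for n in range(len(C[0])):
                C[m][n] += A[m][k] * B[k][n]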

mmcky commented 9 months ago

@HumphreyYang merging this but let's keep the discussion going.