bsc-quantic / Tenet.jl

Composable Tensor Network library in Julia
https://bsc-quantic.github.io/Tenet.jl/
Apache License 2.0

Try Dagger for distributed TN contraction #95

Closed · mofeing closed this 8 months ago

jofrevalles commented 1 year ago

This week I have been experimenting with the Dagger library to perform matrix multiplication. The key idea is to distribute the matrices over a set of workers so the result can be computed in a distributed fashion. In Dagger, this is done by splitting each matrix into chunks, and finding the right chunk size is the tricky part. For now I have tested Dagger on a single node of MareNostrum 4, using 1 core per worker.
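In short, the pattern looks like this (a minimal sketch of the idea; the full benchmark script is further below):

```julia
using Distributed
addprocs(8)  # e.g. one worker per core
@everywhere using Dagger

# Split each 4096×4096 matrix into 2048×2048 chunks (4 chunks per matrix)
# so that Dagger can schedule the block products on different workers.
A = rand(Blocks(2048, 2048), Float64, 4096, 4096)
B = rand(Blocks(2048, 2048), Float64, 4096, 4096)

C = fetch(A * B)  # the block products are computed distributedly
```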

Here is an example of the results. In this case we compute the matrix product of two 4096×4096 matrices, and we show the contraction time as a function of the number of workers:

[Figure: contraction time vs. number of workers for the product of two 4096×4096 matrices]

In this plot we can see that, since we distribute each matrix in chunks of size (2048, 2048) (so 4 chunks per matrix), the computation can only be distributed over up to 8 workers (4 chunks per matrix, 2 input matrices). Moreover, we compare this against the case where the chunk has the same size as the matrix, and as expected, in that trivial case the computation cannot be distributed at all.

Since this already works quite well, I propose we test the multi-node scenario next.

This is the code I used:

```julia
using Dagger
using Statistics
using Distributed
using BenchmarkTools
using Dates

blocks_per_dim = 2  # chunks per dimension (chunk size = dim_size ÷ blocks_per_dim = 2048)
dim_size = 4096     # size of each matrix dimension
dims = 2            # it's a matrix
samples = 10
seconds = 30

# Set JULIA_PROJECT so the worker processes use the correct project environment
project_path = "/gpfs/scratch/bsc21/bsc21504/CUCO/julia/Tenet_distributed"
ENV["JULIA_PROJECT"] = project_path

# Get the list of nodes allocated by SLURM
nodes = read(`srun -l /bin/hostname`, String)
nodes = split(nodes, '\n')
workers_per_node = parse(Int, ARGS[2])

# Extract hostnames and remove duplicates and empty strings
nodes = unique([split(node, ": ")[2] for node in nodes if node != ""])
println("nodes: $nodes")

n_workers = workers()
println("Number of workers: $n_workers")

# Report the available memory (GB) on the worker's node
@everywhere function available_memory()
    mem_str = read(`free -g`, String)
    mem_lines = split(mem_str, '\n')
    mem_parts = split(mem_lines[2])
    return parse(Int, mem_parts[7])  # the "available" column of the Mem: row
end

# Check the available memory on each worker
mem_workers = Int[]
for worker in workers()
    push!(mem_workers, @fetchfrom worker available_memory())
end
mean_mem = mean(mem_workers)
std_mem = std(mem_workers)

@everywhere begin
    using Dagger
    using Dates
    blocks_per_dim = $blocks_per_dim
    dims = $dims
    dim_size = $dim_size
end

start_time = Dates.now()

# Chunk size per dimension: 4096 ÷ 2 = 2048, giving 2×2 = 4 chunks per matrix
blocks = Blocks([dim_size ÷ blocks_per_dim for _ in 1:dims]...)
println("blocks size: $([dim_size ÷ blocks_per_dim for _ in 1:dims])")

A = rand(blocks, Complex{Float64}, [dim_size for _ in 1:dims]...)
B = rand(blocks, Complex{Float64}, [dim_size for _ in 1:dims]...)

# Benchmark the distributed matrix multiplication
task = @benchmark fetch($A * $B) samples=samples seconds=seconds
time_taken = Dates.now() - start_time

mean_time = mean(task.times ./ 1e9)  # seconds
std_time = std(task.times ./ 1e9)    # seconds
memory_estimate_kb = task.memory / 1024
memory_estimate_mb = memory_estimate_kb / 1024
allocations = task.allocs
println("mean time: $mean_time, std time: $std_time, memory estimate: $memory_estimate_mb, allocations: $allocations")

# Append the results to a file
open("darray_matrix_2.txt", "a") do io
    write(io, "workers=$n_workers, workers_per_node=$workers_per_node, nodes=$nodes, " *
              "dim_size=$dim_size, blocks_per_dim=$blocks_per_dim, samples=$samples, " *
              "seconds=$seconds, memory_per_worker=$mean_mem, std_mem=$std_mem, " *
              "total_time_taken=$time_taken, times_ns_=$(task.times), " *
              "mean_time_taken_s_=$mean_time, std_time_taken_s_=$std_time, " *
              "mean_time_taken_ns_=$(mean(task.times)), std_time_taken_ns_=$(std(task.times)), " *
              "memory_estimate_kb=$memory_estimate_kb, memory_estimate_mb=$memory_estimate_mb, " *
              "allocations=$allocations\n")
end
```

And in the SLURM script I have the line `julia -p $2 --project=.`, where `$2` is the number of workers and we launch from the directory of the Julia project we want to use. See our GitHub Discussion on how to set up Julia in MareNostrum for more information, and don't hesitate to ask if you have any questions.
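For reference, the surrounding sbatch script would look roughly like this (a sketch; the job name, resources, and the script filename `benchmark_dagger.jl` are placeholders, not our exact setup):

```bash
#!/bin/bash
#SBATCH --job-name=dagger_matmul
#SBATCH --nodes=1              # increase for the multi-node tests
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00

# Launched as: sbatch launch.sh <arg1> <workers>, so $2 is the number of workers,
# matching ARGS[2] in the Julia script above.
julia -p $2 --project=. benchmark_dagger.jl $1 $2
```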

Todorbsc commented 11 months ago

Using the Dagger library (specifically `Dagger.@spawn`) to distribute multiple matrix-multiplication tasks, I was able to obtain the following computation times for different numbers of workers/nodes in MN4.

[Figure: mn-48-dynamic-procs-times — computation time vs. number of workers/nodes in MN4]

It would be worth increasing the tasks' workload (i.e. some operation computationally more expensive than a single matrix multiplication) so that the overhead of the workers' initialization becomes negligible with respect to their actual computation time.
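For instance, something along these lines could work (a sketch; `heavy_task`, the matrix size, and the repetition count are made-up placeholders):

```julia
using Distributed
addprocs(4)
@everywhere using Dagger, LinearAlgebra

# Make each task's workload tunable: chain `reps` multiplications so the
# per-task compute time dominates the worker start-up overhead.
@everywhere function heavy_task(n, reps)
    A = rand(n, n)
    B = rand(n, n)
    C = A * B
    for _ in 2:reps
        C = C * B
    end
    return sum(C)  # return a scalar so we don't ship large arrays back
end

tasks = [Dagger.@spawn heavy_task(1024, 8) for _ in 1:32]
results = fetch.(tasks)  # each task may run on a different worker
```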

jofrevalles commented 11 months ago

> Using the Dagger library (specifically `Dagger.@spawn`) to distribute multiple matrix-multiplication tasks, I was able to obtain the following computation times for different numbers of workers/nodes in MN4.

Cool! Thanks @Todorbsc! Just a couple of things:

  • Remember that for this kind of plot, it's better to have the y-axis on a log scale, so that in the ideal case we would see a straight line.
  • I think you are using nodes instead of workers. Here you are using one core per worker, right?

mofeing commented 11 months ago

Nice! Thanks @Todorbsc

> • Remember that for this kind of plot, it's better to have the y-axis on a log scale, so that in the ideal case we would see a straight line.
> • I think you are using nodes instead of workers. Here you are using one core per worker, right?

I agree. How many workers per node are you using?

Todorbsc commented 11 months ago

> > Using the Dagger library (specifically `Dagger.@spawn`) to distribute multiple matrix-multiplication tasks, I was able to obtain the following computation times for different numbers of workers/nodes in MN4.
>
> Cool! Thanks @Todorbsc! Just a couple of things:
>
> • Remember that for this kind of plot, it's better to have the y-axis on a log scale, so that in the ideal case we would see a straight line.
> • I think you are using nodes instead of workers. Here you are using one core per worker, right?

You're right. Although note that the number of workers must also be on a logarithmic scale for the ideal case to be a decreasing straight line; otherwise, if only the times (y values) are on a logarithmic scale, a straight ideal line would imply exponential speedups when adding workers.

Sorry, where I wrote "nodes" I meant "workers".

Here's the previous plot on a logarithmic scale (base 2), together with what would be the ideal case (again, read "nodes" as "workers").

[Figure: mn-48-dynamic-procs-logtime — computation time vs. number of workers on log-log axes (base 2), with the ideal linear-speedup line]
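For anyone reproducing this kind of plot, a sketch in Plots.jl (the timings below are made up for illustration):

```julia
using Plots

workers = [1, 2, 4, 8, 16, 32]                  # hypothetical worker counts
times   = [100.0, 55.0, 30.0, 18.0, 12.0, 9.0]  # hypothetical times (s)

ideal = times[1] ./ workers  # perfect linear speedup: t(w) = t(1) / w

plot(workers, times; xscale=:log2, yscale=:log2, marker=:circle,
     xlabel="workers", ylabel="time (s)", label="measured")
plot!(workers, ideal; linestyle=:dash, label="ideal (linear speedup)")
```

On log-log axes the ideal curve t(1)/w is a straight line with slope −1, which is why both axes need the log scale.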

mofeing commented 11 months ago

Be careful about using the term "exponential speedup" 😂 In the ideal case, you get a linear speedup as you increase the number of workers (i.e. ideally 2× workers ⇒ 2× speedup). In practice, you get a sublinear speedup.
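For reference, in the usual notation:

```latex
S(w) = \frac{T(1)}{T(w)}, \qquad
\text{linear (ideal): } S(w) = w \;\Longleftrightarrow\; T(w) = \frac{T(1)}{w}, \qquad
\text{sublinear: } S(w) < w .
```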