Closed: jlperla closed this issue 5 years ago
The core of the algorithm is the At_mul_B! part of the LSMR, which, for a matrix of fixed effects, essentially requires computing the mean of a variable within groups defined by the fixed effects. This is done by the function helperC here:
https://github.com/matthieugomez/FixedEffects.jl/blob/master/src/FixedEffectMatrix/LSMR.jl#L136
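To illustrate the operation (this is a hedged sketch of the within-group mean computation described above, not the package's actual helperC; the names groupmean, refs, and ngroups are hypothetical):

```julia
# Sketch: mean of x within groups defined by a fixed effect.
# refs[i] gives the group index of observation i; ngroups is the
# number of distinct groups. A single pass accumulates sums and counts.
function groupmean(x::Vector{Float64}, refs::Vector{Int}, ngroups::Int)
    sums = zeros(Float64, ngroups)
    counts = zeros(Int, ngroups)
    @inbounds for i in eachindex(x)   # one pass over the data
        g = refs[i]
        sums[g] += x[i]
        counts[g] += 1
    end
    return sums ./ counts             # per-group means
end
```

Because each observation is touched exactly once and the arithmetic per element is trivial, the loop spends most of its time waiting on memory, which is why the function is memory bound.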
Now, I'm not sure the function helperC would gain much from a GPU. It is typically memory bound rather than CPU bound, so there is little gain from parallelizing or multi-threading. But it is a simple function, so it should be easy to experiment.
Thanks @matthieugomez We will think about this, and it might make a good project for someone! Another question: I wasn't sure whether this was using matrix-free methods in the iteration, or whether it was built for CSC sparse matrices.
Finally, Raffa Saggio (@rsaggio87) and I were discussing algorithms for two-way fixed effects, and for his application he found that a combinatorial multigrid solver (https://www.cs.cmu.edu/~jkoutis/cmg.html) was a game-changer. Have you come across that before, and do you have a sense of whether that solver and preconditioning approach might apply to general problems?
Yes, it is matrix-free. Using CSC sparse matrices would be much slower (fixed-effects matrices have a particular structure, so A * x can be done in one pass).
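A hedged sketch of why one pass suffices (illustrative names only, not the package's API): a fixed-effect "matrix" A is an n × G dummy matrix with A[i, refs[i]] = 1, so A * x is a gather and A' * y is a scatter-add, and neither requires materializing CSC storage:

```julia
# (A * x)[i] = x[refs[i]]  -- gather, one pass over observations
Amul(x, refs) = [x[refs[i]] for i in eachindex(refs)]

# (A' * y)[g] = sum of y over observations in group g -- scatter-add
function Atmul(y, refs, ngroups)
    out = zeros(eltype(y), ngroups)
    for i in eachindex(refs)
        out[refs[i]] += y[i]
    end
    return out
end
```

This is exactly the pair of operations an iterative solver like LSMR needs, which is why no explicit sparse matrix is ever built.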
Yes I think @sergiocorreia told me about this algorithm. I think he had a hard time coding it in Mata (at least five years ago), so I never really tried. But it should not be hard to do now that the Matlab code is available. I've not looked at it in detail so I'm not sure whether it can be generalized to more than two fixed effects.
Great, just checking on the matrix-free part. But it doesn't sound like left multiplies over big vectors with some sort of convolution (where GPUs shine) are the crux. We may see if a student wants to look at within-group means with GPUs at some point this year.
For the CMG algorithm, it seems like (1) most of the code is in C for a MEX file, so it could be reorganized and deployed with BinaryBuilder, but (2) like much Matlab code, it has no associated open-source license. Maybe I will talk to the authors to see if they know of newer implementations, or a Matlab-free implementation that would be easier to build binaries for.
Seems like this https://github.com/danspielman/Laplacians.jl could be helpful too.
Interesting. Maybe worth investigating for a student.
@rsaggio87 if you have a minute, take a look at https://danspielman.github.io/Laplacians.jl/latest/ to see whether there is an algorithm comparable to the one you are using?
Hi! I just looked into https://danspielman.github.io/Laplacians.jl/latest/. It has exactly the routine that I was mentioning and that I am using for my project, see
KMPLapSolver and KMPSDDSolver: linear equation solvers based on the paper "Approaching optimality for solving SDD systems" by Koutis, Miller, and Peng, SIAM Journal on Computing, 2014.
Hi Matthieu, Raffaele, Jesse,
I was not aware of Dan Spielman's code, this is absolutely amazing.
That said, from my experience with this paper, there are some practical difficulties with this approach that might make it a bit too slow, even independently of implementation concerns.
I would love to update my beliefs on this though, so if you or any of your students learn anything on this topic, definitely do let me know.
On a related topic, I have around a dozen real-life large datasets (shareable and anonymized where needed) that I used to benchmark reghdfe and that can be useful when testing any algorithm. For instance, the figure below shows the adjacency matrix of an employer-employee dataset:
And this one has CEOs-firms:
And for comparison this is how most synthetic datasets would actually look:
If you are interested, I can upload these datasets to my website or to GitHub. It would just be something equivalent to the UF Sparse Matrix Collection, which many researchers in applied math use to test their algorithms.
Best, Sergio
@sergiocorreia Thanks so much for the response, we will analyze it. Are there any other state-of-the-art linear-algebra algorithms worth investigating for addition to this package? Do you feel that preconditioners have been adequately studied, for example? It is a lot easier to test out algorithms in Stata/Mata,
Do you guys feel that preconditioners have been adequately studied, for example?
Not really. You can probably get quite a few of the improvements that spectral sparsifiers bring by using good preconditioners. If you go through the papers, you'll probably see that some of the steps are quite easy to do compared to the others.
For instance, in an employer-employee panel, suppose that an individual has worked at the same firm all his life. To solve a Laplacian (and, more generally, to partial out two-way fixed effects) you can actually ignore his observations (along with those of every other individual who was at a single firm in the sample), then solve a smaller, faster system of equations, and finally back out the fixed effects of the individuals you initially ignored.
There are a few shortcuts like these that really speed things up in both worst-case-examples as well as real scenarios, and are relatively easy to implement (and explain).
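As a rough sketch of the first step of the shortcut described above (all names hypothetical; this only identifies which observations can be set aside, not the full solve-and-back-out procedure):

```julia
# Split observations into those that must enter the reduced system
# and those belonging to individuals seen at only one firm, whose
# person effects can be recovered afterwards from within-person means.
function split_single_firm(person::Vector{Int}, firm::Vector{Int})
    firms_seen = Dict{Int,Set{Int}}()
    for i in eachindex(person)
        push!(get!(firms_seen, person[i], Set{Int}()), firm[i])
    end
    single = Set(p for (p, fs) in firms_seen if length(fs) == 1)
    keep = [i for i in eachindex(person) if !(person[i] in single)]
    drop = [i for i in eachindex(person) if person[i] in single]
    return keep, drop
end
```

The reduced system is then solved on the `keep` observations only, and the dropped individuals' effects are backed out afterwards.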
I hacked together some CUDA support for lsmr, see https://github.com/schrimpf/FixedEffects.jl/tree/gpu
On my hardware (Xeon E5-2640 and GeForce GTX 1060), for the first benchmark in benchmarks.jl, I get a 20x speedup on the GPU.
julia> include("benchmark.jl")
CPU
531.754 ms (214 allocations: 6.78 KiB)
GPU
25.190 ms (1613 allocations: 59.11 KiB)
julia> cx ≈ gx
true
The code was mostly a matter of passing around CuArrays in place of Arrays. I made some changes to the type definitions in FixedEffectLinearMap.jl to facilitate this, and it is likely that some of the changes are not ideal. I did have to write CUDA kernels for mean! and demean!. The implementation of mean! is very simple, and I expect it will perform poorly when the number of categories is small relative to the number of concurrently running GPU threads. This could be improved with a better GPU reduction algorithm.
@schrimpf Amazing. We should try it out on the cluster GPU! This looks like a great project for some students to take on for their end of year project.
That's impressive. Feel free to make a pull request, I will merge it as is as long it passes the tests.
This is amazing! From a non-Julia PoV, it also means that if I hook up Stata to SciPy's implementation of LSMR, it might then also work with CUDA.
Ok, I have pushed a version of your code that adds :lsmr_gpu as an option. The syntax is
using CuArrays, FixedEffects
solve_residuals!(...., method = :lsmr_gpu)
With FixedEffectModels:
using CuArrays, FixedEffectModels
reg(...., method = :lsmr_gpu)
Unfortunately, I cannot exercise this method in the tests, since Travis does not support GPUs.
Cool. I'll test locally, and look into making tests that work with Travis using the CuTestArray approach in CUDAnative.jl.
Yes, let me know! One difference from your code is that I convert everything to Float32, since some GPUs don't support Float64.
Ok. I just tagged a new version of the package with CUDA support. Thanks a lot to Jesse and Paul, both for pushing the idea and for helping me with the code.
I opened another issue for the Laplacian solver.
@matthieugomez I was discussing your package with @schrimpf and we were curious about whether GPUs could help solving these multi-way fixed effect models.
It seems like your LSMR in https://github.com/matthieugomez/FixedEffects.jl/blob/master/src/utils/lsmr.jl is an iterative method, and there is at least a decent chance the kernel could be written for the GPU? Have you thought about this?
Also, am I wrong in thinking that this is the core of the algorithm and everything else is just setting up the appropriate matrix-free linear map?