66Ton99 opened this issue 4 years ago
Hello, please feel free to scavenge things from there; it might help you get rid of bottlenecks. I suspect the tolerance setting is too low.
https://github.com/moe123/macadam/tree/master/macadam/details/numa
Thank you @moe123, I'll take a look. I assume you are referring to your SVD implementation here? Do you have any benchmarks for this code, e.g. on an 8905 x 316 matrix like the one above?
@dpilger26 Not really for now. The parameter descriptions are missing, though; I should add those. Check your implementation first: you might be trying to converge to too low a value, i.e. with too-high expectations, which would explain why the reduction is very slow and somehow runs for a very high number of iterations for nothing. BTW, by any standard an 8905 x 316 matrix is something very big, and beyond the dimensions, the very nature of the data must be known: is it well conditioned? The OP should reconsider the mathematical approach to the problem; maybe these are not the right tools.
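The "is it well conditioned?" question above can be answered quickly before blaming the solver. A sketch in NumPy (the OP's actual data was not shared, so a random matrix of the same shape stands in for it; random Gaussian data is usually well conditioned, real data may not be):

```python
import numpy as np

# Hypothetical stand-in for the OP's 8905 x 316 matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((8905, 316))

s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
cond = s[0] / s[-1]                     # 2-norm condition number
print(f"largest sv = {s[0]:.3g}, smallest sv = {s[-1]:.3g}, cond = {cond:.3g}")
```

A large condition number means small singular values that an SVD-based solver with a tight tolerance will grind on, so this one number already says a lot about whether the tolerance or the data is the problem.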
NumPy's lstsq function is faster. Benchmark details: lstsq with A (201, 200) and B (201, 1) takes 6 seconds in NumPy (Python) and about 98 seconds in NumCpp.
@Aditya-11 That does not mean anything by itself; the two underlying implementations differ by a lot. What are your calling arguments, especially double inTolerance = 1e-12?
If it was left at the default, I am not surprised; as I briefly mentioned early on, that is one of the culprits, and surely it is also driven by the conditioning of your data; until everyone decides to be rational about this, the issue will never be addressed. Also: how do you compile this project? Which compiler? Is vectorization enabled, and so on?
Thanks @moe123, I will check this inTolerance parameter. The compiler used is Clang. If the preconditioning is bad, then I would guess the NumPy answer should also be slow.
@Aditya-11 No. The NumPy implementation differs by a lot (it is a ~20-year-old project, so older than you); it also takes rcond into account, selects different strategies for the decompositions, applies residual decimation steps, and so on. Conditioning is never good or bad in itself; numerical analysis is not about pseudo-moralistic judgments but about facts; it is what it is, contextually. So what we are actually talking about here is the logical relationship between a tolerance threshold and the nature of the input data for a given algorithm using an SVD solver. Within this implementation, the implied tolerance should not exceed roughly 1e-7 even in the most optimistic scenario. What are your flags? loop-vectorize? Optimizing for size? And is it too much to ask that you upload your actual benchmarks? Until now, the discussion has been nothing but unverifiable assertions.
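For reference on the rcond point: NumPy exposes the singular-value cutoff through the rcond argument of numpy.linalg.lstsq, where singular values below rcond times the largest singular value are treated as zero. A small sketch (the 3 x 2 system below is made up to illustrate the effect of the cutoff on the effective rank):

```python
import numpy as np

# Nearly rank-deficient system: the second column is almost exactly
# twice the first, up to a 1e-9 perturbation.
A = np.array([[1.0, 2.0],
              [2.0, 4.0 + 1e-9],
              [3.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

# Tight cutoff: the tiny singular value is kept, effective rank = 2.
_, _, rank_tight, _ = np.linalg.lstsq(A, b, rcond=1e-15)

# Loose cutoff: the tiny singular value is discarded, effective rank = 1.
_, _, rank_loose, _ = np.linalg.lstsq(A, b, rcond=1e-6)

print(rank_tight, rank_loose)  # -> 2 1
```

The same trade-off applies to NumCpp's inTolerance: a tighter threshold keeps near-zero singular values in play and can make the iteration dramatically slower on poorly conditioned data.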
@moe123 I have changed the inTolerance argument to 1e-6; it is still much slower than the Python implementation.
Benchmark details: n = 200, a_1 (200, 200), b_1 (200, 1).
In C++:
for (int i = 0; i < n; i++) { auto c1 = nc::linalg::lstsq(a_1, b_1, inTolerance); }
Time: 100-160 s.
In Python:
for i in range(n): c_ = num.linalg.lstsq(a_1, b_1)
Time: 6-8 s.
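So that others can reproduce these numbers, a self-contained version of the Python side might look like the following (random 200 x 200 data stands in for the actual matrices, which were not posted):

```python
import time

import numpy as np

# Hypothetical data matching the reported shapes.
rng = np.random.default_rng(42)
a_1 = rng.standard_normal((200, 200))
b_1 = rng.standard_normal((200, 1))

n = 200
start = time.perf_counter()
for i in range(n):
    c_ = np.linalg.lstsq(a_1, b_1, rcond=None)
elapsed = time.perf_counter() - start
print(f"{n} lstsq calls on a 200x200 system: {elapsed:.2f} s")
```

Posting a script like this alongside the C++ loop (plus the compiler flags) would make an apples-to-apples comparison possible.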
Is there anything else I should change? I had previously implemented the GMRES algorithm in Python using NumPy and then converted it to C++ using NumCpp, and it was slower too; that is why I suspected lstsq might be the slow part.
@Aditya-11 Do you know the definition of science? It is a set of methods that, given to one person, can be reproduced by another (*). What is your Valgrind (or similar profiler) output? Where is the CPU time actually spent?
(*) So would you share the content of your data so that an apples-to-apples comparison can take place? The same remark goes for the OP. To address an issue, you need to understand it, meaning literally circumscribing it; this is true of anything in this world, but it requires embracing the idea of making an effort, and we are not there yet.
-1 Does the fact that NumPy runs faster help by itself? No; it is a given reference that we would very much like to reproduce or approach.
-2 Does the complexity of this issue reside in repeating, over and over, that one is faster and the other slower, without exposing the entire problem properly and openly, in a context that can be shared and that anyone can reproduce? No; the real complexity begins once this simple basic step is achieved (for now, you have failed at it, and so has the OP): what is the set of rational options we have for improving the actual algorithm?
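On the "where does the CPU time go" question: short of Valgrind on the C++ side, the Python side can at least be profiled with the standard library's cProfile, which would show whether the time is in the LAPACK call itself or in overhead around it. A sketch (random 200 x 200 data again, since the real matrices were not shared):

```python
import cProfile
import io
import pstats

import numpy as np

# Hypothetical data matching the reported shapes.
rng = np.random.default_rng(0)
a = rng.standard_normal((200, 200))
b = rng.standard_normal((200, 1))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(50):
    np.linalg.lstsq(a, b, rcond=None)
profiler.disable()

# Report the ten most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```

An equivalent profile of the NumCpp loop (Valgrind/callgrind, perf, or Instruments) would pinpoint whether the time is spent inside SVD::decompose() iterations, which is what the tolerance discussion predicts.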
Hi @Aditya-11, can you provide the flags used in CMake? It might be something related to the compilation of the library/executable itself, threading, etc. Both NumPy and this package rely on LAPACK which, as others mentioned, is quite old and stable.
Describe the bug: it takes forever to do SVD::decompose().
To Reproduce: run linalg.lstsq with a big NdArray; mine was 8905 rows and 316 cols.
Expected behavior: it should be much faster.