zero copy, 64-bit ptr_type and 32-bit col_type

BrunoLevy commented 2 years ago

Hi, I'm using AMGCL for reconstruction in cosmology (large unstructured Poisson systems), and it works extremely well (so congratulations and thank you for this very nice piece of software). My matrix representation is optimized for large systems (10^8 times 10^8 with 16 NNZ per row in average), so I'm representing it using 32-bit indices for the col array, and 64-bit indices for the ptr array. I'm using the zero_copy adpater, that requires all indices to be 64 bits (so I need to copy the col array, which represents 13 GBytes of data). Is there a way to configure AMGCL to use 32 bit col indices ? I have seen that the internal crs matrix representation is templated by value type, col index type and ptr index type, and that the builtin backend uses the same type for col index and ptr index. I have tried to modify the builtin backend (with no success, it compiled but crashed at execution time), do you have any advice ?

ddemidov commented 2 years ago

I am sorry, but there is no easy solution for this currently. The builtin backend hard-codes the index_type for the internal matrices to be ptrdiff_t (64 bit), and the zero-copy adapter is only able to skip the copy by directly using the pointers to the CRS arrays, so these should be binary-compatible with builtin matrices.

It should be possible to parameterize the builtin backend not only on the value type, but also on pointer/column types, but it will require careful testing of the changes, and I am not sure how difficult would that be.

ddemidov commented 2 years ago

This example of using the builtin backend with mixed index types seems to work with 87a6c63. Please see if it also works for you.

BrunoLevy commented 2 years ago

Thanks ! (wow, that was fast !) I have tried, but with my examples it crashes. Valgrind indicates an invalid 8-bytes memory read in spgemm_saad(). I checked in this function that Col is 4 bytes and Ptr 8 bytes (by printing sizeof(Col) and sizeof(Ptr) to stdout). It may come from somewhere else in my code (still investigating...).

ddemidov commented 2 years ago

Can you check if your example works without using the zero_copy_direct adapter (with std::tie(n, ptr, col, val))? If that is the case, please make sure your matrix is sorted column-wise within each row. You can do this with

auto A = amgcl::adapter::zero_copy_direct(...);
amgcl::backend::sort_rows(*A);

If your example still crashes after that, can you reduce your system so that it is small enough to post here but the issue is still reproducible, and share it in matrix market format?

BrunoLevy commented 2 years ago

It does not compile with std::tuple<>, which makes me think that my types declaration is not correct.

I am using:

typedef amgcl::backend::builtin<double, uint32_t, uint64_t> Backend;

typedef amgcl::amg<
    Backend,
    amgcl::runtime::coarsening::wrapper,
    amgcl::runtime::relaxation::wrapper
> AMG;
typedef amgcl::runtime::solver::wrapper<Backend>  ISolver;
typedef amgcl::make_solver<AMG, ISolver>          Solver;

typedef boost::property_tree::ptree               Params;

/*
    auto M_amgcl = std::tie(
    nn, M->rowptr, M->colind, M->val
    );
*/

    auto M_amgcl = amgcl::adapter::zero_copy_direct(
    size_t(n), M->rowptr, M->colind, M->val
    );

    amgcl::backend::sort_rows(*M_amgcl);

    Solver solver(M_amgcl,prm);
    int used_iterations;
    double residual;
    std::tie(used_iterations, residual) = solver(
    boost::make_iterator_range(b, b + n),
    boost::make_iterator_range(x, x + n)
    );

With std::tie(), it does not compile:

/home/blevy/Programming/warpdrive/src/../third_party/amgcl/amgcl/backend/interface.hpp:108:18: error: ‘const class std::tuple<long unsigned int&, long unsigned int*&, unsigned int*&, double*&>’ has no member named ‘cols’
  108 |         return A.cols();

With zero_copy_direct it crashes in Solver constructor. Investigating with valgrind, it does 8 bytes invalid reads in spgemm_saad(), which lets me think that a part of the types are still instanciated with 8 bytes integers for COL.

Works zero_copy using same ptr and val, and col copied from my representation (32-bit) to a temporary vector with 64-bit integers.

ddemidov commented 2 years ago

The tuple should contain ranges (instances of std::vector or amgcl::iterator_range), not pointers to the data. So try with

auto M_amgcl = std::tie(n,
    amgcl::make_iterator_range(M->rowptr, M->rowptr + n + 1),
    amgcl::make_iterator_range(M->colind, M->colind + M->rowptr[n]),
    amgcl::make_iterator_range(M->val,    M->val    + M->rowptr[n])
    );

BrunoLevy commented 2 years ago

It still complains:

/home/blevy/Programming/warpdrive/src/../third_party/amgcl/amgcl/backend/interface.hpp:99:18: error: ‘const class std::tuple<long unsigned int&, amgcl::iterator_range<long unsigned int*$
   99 |         return A.rows();

/home/blevy/Programming/warpdrive/src/../third_party/amgcl/amgcl/backend/interface.hpp:172:43: error: no type named ‘row_iterator’ in ‘class std::tuple<long unsigned int&, amgcl::iterat$
  172 |     typedef typename Matrix::row_iterator type;

home/blevy/Programming/warpdrive/src/../third_party/amgcl/amgcl/backend/builtin.hpp:133:44: error: no matching function for call to ‘row_begin(const std::tuple<long unsigned int&, amgc$
  133 |             for(auto a = backend::row_begin(A, i); a; ++a) ++row_width;

ddemidov commented 2 years ago

Did you include <amgcl/adapter/crs_tuple.hpp>?

ddemidov commented 2 years ago

My example above uses the same sizes for column indices and row pointers as you do, so the problem with still using incorrectly sized data in spgemm_saad seems unlikely. valgrind does not complain in my case.

BrunoLevy commented 2 years ago

OK, then I'll try to start from the example and "morph" it into my code, I probably have a declaration that is wrong somewhere !

It compiles with std::tie, and crashes the same way as with zero_copy_direct (note: sort_rows() does not compile and complains about missing operator *).

ddemidov commented 2 years ago

Which version of the matrix did you send to the sort_rows()? It should be the one coming out from the zero_copy_direct. I have just updated the example so that it uses sort_rows(), and it does compile.

ddemidov commented 2 years ago

But, if it crashes with std::tie() then the problem should not be with sorting (the matrix comping from std::tuple is copied into an internal representation and sorted afterward). Can you share a smaller system here or via email?

EDIT: You can use amgcl::io::mm_write("A.mtx", A) to save the matrix into the matrix market format.

BrunoLevy commented 2 years ago

Yep, I'll do that ! Thanks !

BrunoLevy commented 2 years ago

Found out ! I was using unsigned integers (uint32_t, uint64_t), I just tried with int and ssize_t, and it works !

ddemidov commented 2 years ago

I see, it should be fixed in 8e669cda7a586b7bed5770c8d8052c38f027d0a8.

BrunoLevy commented 2 years ago

I may be wrong, but I think that ja should be a Ptr as it was before (indexing arrays in [0..NNZ-1] range), not an Idx (indexing arrays in [0..N-1] range) (and same thing for the other changes)

ddemidov commented 2 years ago

Idx is defined as ptrdiff_t, so it should be enough. Using Ptr could be slightly more efficient in cases where it is 32bit, but the marker array has to be signed for the algorithm to work, and row_end has to be signed as well for this comparison to work.

BrunoLevy commented 2 years ago

Ah OK ! (I thought that Idx was the type of the col array entries), thanks ! I'll send the update to my colleague with the 100Mx100M system and tell you the performance gain (I know it will save more than 10GB of RAM, and regarding timings my early tests show that there will be a non-negligible gain as well). A big THANK YOU for your wonderful code and for your help.

ddemidov commented 2 years ago

Do you also use a GPU backend with the code? These changes are not yet in the vexcl or cuda backends. It should be possible to do for the vexcl one, but probably not for the cuda one, since the cuda backend uses standard cusparse matrices.

BrunoLevy commented 2 years ago

For the GPU, not yet but I plan to try it. Note: my systems are in general too big to fit in GPU RAM though (but I need to try mixed-precision maybe). And for now, your code is so fast that the overall time spent in it becomes negligible as compared to the rest ! (before then, with my simple Jacobi-preconditioned CG, it was the lion's share). My bottleneck is now on the geometric part (need to compute a large number of Voronoi/Laguerre diagrams). I'm currently testing a different algorithm for that.

ddemidov commented 2 years ago

Mixed precision could also help with the builtin backend performance.

I'll see if it is possible to update the vexcl backend and push the changes to the master branch.

ddemidov commented 2 years ago

I have updated the vexcl backend so that it is possible to specify index/pointer types for its matrices, squashed the commits from mixed-index-types branch and pushed the result to master.

The updated example from above: https://gist.github.com/ddemidov/dd25fb4a51be5b3dc2f78c0870759b1e

BrunoLevy commented 2 years ago

Just tested master: it works perfectly ! (did not test the GPU backend yet).

BrunoLevy commented 5 months ago

Hi Denis, Just to mention our article on modified gravity here that cites AMGCL (and you are in the acknowledgements !). Thank you very much for this great library and for your help. Cheers, -- Bruno

ddemidov commented 4 months ago

Hi Bruno,

Thank you for letting me know and for the aknowlegement! And congratulations on the paper, it looks awesome :)

ddemidov / amgcl

zero copy, 64-bit ptr_type and 32-bit col_type #216