NNPDF / pineappl

PineAPPL is not an extension of APPLgrid
https://nnpdf.github.io/pineappl/
GNU General Public License v3.0
12 stars 3 forks source link

Add new type `PackedArray` #275

Closed janw20 closed 2 months ago

cschwan commented 2 months ago

PackedArray is possibly a bad name and that would be my fault. If anyone can think of a better name let us know!

The idea is to have a similar interface as the array class from ndarray, but not store zeros to keep down the file size, so PackedArray<T, D> will be the successor ofSparseArray3<T>. The new type will compress a bit better and more importantly it supports arbitrary dimensions.

alecandido commented 2 months ago

The idea is to have a similar interface as the array class from ndarray, but not store zeros to keep down the file size, so PackedArray<T, D> will be the successor ofSparseArray3<T>. The new type will compress a bit better and more importantly it supports arbitrary dimensions.

Yes, I just read the first part, and it was not obvious that it was the sparse array, sorry.

Skimming through it, there were no docs, so I made the wrong inference. I deleted the comment right after, as soon as I've seen the Index and IndexMut implementation.

cschwan commented 2 months ago

No worries! We're happy you're interested and that you're looking at the code.

alecandido commented 2 months ago

Btw: what about just using SparseArray? (w/o the 3)

cschwan commented 2 months ago

The adjective 'sparse' never made sense, because when you'd have to decide whether the arrays were sparse or dense, they're clearly dense.

Another way of explaining it is that the data structures for representing sparse matrices would clearly be inefficient for our use cases and in fact need much more memory. There's usually a connected subset of the array that is non-zero, and when you integrate long enough the this subset should even be simply connected.

alecandido commented 2 months ago

From your comment, I'd say that it is an optimized strategy for the sparse matrix representation (as there is not a single best strategy). I.e., your matrix has a relevant amount of default elements, and you could compress your representation by avoiding writing down all elements.

However, names are relevant but not crucial. So, if packed sounds more appropriate, just keep that :)

cschwan commented 2 months ago

From your comment, I'd say that it is an optimized strategy for the sparse matrix representation (as there is not a single best strategy). I.e., your matrix has a relevant amount of default elements, and you could compress your representation by avoiding writing down all elements.

That's basically it, we first map the D-dimensional indices to one-dime sional and then we store the non-zero (non-defaulted) elements a linearized vector. For zeros we just save where they are and how many of them are consecutive. Finally there's an exception where it makes sense to explicitly store zeros, however in practice it shouldn't make a big difference.

janw20 commented 2 months ago

After looking into the sparse matrix Wikipedia article, I think we're doing a kind of band matrix (at least for D=2), which (according to Wikipedia) is a sparse matrix. So I think I would be fine with calling it SparseArray, also because that is more understandable and consistent with SparseArray3

janw20 commented 2 months ago

I wrote 4 benchmarks with the criterion crate where 500 random elements are inserted into a SparseArray3 or a PackedArray of shapes [40, 50, 30] or [10, 5, 7], and it seems like the performances of both are very similar, with PackedArray being about ~5-10% faster, depending on the RNG seed

felixhekhorn commented 2 months ago

just for the sake of knowledge transfer: I think I understand the general idea (you're saving coordinates), but index_mut could only profit from some more comments :see_no_evil: (as this is clearly a crucial part)

janw20 commented 2 months ago

Hi Felix, do you mean /// documentation for the index_mut function? Or just in-line comments? Also, how is this normally done, should I commit to this pull request again, and afterward it will be again rebased? Or should I open a new branch on the main repo?

cschwan commented 2 months ago

@janw20: I believe @felixhekhorn wants a better description of what's going inside index_mut. In that case document it with // inside the method, because the publicly available documentation (the one marked with ///) should only explain what the method does, not how. You can commit straight to master, let me know if you have permission problems.

janw20 commented 2 months ago

I improved the documentation and committed it in https://github.com/NNPDF/pineappl/commit/6a711bbd347a0f1d090ddac043ce0358cc5fa45d to master.