bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
166 stars 17 forks source link

Ia/insertion radix sorting #134

Open immanuelazn opened 1 month ago

immanuelazn commented 1 month ago

Description

Previously we added a change for InsertionIterator to use a priority queue, as we were experiencing memory errors with the original vector + radixsort method. However, the priority queue solution is slower, so we planned to bring back the vector +radixsort solution once we determined the problem.

Tests

Details

The problem ended up being pretty simple--just a mistake in the direction of arrows for a conditional within buffer resizing. Even without resizing within the nextChr() call, the iterator still functions at datasets of at least 2600 cells, which indicates that the solution now works. Memory profiling indicates that usage never spiked beyond 1 GB.

I also increased the amount of comments so it is a little easier to see the thought process on how loading fragments works

bnprks commented 1 month ago

To recap some discussion offline: I think the direction of comparison was right for the intended purpose -- i.e. resizing the end_data buffer if too small a fraction of it was used in order to avoid re-sorting the same buffer contents repeatedly.

Swapping the comparison direction prevented the over-zealous size increases, but could leave the buffer contents getting sorted many times.

I think the core problem was the comparison with end_data.size() was not the right way to figure out the amount of leftovers, since the logic wouldn't necessarily have filled data all the way out to end_data.size().

The new suggested logic will always fill out the full end_data buffer when possible to avoid this problem