Description

Previously we added a change for InsertionIterator to use a priority queue, as we were experiencing memory errors with the original vector + radixsort method. However, the priority queue solution is slower, so we planned to bring back the vector +radixsort solution once we determined the problem.

Tests

Ran tests in test-atac_utils.R which already had tests for writing bedfiles.
Used r/tests/real_data/ArchR_insertions.R which compares against insertion writing in ArchR, and results remained as we expected.
Ran and profiled using a filtered PBMC dataset of 2600 cells

Details

The problem ended up being pretty simple--just a mistake in the direction of arrows for a conditional within buffer resizing. Even without resizing within the nextChr() call, the iterator still functions at datasets of at least 2600 cells, which indicates that the solution now works. Memory profiling indicates that usage never spiked beyond 1 GB.

I also increased the amount of comments so it is a little easier to see the thought process on how loading fragments works

To recap some discussion offline: I think the direction of comparison was right for the intended purpose -- i.e. resizing the end_data buffer if too small a fraction of it was used in order to avoid re-sorting the same buffer contents repeatedly.

Swapping the comparison direction prevented the over-zealous size increases, but could leave the buffer contents getting sorted many times.

I think the core problem was the comparison with end_data.size() was not the right way to figure out the amount of leftovers, since the logic wouldn't necessarily have filled data all the way out to end_data.size().

The new suggested logic will always fill out the full end_data buffer when possible to avoid this problem

bnprks / BPCells

Ia/insertion radix sorting #134

Description

Tests

Details