immanuelazn opened 1 month ago
To recap some discussion offline: I think the direction of comparison was right for the intended purpose, i.e. resizing the `end_data` buffer if too small a fraction of it was used, in order to avoid re-sorting the same buffer contents repeatedly. Swapping the comparison direction prevented the over-zealous size increases, but could leave the buffer contents being sorted many times.

I think the core problem was that the comparison with `end_data.size()` was not the right way to figure out the amount of leftovers, since the logic wouldn't necessarily have filled data all the way out to `end_data.size()`. The new suggested logic will always fill out the full `end_data` buffer when possible, to avoid this problem.
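A minimal sketch of the suggested loop, under stated assumptions: all names (`fill_sort_flush`, `safe_bound`, the incoming vector) are hypothetical, and `std::sort` stands in for the actual radix sort. The point is only the structure: fill `end_data` all the way to capacity before sorting, so the leftover count is a meaningful fraction of the buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hedged illustration, not BPCells' actual code. Each pass fills end_data to
// capacity, sorts it (radix sort in the real code), flushes the entries that
// are known to be final, and carries the rest over to the next pass.
// `safe_bound` is the coordinate below which entries can be output safely.
std::vector<uint32_t> fill_sort_flush(std::vector<uint32_t> &end_data,
                                      size_t capacity,
                                      const std::vector<uint32_t> &incoming,
                                      size_t &pos,
                                      uint32_t safe_bound) {
    // Always fill the buffer out to its full capacity while input remains,
    // so "fraction of the buffer that is leftovers" is measured correctly.
    while (end_data.size() < capacity && pos < incoming.size()) {
        end_data.push_back(incoming[pos++]);
    }
    std::sort(end_data.begin(), end_data.end()); // stand-in for the radix sort
    // Flush entries below safe_bound; keep the rest for re-sorting next pass.
    auto split = std::lower_bound(end_data.begin(), end_data.end(), safe_bound);
    std::vector<uint32_t> out(end_data.begin(), split);
    end_data.erase(end_data.begin(), split);
    return out;
}
```

With this shape, the number of leftovers after a pass directly reflects how much of a fully filled buffer must be re-sorted, which is the quantity the resize check wants to compare against.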
Description
Previously we added a change for `InsertionIterator` to use a priority queue, as we were experiencing memory errors with the original vector + radix sort method. However, the priority queue solution is slower, so we planned to bring back the vector + radix sort solution once we determined the problem.

Tests
- `test-atac_utils.R`, which already had tests for writing bed files.
- `r/tests/real_data/ArchR_insertions.R`, which compares against insertion writing in ArchR; results remained as we expected.

Details
The problem ended up being pretty simple: just a mistake in the direction of the comparison operator in a conditional within the buffer resizing. Even without resizing within the `nextChr()` call, the iterator still functions on datasets of at least 2600 cells, which indicates that the solution now works. Memory profiling indicates that usage never spiked beyond 1 GB.

I also added more comments, so it is a little easier to follow the thought process behind how fragment loading works.