GMOD / tabix-js

Read tabix-indexed files, with either .tbi or .csi indexes, in Node.js or the browser
MIT License

Use TextDecoder and yield based on time instead of number of lines #117

Closed cmdcolin closed 3 years ago

cmdcolin commented 4 years ago

I was testing some raw tabix parsing performance and noticed a couple of things; the use case was parsing an entire chromosome up front.

Some notes

1) This results in large chunk sizes. We can look into this, but I think chunkSizeLimit should be removed. It is basically an internal condition, but it has been constantly exposed to end users and causes frustration.

2) buffer.toString(), on the browserified Node.js Buffer, results in a ton of "GC pressure" (see screenshot 1), but changing to a TextDecoder on buffer.buffer results in no GC pressure and is much faster. This is on 11 MB of data gzipped, 45 MB unpacked into an in-memory buffer, converted to a string.

3) The yield while parsing also causes slowdowns when raw performance is the goal. I don't know the best answer, but instead of yielding after a set number of features it could be a timer-based yield.
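The TextDecoder change in point 2 can be sketched roughly as follows (the function names here are hypothetical for illustration, not tabix-js's actual internals): decode the raw bytes directly with TextDecoder rather than going through the browserified Buffer's toString().

```typescript
// Hypothetical sketch: decode raw bytes to a string with TextDecoder,
// which avoids the browserified Buffer's toString() and the GC pressure
// it creates on large (tens-of-MB) inputs.
const utf8Decoder = new TextDecoder('utf8')

function decodeBytes(bytes: Uint8Array): string {
  // TextDecoder operates directly on the typed array; no Buffer copy needed
  return utf8Decoder.decode(bytes)
}

// example: a single tab-delimited tabix-style line
const bytes = new TextEncoder().encode('chr20\t60000\t60001\tfeature1\n')
const text = decodeBytes(bytes)
```

Reusing one TextDecoder instance across calls also avoids repeatedly constructing decoder state in a hot loop.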

These are small things but when parsing whole-genome datasets like we are these days it makes a difference
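The timer-based yield from point 3 could look something like this sketch (function name, callback shape, and the 50 ms budget are all illustrative assumptions, not the actual tabix-js implementation):

```typescript
// Hypothetical sketch: instead of pausing after every N parsed lines,
// track an elapsed-time budget and yield to the event loop only when
// the budget is exceeded, keeping the hot loop fast for raw parsing.
async function parseLines(
  lines: string[],
  onLine: (line: string) => void,
  budgetMs = 50, // illustrative default
): Promise<void> {
  let start = Date.now()
  for (const line of lines) {
    onLine(line)
    if (Date.now() - start > budgetMs) {
      // give the browser a chance to paint / handle events, then resume
      await new Promise(resolve => setTimeout(resolve, 0))
      start = Date.now()
    }
  }
}
```

With a time budget, fast machines yield less often and slow machines still keep the tab responsive, which a fixed line count cannot do.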

This is chr20 only data used for testing

Before change: (screenshot from 2020-08-06 12-42-46)

After change: (screenshot from 2020-08-06 12-40-38)

This achieves a 2x performance improvement (12 s before vs 6 s after).

rbuels commented 4 years ago

wow nice. does TextDecoder need to be polyfilled?

codecov[bot] commented 4 years ago

Codecov Report

Merging #117 (0d0b147) into master (007c134) will decrease coverage by 1.25%. The diff coverage is 62.50%.


@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
- Coverage   90.01%   88.76%   -1.26%     
==========================================
  Files           7        7              
  Lines         531      525       -6     
  Branches      147      148       +1     
==========================================
- Hits          478      466      -12     
- Misses         53       59       +6     
Impacted Files            Coverage Δ
src/tabixIndexedFile.ts   87.64% <62.50%> (-3.84%) ↓


cmdcolin commented 4 years ago

TextDecoder is feature-detected now: it is only used if present, otherwise buffer.toString is used as a fallback.
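The feature detection described here could look roughly like this (a sketch assuming a Node.js or browser environment; the variable and function names are illustrative, and tabix-js's actual code may differ):

```typescript
// Hypothetical sketch: prefer the native TextDecoder when the
// environment provides it, and fall back to Buffer#toString otherwise,
// so no polyfill is required.
let decoder: TextDecoder | undefined
if (typeof TextDecoder !== 'undefined') {
  decoder = new TextDecoder('utf8')
}

function bytesToString(bytes: Uint8Array): string {
  return decoder
    ? decoder.decode(bytes)
    : // fallback path for environments without TextDecoder
      Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength).toString('utf8')
}
```

Checking `typeof TextDecoder` once at module load keeps the per-call cost to a single branch.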

cmdcolin commented 3 years ago

Xref #120

This PR is one option to fix the chunkSizeLimit issue: instead of canMergeBlock, it allows a large block size and uses TextDecoder to decode the unzipped data.

rbuels commented 3 years ago

This and canMergeBlock can be used together, yes?

cmdcolin commented 3 years ago

Yep; canMergeBlock would keep the blocks smaller, so smaller blocks would be handled at a time, but TextDecoder makes the whole process faster regardless of block size.

There is also an alternative to this: patching the buffer module in the browser.

This code, linked from an issue in the buffer module, can help: https://github.com/myfreeer/exceljs/commit/5cd3516ba51f9a338950386469266d402d8a8771

cmdcolin commented 3 years ago

Some work on the canMergeBlock approach:

https://github.com/GMOD/bam-js/pull/68 https://github.com/GMOD/tabix-js/pull/121

rbuels commented 3 years ago

The chunk size limit exists to avoid literally hanging or crashing the browser tab. Maybe we just need to bump its default value way higher?

cmdcolin commented 3 years ago

It has, historically, caused far more problems for end users than it has helped them. I think we have refactored the code to a state where it will not cause problems anymore, so I think removal is fine. It should not even produce giant chunks now, because it only merges chunks up to a limit of 5 MB in size.
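The merge-up-to-a-limit behavior mentioned here could be sketched like so (the `Chunk` shape, `mergeChunks` name, and the 5 MB default are illustrative assumptions, not the actual tabix-js/bam-js API):

```typescript
// Hypothetical sketch: merge adjacent byte-range chunks, but only while
// the combined span stays under a size cap, so no single merged chunk
// can grow without bound.
interface Chunk {
  start: number // byte offset of chunk start
  end: number // byte offset of chunk end
}

function mergeChunks(chunks: Chunk[], maxSize = 5 * 1024 * 1024): Chunk[] {
  const merged: Chunk[] = []
  for (const c of chunks) {
    const last = merged[merged.length - 1]
    // merge only if this chunk abuts/overlaps the previous one and the
    // combined span stays under the cap
    if (last && c.start <= last.end && c.end - last.start <= maxSize) {
      last.end = Math.max(last.end, c.end)
    } else {
      merged.push({ ...c })
    }
  }
  return merged
}
```

Capping merged spans this way bounds per-request memory while still reducing the number of fetches for adjacent chunks.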

cmdcolin commented 3 years ago

If it is really needed, we can restore it, but I think it would be ok to merge as is

cmdcolin commented 3 years ago

I just added chunkSizeLimit back with a default of 50 MB.