GMOD / tabix-js

Read tabix-indexed files, with either .tbi or .csi indexes, in Node.js or the browser
MIT License

Use TextDecoder and yield based on time instead of number of lines #117

Closed cmdcolin closed 3 years ago

cmdcolin commented 4 years ago

I was testing some raw tabix parsing performance and noticed a couple of things; the use case was parsing an entire chromosome up front.

Some notes

1) This results in large chunk sizes. We can look into this, but I think chunkSizeLimit should be removed. It is basically an internal condition, but it has been constantly exposed to end users and causes frustration.

2) buffer.toString(), on the browserified Node.js Buffer, results in a ton of "GC pressure" (see screenshot 1), but changing to a TextDecoder on buffer.buffer results in no GC pressure and is much faster. This is on 11 MB of data gzipped, 45 MB unpacked into an in-memory buffer, converted to a string.

3) The yield while parsing also causes slowdowns when raw performance is the goal. I don't know the best answer, but instead of yielding after a set number of features it could be a timer-based yield.
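The TextDecoder change in point 2 can be sketched roughly as follows (the function names here are hypothetical for illustration, not tabix-js's actual internals): decode the raw bytes directly with TextDecoder rather than going through the browserified Buffer's toString().

```typescript
// Hypothetical sketch: decode raw bytes to a string with TextDecoder,
// which avoids the browserified Buffer's toString() and the GC pressure
// it creates on large (tens-of-MB) inputs.
const utf8Decoder = new TextDecoder('utf8')

function decodeBytes(bytes: Uint8Array): string {
  // TextDecoder operates directly on the typed array; no Buffer copy needed
  return utf8Decoder.decode(bytes)
}

// example: a single tab-delimited tabix-style line
const bytes = new TextEncoder().encode('chr20\t60000\t60001\tfeature1\n')
const text = decodeBytes(bytes)
```

Reusing one TextDecoder instance across calls also avoids repeatedly constructing decoder state in a hot loop.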

These are small things but when parsing whole-genome datasets like we are these days it makes a difference
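The timer-based yield from point 3 could look something like this sketch (function name, callback shape, and the 50 ms budget are all illustrative assumptions, not the actual tabix-js implementation):

```typescript
// Hypothetical sketch: instead of pausing after every N parsed lines,
// track an elapsed-time budget and yield to the event loop only when
// the budget is exceeded, keeping the hot loop fast for raw parsing.
async function parseLines(
  lines: string[],
  onLine: (line: string) => void,
  budgetMs = 50, // illustrative default
): Promise<void> {
  let start = Date.now()
  for (const line of lines) {
    onLine(line)
    if (Date.now() - start > budgetMs) {
      // give the browser a chance to paint / handle events, then resume
      await new Promise(resolve => setTimeout(resolve, 0))
      start = Date.now()
    }
  }
}
```

With a time budget, fast machines yield less often and slow machines still keep the tab responsive, which a fixed line count cannot do.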

This is chr20 only data used for testing

Before change: (screenshot from 2020-08-06 12-42-46)

After change: (screenshot from 2020-08-06 12-40-38)

This achieves a 2x performance improvement (12 s before vs 6 s after).

rbuels commented 4 years ago

wow nice. does TextDecoder need to be polyfilled?

codecov[bot] commented 4 years ago

Codecov Report

Merging #117 (0d0b147) into master (007c134) will decrease coverage by 1.25%. The diff coverage is 62.50%.


@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
- Coverage   90.01%   88.76%   -1.26%     
==========================================
  Files           7        7              
  Lines         531      525       -6     
  Branches      147      148       +1     
==========================================
- Hits          478      466      -12     
- Misses         53       59       +6     
Impacted Files            Coverage Δ
src/tabixIndexedFile.ts   87.64% <62.50%> (-3.84%) ↓


cmdcolin commented 4 years ago

TextDecoder is feature-detected now: it is only used if present, otherwise buffer.toString is used as a fallback.
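The feature detection described here could look roughly like this (a sketch assuming a Node.js or browser environment; the variable and function names are illustrative, and tabix-js's actual code may differ):

```typescript
// Hypothetical sketch: prefer the native TextDecoder when the
// environment provides it, and fall back to Buffer#toString otherwise,
// so no polyfill is required.
let decoder: TextDecoder | undefined
if (typeof TextDecoder !== 'undefined') {
  decoder = new TextDecoder('utf8')
}

function bytesToString(bytes: Uint8Array): string {
  return decoder
    ? decoder.decode(bytes)
    : // fallback path for environments without TextDecoder
      Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength).toString('utf8')
}
```

Checking `typeof TextDecoder` once at module load keeps the per-call cost to a single branch.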

cmdcolin commented 3 years ago

Xref #120

This PR is one option to fix the chunkSizeLimit issue: instead of canMergeBlock, it allows a large block size and uses TextDecoder to decode the unzipped data.

rbuels commented 3 years ago

This and canMergeBlock can be used together, yes?

cmdcolin commented 3 years ago

Yep; canMergeBlock would keep the blocks smaller, so smaller blocks would be handled at a time, but TextDecoder makes the whole process faster regardless of block size.

There is also an alternative to this: patching the buffer module in the browser.

This code, linked from an issue in the buffer module, can help: https://github.com/myfreeer/exceljs/commit/5cd3516ba51f9a338950386469266d402d8a8771

cmdcolin commented 3 years ago

Some work on the canMergeBlock approach:

https://github.com/GMOD/bam-js/pull/68 https://github.com/GMOD/tabix-js/pull/121

rbuels commented 3 years ago

The chunk size limit exists to avoid literally hanging or crashing the browser tab. Maybe we just need to bump its default value way higher?

cmdcolin commented 3 years ago

It has, historically, caused far more problems for end users than it has helped them. I think we have refactored the code to a state where it will not cause problems anymore, so I think removal is fine. It should not even produce giant chunks now, because it only merges chunks up to a limit of 5 MB in size.
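The merge-up-to-a-limit behavior mentioned here could be sketched like so (the `Chunk` shape, `mergeChunks` name, and the 5 MB default are illustrative assumptions, not the actual tabix-js/bam-js API):

```typescript
// Hypothetical sketch: merge adjacent byte-range chunks, but only while
// the combined span stays under a size cap, so no single merged chunk
// can grow without bound.
interface Chunk {
  start: number // byte offset of chunk start
  end: number // byte offset of chunk end
}

function mergeChunks(chunks: Chunk[], maxSize = 5 * 1024 * 1024): Chunk[] {
  const merged: Chunk[] = []
  for (const c of chunks) {
    const last = merged[merged.length - 1]
    // merge only if this chunk abuts/overlaps the previous one and the
    // combined span stays under the cap
    if (last && c.start <= last.end && c.end - last.start <= maxSize) {
      last.end = Math.max(last.end, c.end)
    } else {
      merged.push({ ...c })
    }
  }
  return merged
}
```

Capping merged spans this way bounds per-request memory while still reducing the number of fetches for adjacent chunks.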

cmdcolin commented 3 years ago

If it is really needed, we can restore it, but I think it would be ok to merge as is

cmdcolin commented 3 years ago

I just added chunkSizeLimit back with a default of 50 MB.