cmdcolin closed this pull request 3 years ago
wow nice. does TextDecoder need to be polyfilled?
Merging #117 (0d0b147) into master (007c134) will decrease coverage by 1.25%. The diff coverage is 62.50%.
@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
- Coverage   90.01%   88.76%    -1.26%
==========================================
  Files           7        7
  Lines         531      525        -6
  Branches      147      148        +1
==========================================
- Hits          478      466       -12
- Misses         53       59        +6
Impacted Files | Coverage Δ |
---|---|
src/tabixIndexedFile.ts | 87.64% <62.50%> (-3.84%) :arrow_down: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7be09c1...de19707.
TextDecoder is effectively handled now: it is only used if present, otherwise buffer.toString is used as a fallback
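For illustration, a minimal sketch of that fallback, assuming a hypothetical helper name rather than the exact code in this PR:

```ts
// Sketch of the "use TextDecoder if available" fallback; `bufferToString` is
// an illustrative name, not the actual tabix-js function.
function bufferToString(buf: Buffer): string {
  if (typeof TextDecoder !== 'undefined') {
    // TextDecoder accepts any Uint8Array view and avoids the extra work the
    // browserified Buffer toString does
    return new TextDecoder('utf-8').decode(buf)
  }
  return buf.toString('utf8')
}
```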
Xref #120
This PR is one option to fix the issue with chunkSizeLimit: instead of canMergeBlock, it allows a large block size and uses TextDecoder when decoding the unzipped data
This and canMergeBlock can be used together, yes?
Yep, canMergeBlock would make the blocks smaller, so smaller blocks would be handled at a time, but TextDecoder makes the whole process faster regardless of size
There is also an alternative to this: patching the buffer module in the browser.
Some code linked from an issue on the buffer module that could help: https://github.com/myfreeer/exceljs/commit/5cd3516ba51f9a338950386469266d402d8a8771
Some related work on canMergeBlock:
https://github.com/GMOD/bam-js/pull/68
https://github.com/GMOD/tabix-js/pull/121
the chunk size limit exists to avoid literally hanging or crashing the browser tab. maybe we just need to bump its default value way higher?
Historically, it has caused end users far more problems than it has helped. I think we have refactored the code to the point that it will not cause problems anymore, so I think removal is fine. There should not even be giant chunks now, because chunks are only merged up to a limit of 5MB in size
If it is really needed, we can restore it, but I think it would be ok to merge as is
I just added chunkSizeLimit back with a size of 50MB
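For reference, a minimal sketch of what a guard like this might look like; the names, default, and error message are illustrative, not the actual tabix-js implementation:

```ts
// Hypothetical chunkSizeLimit guard; DEFAULT_CHUNK_SIZE_LIMIT and
// checkChunkSize are illustrative names, not tabix-js internals.
const DEFAULT_CHUNK_SIZE_LIMIT = 50_000_000 // ~50MB

function checkChunkSize(chunkSizeBytes: number, limit = DEFAULT_CHUNK_SIZE_LIMIT): void {
  if (chunkSizeBytes > limit) {
    // refuse to fetch/parse a chunk so large it could hang or crash the tab
    throw new Error(
      `Chunk of ${chunkSizeBytes} bytes exceeds chunkSizeLimit of ${limit} bytes`,
    )
  }
}
```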
I was testing some raw tabix parsing performance and saw a couple of things; the use case was parsing an entire chromosome up front
Some notes:

1) This results in large chunk sizes. We can look into this, but I think removing chunkSizeLimit should be done. It is basically an internal condition, yet it has been constantly exposed to end users and causes frustration.
2) buffer.toString(), which is the browserified Node.js Buffer, results in a ton of "GC pressure" (see screenshot 1), but changing to TextDecoder(buffer.buffer) results in no GC pressure and is much faster. This was on 11MB of gzipped data, 45MB unpacked in an in-memory buffer, converted to a string.
3) The yield while parsing also results in slowdowns when we are interested in raw performance. I don't know the best answer, but instead of yielding after a set number of features, it could be a timer-based yield (see the sketch after this list).
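A minimal sketch of the timer-based yield idea from point 3, assuming a caller-provided per-line parser; none of these names come from the tabix-js codebase:

```ts
// Illustrative only: yield to the event loop when a time budget is exceeded,
// rather than after every N features. `parseLine` and the 16ms budget are
// assumptions, not tabix-js internals.
async function parseWithTimeBasedYield(
  lines: string[],
  parseLine: (line: string) => void,
  yieldEveryMs = 16,
): Promise<void> {
  let lastYield = Date.now()
  for (const line of lines) {
    parseLine(line)
    if (Date.now() - lastYield > yieldEveryMs) {
      // let the browser render and handle events, keeping the tab responsive
      await new Promise(resolve => setTimeout(resolve, 0))
      lastYield = Date.now()
    }
  }
}
```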
These are small things, but when parsing whole-genome datasets like we are these days, they make a difference
This is chr20-only data used for testing
Before change:
After change:
This achieves a 2x performance improvement (12s vs 6s)