GMOD / bbi-js

Parser for bigwig and bigbed files
MIT License

Request: iteration function without allocation #28

Open romgrk opened 5 years ago

romgrk commented 5 years ago

Hey,

For performance reasons, would it be possible to implement an API that doesn't allocate an object for each entry? It could look something like this:

async function fillBuffer() {
  const ti = new BigWig({ path: 'volvox.bw' })
  const header = await ti.getHeader()
  const length = header.refsByNumber[0].length
  const buffer = Buffer.alloc(length * 4) // one 32-bit float per base
  await ti.iterate('chr1', 0, length, { scale: 1 }, (start, end, score) => {
    for (let position = start; position < end; position++)
      buffer.writeFloatLE(score, position * 4)
  })
  return buffer
}
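For illustration, the callback shape proposed above could be driven from decoded fixed-step data roughly like this. This is only a sketch: `iterateFixedStep`, `values`, `sectionStart`, and `span` are illustrative names, not bbi-js internals.

```javascript
// Sketch: walk a decoded fixed-step section and hand primitives to a
// callback, never allocating a {start, end, score} object per value.
function iterateFixedStep(values, sectionStart, span, cb) {
  for (let i = 0; i < values.length; i++) {
    const start = sectionStart + i * span
    cb(start, start + span, values[i])
  }
}

// Fill a preallocated buffer with one float per base position.
function fillBuffer(values, sectionStart, span, length) {
  const buffer = Buffer.alloc(length * 4) // 4 bytes per float
  iterateFixedStep(values, sectionStart, span, (start, end, score) => {
    for (let pos = start; pos < end; pos++) {
      buffer.writeFloatLE(score, pos * 4)
    }
  })
  return buffer
}
```

The only allocation here is the caller-owned buffer, which is the point of the "bring your own buffer" pattern.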
cmdcolin commented 5 years ago

I have heard of this technique referred to as "bring your own buffer". It may be possible to do this. Do you have significant evidence of the performance degradation?

I can see you are using scale: 1, so that would probably be intensive across the whole length of a chromosome. You could consider using one of the other reductionLevels to involve less data, which would probably be faster, but if you require the lowest scale then I can see that being resource intensive.
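For background on the reductionLevels suggestion: bigWig files carry precomputed zoom levels, each summarizing a fixed number of bases per bin, and a reader can pick the coarsest level that still meets the requested resolution. A minimal sketch of that selection, with illustrative field names (the real header layout differs):

```javascript
// Sketch: pick the coarsest zoom level whose bin size (bases per bin)
// still satisfies the requested resolution. Returns null when only
// the full-resolution data (scale: 1) is fine-grained enough.
function pickReductionLevel(zoomLevels, basesPerPixel) {
  let best = null
  for (const z of zoomLevels) {
    if (
      z.reductionLevel <= basesPerPixel &&
      (best === null || z.reductionLevel > best.reductionLevel)
    ) {
      best = z
    }
  }
  return best
}
```

At scale: 1 no zoom level qualifies, which is why that case touches every stored value and is the expensive one.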

romgrk commented 5 years ago

Yes, we do indeed require scale: 1. We're converting bigWig files into loompy files for an implementation of the ga4gh-rnaseq API, and we need to fill a buffer with every value. This is for multiple tracks at once, so we're filling lots of buffers with lots of entries. The API also allows returning multiple tracks of the whole bigWig file, so I'm pretty sure that any saved allocation will decrease the memory cost of the process. (Speed is not an issue, though; I meant performance in terms of memory.)
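To make the memory argument concrete: with the current object-returning API, every value becomes a `{start, end, score}` object that exists only long enough to be copied into a packed buffer. A sketch of that intermediate step (the shape of the feature objects matches what an array-returning call would produce; the function name is illustrative):

```javascript
// Sketch: materialize per-entry objects, then copy their scores into a
// packed Float32Array. A callback API would skip the object array
// entirely and write straight into the caller's buffer.
function featuresToFloats(features, length) {
  const out = new Float32Array(length)
  for (const { start, end, score } of features) {
    out.fill(score, start, end) // one float per covered base
  }
  return out
}
```

The packed array costs 4 bytes per base, while each transient object adds tens of bytes of overhead on top of that, multiplied across every track being converted.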

cmdcolin commented 5 years ago

I'll just ask a couple more questions:

  1. Is there anything about this library (bbi-js) that makes it particularly well suited to your app? Do you do these conversions on the fly? For a large data ingestion, I would imagine just converting to bedgraph or regular wig and streaming that into your data warehouse/hdf5/loompy.
  2. Do you have interest in implementing this yourself? I had another similar request here https://github.com/GMOD/generic-filehandle/issues/20 and I'd love to see progress on it, but until it becomes a bottleneck for my use cases (primarily genome browser apps) it's hard for me to push it up the priority queue.
cmdcolin commented 5 years ago

The purpose is compatibility with https://github.com/romgrk/node-loompy, so it keeps everything in the JS ecosystem?

romgrk commented 5 years ago

Yes, we're using that module that we wrote to keep it all in JS.

For your points:

  1. We have tons (not sure how many, but >10,000, maybe >100,000) of bigWig tracks. Those files are provided to us in that format and we need them that way for other purposes, so it's not practical to convert them and keep both formats; we would run out of space.
  2. Sure, I'll try to find some time and open a PR.