biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0
563 stars 105 forks source link

sambamba view filter syntax #216

Closed nh3 closed 6 years ago

nh3 commented 8 years ago

Would it be possible to allow comparison between fields/tags? This would be very useful for things like 'ref_id != mate_ref_id'. Thanks.

lomereiter commented 8 years ago

Hi, I'm not sure if such a general approach is really useful. Do you have any other use cases in mind? This one is already covered by poorly documented chimeric flag (https://github.com/lomereiter/sambamba/issues/185)

nh3 commented 8 years ago

There will be plenty of use cases if simple arithmetics are allowed at the same time, e.g. '[NM] < sequence_length/30', but I guess it might be too involved to implement. The chimeric flag is really useful. Thanks!

lomereiter commented 8 years ago

Alright, I've looked into Duktape today, integrating a JS interpreter is going to be much easier than modifying the current parser.

Minimal working example looks simple enough:

import bio.bam.reader;
import duktape; // extern(C) declarations from duktype.h

import std.stdio;

extern(C) duk_ret_t
read_sequence_length(duk_context ctx) {
  auto read = cast(BamRead*)duk_get_pointer(ctx, 0);
  duk_push_int(ctx, read.sequence_length);
  return 1;
}

void main(string[] args) {
  auto read = new BamReader(args[1]).reads.front;

  duk_context ctx = duk_create_heap_default();
  scope(exit) duk_destroy_heap(ctx);

  // register the function globally
  duk_push_c_function(ctx, &read_sequence_length, 1);
  duk_put_global_string(ctx, "sequenceLength");

  duk_eval_string(ctx, "(function(r) {
       return sequenceLength(r) > 40;
  })");                         // stack: [func]

  duk_push_pointer(ctx, &read); // stack: [func bamread]
  duk_call(ctx, 1);             // stack: [result]

  duk_bool_t result = duk_get_boolean(ctx, -1);
  duk_pop(ctx);

  writeln("filter result:", to!bool(result));
}
nh3 commented 8 years ago

It's amazing that you come up with a solution so quickly! Speed-wise, how do you expect it to compare with the current parser or piping into awk?

lomereiter commented 8 years ago

Actually I thought of using Lua/LuaJIT long ago, but it just feels weird to use it outside gamedev. JS is more familiar but until recently the choice was between monstrous SpiderMonkey/V8.

Speed-wise it should be somewhere in the middle, I'd expect closer to the current parser. Awk introduces pretty large overhead of serializing and then parsing again, including fields that don't even participate.

lomereiter commented 8 years ago

Update: Duktape proved to be too slow, albeit faster than AWK. I may look into LuaJIT later.

Here's another interesting project which uses LLVM JIT: https://github.com/BoutrosLaboratory/bamql http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1162-y