GenomicsDB / GenomicsDB-R

Experimental R bindings to the native GenomicsDB library
GNU General Public License v2.0

Initial dataframe support for query_variant_calls #9

Closed nalinigans closed 4 years ago

nalinigans commented 4 years ago

Added initial support for returning data frames via Rcpp. Example data frame:

  ROW   COL  SAMPLE CHROM   POS   END REF            ALT  DP  GT
1   0 12140 HG00141     1 12141 12295   C    [<NON_REF>]  NA 0/0
2   1 12144 HG01958     1 12145 12277   C    [<NON_REF>]  NA 0/0
3   0 17384 HG00141     1 17385 17385   G [A, <NON_REF>]  NA 0/1
4   1 17384 HG01958     1 17385 17385   G [T, <NON_REF>] 120 1/1
5   2 17384 HG01530     1 17385 17385   G [A, <NON_REF>]  76 0/1
6   0 17384 HG00141     1 17385 17385   G [A, <NON_REF>]  NA 0/1
7   1 17384 HG01958     1 17385 17385   G [T, <NON_REF>] 120 1/1
8   2 17384 HG01530     1 17385 17385   G [A, <NON_REF>]  76 0/1

Most of the columnar vector construction will be pushed down into the C/C++ layer, so that other language bindings can reuse it when building data frames.
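To illustrate what pushing the columnar construction into the C/C++ layer could look like, here is a minimal sketch. It is not the GenomicsDB API: `VariantCall`, `CallColumns`, and `to_columns` are hypothetical names, and the field set simply mirrors the example data frame above. The idea is that the native layer fills one vector per column, and each binding (Rcpp or otherwise) only has to wrap those vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical row-oriented record, one per variant call.
struct VariantCall {
    std::int64_t row;
    std::int64_t col;
    std::string sample;
    std::string chrom;
    std::int64_t pos;
    std::int64_t end;
    std::string ref;
    std::string alt;
    std::optional<int> dp;  // empty optional stands in for NA
    std::string gt;
};

// Hypothetical column-oriented container: one vector per data frame column.
struct CallColumns {
    std::vector<std::int64_t> row, col, pos, end;
    std::vector<std::string> sample, chrom, ref, alt, gt;
    std::vector<std::optional<int>> dp;

    std::size_t size() const { return row.size(); }

    // Append one call, keeping every column the same length.
    void append(const VariantCall& c) {
        row.push_back(c.row);
        col.push_back(c.col);
        sample.push_back(c.sample);
        chrom.push_back(c.chrom);
        pos.push_back(c.pos);
        end.push_back(c.end);
        ref.push_back(c.ref);
        alt.push_back(c.alt);
        dp.push_back(c.dp);
        gt.push_back(c.gt);
    }
};

// Convert row-oriented calls into columns that a binding can wrap.
CallColumns to_columns(const std::vector<VariantCall>& calls) {
    CallColumns out;
    for (const auto& c : calls) {
        out.append(c);
    }
    assert(out.dp.size() == out.size());
    return out;
}
```

On the R side, a binding would then only need to map each vector to an R vector and translate empty `dp` entries to `NA`; that mapping is left out here because it depends on the binding.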

nalinigans commented 4 years ago

Yes, I relegated the old behavior to query_variant_calls_by_interval, and the new version of query_variant_calls does merge all the query intervals. We could add a keep_interval option, defaulting to false, to query_variant_calls. It would either add an interval column to the data frame or create a nested data frame with the query intervals at the top level. This could entirely replace query_variant_calls_by_interval.
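A rough sketch of the first variant of the keep_interval idea (an extra interval column) is below. Everything in it is an assumption made for illustration: `CallSpan`, `QueryColumns`, and `query_calls_sketch` are hypothetical names, intervals are treated as inclusive, and a call that overlaps several query intervals is emitted once per interval.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical inclusive query interval [start, end].
using Interval = std::pair<std::int64_t, std::int64_t>;

// Hypothetical minimal call: only the span matters for this sketch.
struct CallSpan {
    std::int64_t pos;
    std::int64_t end;
};

// Hypothetical result columns. `interval` is present only when
// keep_interval is true; it holds the index of the matching query interval.
struct QueryColumns {
    std::vector<std::int64_t> pos, end;
    std::optional<std::vector<std::size_t>> interval;
};

QueryColumns query_calls_sketch(const std::vector<CallSpan>& calls,
                                const std::vector<Interval>& intervals,
                                bool keep_interval = false) {
    QueryColumns out;
    if (keep_interval) {
        out.interval.emplace();  // create the (initially empty) column
    }
    for (std::size_t i = 0; i < intervals.size(); ++i) {
        for (const auto& c : calls) {
            // Inclusive overlap test between the call span and the interval.
            const bool overlaps =
                c.pos <= intervals[i].second && c.end >= intervals[i].first;
            if (!overlaps) {
                continue;
            }
            out.pos.push_back(c.pos);
            out.end.push_back(c.end);
            if (keep_interval) {
                out.interval->push_back(i);
            }
        }
    }
    // When present, the interval column lines up with the other columns.
    assert(!out.interval || out.interval->size() == out.pos.size());
    return out;
}
```

With keep_interval left at false the result has the merged shape; with it set to true, callers can group rows by the interval column, which is what would let this replace query_variant_calls_by_interval. The nested data frame variant is not sketched here.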