Produce full set of binary files for megastudy

VEuPathDB / lib-eda-subsetting

Provides Java interface to query and provide EDA data and metadata from a database

Apache License 2.0


Produce full set of binary files for megastudy #24

Closed dmgaldi closed 1 year ago

dmgaldi commented 1 year ago

Three issues found:

Incorrect allocation of utf-8 encoded floating point variable values. Fixed by https://github.com/VEuPathDB/lib-eda-subsetting/commit/f4b9bbfca3ea1a5607d473a56652d02212da604a
Binary file metadata not being generated in file dumper, fixed by https://github.com/VEuPathDB/lib-eda-subsetting/commit/ab39a5df322b0dc4345654297e6de116b0162454
Entity entries with multiple ancestors, awaiting @jbrestel to remove the offending studies and reload the megastudy.

bobular commented 1 year ago

Curious how floating point values need utf-8 encoding?

dmgaldi commented 1 year ago

This is an optimization for outputting the variable. For filtering, we use a binary floating point representation that can be easily deserialized into a Java float.

If a client requests a floating point variable as output, the tabular output is encoded as utf-8 strings. It's somewhat expensive to convert a Java float into a string, so we keep the utf-8 string representations pre-computed in another file alongside the binary floating point file.
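A minimal sketch of the two representations described above, for illustration only. The class name, the method names, and the fixed 16-byte zero-padded width for the utf-8 form are all assumptions; the actual file layout used by lib-eda-subsetting may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of lib-eda-subsetting.
class FloatEncodingSketch {
    // Assumed fixed record width for the pre-computed utf-8 form.
    // Float.toString never produces more than 15 characters, so 16 bytes is enough.
    static final int UTF8_WIDTH = 16;

    // Binary form used for filtering: 4 bytes, big-endian IEEE 754.
    static byte[] toBinary(float value) {
        return ByteBuffer.allocate(Float.BYTES).putFloat(value).array();
    }

    // Cheap deserialization back into a Java float.
    static float fromBinary(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getFloat();
    }

    // Pre-computed text form used for tabular output: utf-8 bytes, zero-padded to a fixed width.
    static byte[] toPaddedUtf8(float value) {
        byte[] text = Float.toString(value).getBytes(StandardCharsets.UTF_8);
        if (text.length > UTF8_WIDTH) {
            throw new IllegalArgumentException("value does not fit in " + UTF8_WIDTH + " bytes");
        }
        return Arrays.copyOf(text, UTF8_WIDTH); // copyOf pads the tail with zero bytes
    }

    // Strip the zero padding and decode; no float-to-string conversion happens at read time.
    static String fromPaddedUtf8(byte[] padded) {
        int length = 0;
        while (length < padded.length && padded[length] != 0) {
            length++;
        }
        return new String(padded, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        float value = 1.5f;
        System.out.println(fromBinary(toBinary(value)));
        System.out.println(fromPaddedUtf8(toPaddedUtf8(value)));
    }
}
```

With fixed-width records in both files, the value for row i can be located by offset alone (i * 4 in the binary file, i * 16 in the utf-8 file under these assumptions), which is one reason a fixed allocation size is convenient.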

dmgaldi commented 1 year ago

There's a world where we could have a binary application/octet-stream version of the tabular endpoint so we don't have to worry about utf-8, but that would require all consumers to understand our binary format, whereas right now it's all encapsulated in the subsetting service.

dmgaldi commented 1 year ago

This led to discovery of two other issues:

Binary file metadata not being generated in file dumper, fixed by https://github.com/VEuPathDB/lib-eda-subsetting/commit/ab39a5df322b0dc4345654297e6de116b0162454
Entity entries with multiple ancestors, tracked by https://redmine.apidb.org/issues/47709/?journals=all#note-171428

dmgaldi commented 1 year ago

The issue of entity entries with multiple ancestors is currently awaiting @jbrestel to remove the offending studies and reload the megastudy.

dmgaldi commented 1 year ago

Files are in good shape now on yew.