New Feature: simplify curation of annotation transcripts

roland-ewald commented 6 years ago

Problem

The current strategy of serializing JannovarData makes it hard to manually edit the available transcript data. For example, to improve annotation speed and memory footprint one could imagine to only use a certain subset of the default transcript sets, or to include certain historic (but now out-dated) transcript versions for validation and comparison to previous data sets. One could also imagine mixing transcripts from different sources, e.g. RefSeq and Ensembl, to give a more complete picture (though I am not sure Jannovar can deal with this).

As a nice side effect, this should make integration-testing Jannovar much easier and faster, as it now only requires a text-based representation of the transcript data for a handful of transcripts: these can be converted to a JannovarData object without downloading additional files, and Jannovar's output for variants in these regions should be the same as with more transcripts.

Potential Solution

I suggest to add two new commands to jannovar-cli:

export: reads a given *.ser file and writes a text-based representation of all included data into a file (e.g. JSON, XML, or a set of CSV files).
import: reads the files in the same format as they are produced by export and generates a serialized JannovarData file from that, which can be used for annotation.

I'm currently working on code that does this in an ad-hoc manner and would be happy to contribute a PR for this, as this could be useful in many other settings as well. I am open to suggestions regarding the specific design (file format, how to name the commands, etc).

visze commented 6 years ago

+1

I guess this issue is related to #353

This will also help me in defining good test cases for the serous bug #361

roland-ewald commented 6 years ago

@visze Great! Any preference regarding output format or naming of the commands? I'm seeing different pros and cons for any of the formats that come to mind (XML, JSON, CSV), but so far I would be leaning towards JSON.

pnrobinson commented 6 years ago

agree with JSON

roland-ewald commented 6 years ago

@pnrobinson great; this should be fairly straightforward. I will try to prioritize this as it seems to be helpful for other issues (#361, or testing #372).

holtgrewe commented 5 years ago

FWIW, the SV support will add limiting database building to certain genes/transcripts only.

holtgrewe commented 5 years ago

We'll happily accept a tested PR but I'm flagging this as backlog as the ticket has been dormant.

pnrobinson commented 5 years ago

@julesjacobsen -- is there a protobuf serialization of JannovarData objects? Could that be adapted to this purpose?

julesjacobsen commented 5 years ago

@pnrobinson naturally.... https://github.com/exomiser/Exomiser/blob/master/exomiser-core/src/main/proto/jannovar.proto https://github.com/exomiser/Exomiser/blob/master/exomiser-core/src/main/java/org/monarchinitiative/exomiser/core/genome/jannovar/JannovarProtoConverter.java

It doesn't compress as well, but its about 2x faster to read. This will also serialise to/from JSON using the protobuf lib.

pnrobinson commented 5 years ago

@roland-ewald @holtgrewe thoughts on adding protobuf serialization to Jannovar (which get's us JSON nearly for free).

roland-ewald commented 5 years ago

@pnrobinson @holtgrewe I think using protobuf here is a great idea.

JSON support will make integration testing (and, if necessary, versioning transcript sub-sets) much nicer, but does not necessarily have to be very fast IMHO.

holtgrewe commented 5 years ago

Hm, I think that we're facing XKCD 927 here... I'd rather see a serialization to a minimal GFF or GTF sub set here than yet another standard. It's tempting to base it off what TranscriptModel looks like right now but in terms of portability it is not a big improvement over native serialization.

I would propose the following

Allow to import only certain genes or transcripts from any supporte transcript database (available in develop).
Use FST for serialization and deserialization to save some startup time.
Write an exporter for RefSeq and ENSEMBL and adjust the "download" tool such that you can pass a directory with a data set descriptor INI file and the necessary GTF/GFF files.

This gives us

easy to create test data sets for which visualisation already exists in popular genome browsers
faster serialization/deserialization
using community standard file formats
possibility to curate data sets

... and all without the need for new parsers/data importers etc.

FWIW, I would probably be able to implement (2) for the upcoming v0.18 release but (3) is currently out of reach for me. I'd be happy to review tested PRs and support the development with discussions.

roland-ewald commented 5 years ago

That's a valid point. Your suggestions would solve the problem as well. I'm not sure (2) is necessarily part of this problem (although a faster deserialization is nice). Point (3) may not even need to include adjustments to the download tool (the CLI). From my perspective, an easy way of using non-serialized data (be it in GFF or JSON) seems to be more relevant in use cases that interact with the API directly (e.g. for testing).

julesjacobsen commented 5 years ago

Can we put the faster de/serialisation discussion on hold for the time being. I'm happy with the situation as it is. Looking through the issues in FST there could be a couple of showstoppers, for instance there appears to be some odd behaviour using G1GC which is standard for Java 9+ The library looks great in principle, and is wonderfully simple to use (unlike protobuf), but I'm worried about possible incompatibilities.

I used protobuf because it is a fast serialisable data structure with excellent language bindings. I simply replicated the Jannovar object model to be able to store it on and read off disk fast. I was just being pragmatic, not trying to create a new data standard! However, using protobuf would mean that you could easily re-use the data in a Python application. The same could be said for any plain-text format, but you lose the speed and schema-based validation.

holtgrewe commented 5 years ago

@roland-ewald any news on this?

roland-ewald commented 5 years ago

@holtgrewe as there's some disagreement on the way to implement this and it's not really pressing, I suggest we close the issue for now (if you are OK with that).

charite / jannovar

New Feature: simplify curation of annotation transcripts #374

Problem

Potential Solution