GenomicsDBImporter API simplification

kgururaj commented 6 years ago

Cleaned up Protobuf structures to support the following:

A GenomicsDB column can be an int64 (TileDB column) or contig:position
A GenomicsDB column interval can be <int64, int64> or contig:begin-end

Removed unnecessary PB structure GATK4Integration. Moved parameters lb and ub row idx to ImportConfiguration (similar to the loader JSON). Other parameters are irrelevant for the core library - hence, moved to ParallelConfig.java

FIXME:

ChromosomeInterval and ContigInterval (PB structure) are duplicates - remove ChromosomeInterval completely

kgururaj commented 6 years ago

One fix needed - the order in which the array directories are traversed in the reader is undefined currently. Let's say there is a workspace with 3 directories: chr1#1#100, chr1#2#200 and chr2#1#100.

All directories for a single chromosome must be traversed together. So, you can traverse both the chr1 directories first or chr2 followed by chr1 directories, but you cannot interleave them.
For a single chromosome, directories must be traversed in increasing order of position.

cmnbroad commented 6 years ago

@kgururaj In looking at this PR I noticed that # is used as a separator to generate and find file/directory names based on contig/start/stop interval components, and there is code that splits interval strings using : and - to find the components. These are all valid characters in contig names, and the hg38 reference in particular has quite a few contig names that contain both : and '-'. It looks like this code already exists independent of this PR. Do you want me to create a separate ticket for this issue ? It think there would be problems using such contigs with GenomicsDB.

francares commented 6 years ago

@cmnbroad @kgururaj Those separators (":", "-") are only used in the command line configuration for testing/example purposes. I think, changes can be made but we need to identify the best way to represent chromosome intervals in this situation. I'm not sure if "#" is the best way to go here.

kgururaj commented 6 years ago

@cmnbroad - what Francisco said

Is there a contig name with '#' in it? I checked the HG38 (Homo_sapiens_assembly38.fasta), HG19 and GrCH37 references before deciding to use '#' as the separator.

cmnbroad commented 6 years ago

@kgururaj I don't think the # character appears in any of the references you mentioned. But its allowed in contig names in general, and there apparently are references that do use it.

Also, just so I understand, @francares you're saying that the only time you parse interval strings of the form contig:start-end is in test code ? Just checking because that can definitely fail on hg38 contigs if you just use split on them - we had to make a bunch of changes to GATK to handle that.

kgururaj commented 6 years ago

Ok, so I don't see a '$' sign in that list - we can use that.

@cmnbroad - yep, the ':' and '-' delimiters are only used in internal test cases where we know for sure that the contig names do not have anything weird going on.

We should probably add a check for the delimiter in contig names in GenomicsDB and throw an exception.

kgururaj commented 6 years ago

Hopefully, final edit - when chromosome, begin and end are passed as arguments (query() function in FeatureReader), we must fixup the begin and end values based on the partition bounds. Example: Assume we have 2 partitions: 1#100#200, 1#201#300 and the query interval is begin=50, end=350. The first array should get query value (in feature iterator), chr=1, begin=max(50, 100), end=min(350, 200). The second array chr=1, begin=max(50, 201), end=min(350, 300)

francares commented 6 years ago

I made the change for array folder delimiter from '#' to '$'.

Intel-HLS / GenomicsDB

GenomicsDBImporter API simplification #181