Closed kgururaj closed 6 years ago
One fix needed - the order in which the array directories are traversed in the reader is undefined currently. Let's say there is a workspace with 3 directories: chr1#1#100, chr1#2#200 and chr2#1#100.
@kgururaj In looking at this PR I noticed that #
is used as a separator to generate and find file/directory names based on contig/start/stop interval components, and there is code that splits interval strings using :
and -
to find the components. These are all valid characters in contig names, and the hg38 reference in particular has quite a few contig names that contain both :
and '-'. It looks like this code already exists independent of this PR. Do you want me to create a separate ticket for this issue ? It think there would be problems using such contigs with GenomicsDB.
@cmnbroad @kgururaj Those separators (":", "-") are only used in the command line configuration for testing/example purposes. I think, changes can be made but we need to identify the best way to represent chromosome intervals in this situation. I'm not sure if "#" is the best way to go here.
@cmnbroad - what Francisco said
Is there a contig name with '#' in it? I checked the HG38 (Homo_sapiens_assembly38.fasta), HG19 and GrCH37 references before deciding to use '#' as the separator.
@kgururaj I don't think the #
character appears in any of the references you mentioned. But its allowed in contig names in general, and there apparently are references that do use it.
Also, just so I understand, @francares you're saying that the only time you parse interval strings of the form contig:start-end
is in test code ? Just checking because that can definitely fail on hg38 contigs if you just use split on them - we had to make a bunch of changes to GATK to handle that.
Ok, so I don't see a '$' sign in that list - we can use that.
@cmnbroad - yep, the ':' and '-' delimiters are only used in internal test cases where we know for sure that the contig names do not have anything weird going on.
We should probably add a check for the delimiter in contig names in GenomicsDB and throw an exception.
Hopefully, final edit - when chromosome, begin and end are passed as arguments (query() function in FeatureReader), we must fixup the begin and end values based on the partition bounds. Example: Assume we have 2 partitions: 1#100#200, 1#201#300 and the query interval is begin=50, end=350. The first array should get query value (in feature iterator), chr=1, begin=max(50, 100), end=min(350, 200). The second array chr=1, begin=max(50, 201), end=min(350, 300)
I made the change for array folder delimiter from '#' to '$'.
Cleaned up Protobuf structures to support the following:
Removed unnecessary PB structure GATK4Integration. Moved parameters lb and ub row idx to ImportConfiguration (similar to the loader JSON). Other parameters are irrelevant for the core library - hence, moved to ParallelConfig.java
FIXME: