Closed DimitrisStaratzis closed 2 months ago
This pull request has been linked to Shortcut Story #38682: Wrap set_vcf_reader_bed_array() in TileDB-VCF-Spark.
After a closer look, some work is needed to enable the bed array with the new partition method. More specifically, I see that the new partition method relies on this method:
List<List<String>> computeRegionPartitionsFromBedFile(int desiredNumRangePartitions) {
  Optional<URI> bedURI = options.getBedURI();
  log.info("Init VCFReader for partition calculation");
  String uriString = options.getDatasetURI().get().toString();
  Optional<String> credentialsCsv =
      options
          .getCredentialsProvider()
          .map(CredentialProviderUtils::buildConfigMap)
          .flatMap(VCFDataSourceOptions::getConfigCSV);
  Optional<String> configCsv =
      VCFDataSourceOptions.combineCsvOptions(options.getConfigCSV(), credentialsCsv);
  String[] samples = new String[] {};
  VCFReader vcfReader = new VCFReader(uriString, samples, options.getSampleURI(), configCsv);
  VCFBedFile bedFile = new VCFBedFile(vcfReader, bedURI.get().toString());
  Map<String, List<String>> mapOfRegions = bedFile.getContigRegionStrings();
  List<List<String>> res = new LinkedList<>(mapOfRegions.values());
  // Sort the region lists by number of regions per contig, largest first
  res.sort(Comparator.comparingInt(List<String>::size).reversed());
  // Keep splitting the largest region list until we have the desired minimum number of range
  // partitions; we stop once the largest list has fewer than 10 regions
  while (res.size() < desiredNumRangePartitions && res.get(0).size() >= 10) {
    List<String> top = res.remove(0);
    List<String> first = new LinkedList<>(top.subList(0, top.size() / 2));
    List<String> second = new LinkedList<>(top.subList(top.size() / 2, top.size()));
    res.add(first);
    res.add(second);
    // Re-sort the region lists by number of regions, largest first
    res.sort(Comparator.comparingInt(List::size));
    Collections.reverse(res);
  }
  bedFile.close();
  vcfReader.close();
  return res;
}
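To make the splitting heuristic easier to reason about in isolation, here is a minimal sketch that applies the same halving loop to a plain map, with no VCFReader or VCFBedFile involved. The class name RegionPartitionSketch and the method partition are hypothetical and not part of TileDB-VCF. The sketch also adds an empty-list guard that the method above does not have, and ties between equally sized lists may come out in a different order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of TileDB-VCF: the halving heuristic on plain collections.
public class RegionPartitionSketch {

  public static List<List<String>> partition(
      Map<String, List<String>> regionsByContig, int desiredNumRangePartitions) {
    // Start with one partition per contig (copied so the input is left untouched).
    List<List<String>> res = new ArrayList<>();
    for (List<String> regions : regionsByContig.values()) {
      res.add(new ArrayList<>(regions));
    }

    Comparator<List<String>> largestFirst =
        Comparator.comparingInt((List<String> l) -> l.size()).reversed();
    res.sort(largestFirst);

    // Halve the largest partition until there are enough partitions,
    // or until the largest one has fewer than 10 regions.
    // The isEmpty() check is an addition so an empty input does not throw.
    while (!res.isEmpty()
        && res.size() < desiredNumRangePartitions
        && res.get(0).size() >= 10) {
      List<String> top = res.remove(0);
      int mid = top.size() / 2;
      res.add(new ArrayList<>(top.subList(0, mid)));
      res.add(new ArrayList<>(top.subList(mid, top.size())));
      res.sort(largestFirst);
    }
    return res;
  }
}
```

For example, with one contig of 20 regions, another of 3 regions, and a desired count of 4, this yields partitions of sizes 10, 5, 5 and 3. Asking for 10 partitions on the same input stops at 5 partitions, because no remaining list has at least 10 regions.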
So I need to implement
List<List<String>> computeRegionPartitionsFromBedArray(int desiredNumRangePartitions) {}
Basically, what I need to do here is replace bedFile.getContigRegionStrings() with something like bedArray.getContigRegionStrings().
George had provided this example in C++, which I will use as a guide to do the same in Java: https://github.com/TileDB-Inc/TileDB-VCF/blob/5bcc79b07935ac540c56bf6ed9ee0f5d60bf247e/libtiledbvcf/src/vcf/region.cc#L141-L190
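As a rough illustration of what a bed-array counterpart of getContigRegionStrings() would have to produce, the sketch below groups already-read rows by contig and formats each one as a region string. Everything in it is an assumption: BedRow and BedArrayRegionsSketch are hypothetical names, the rows are taken as given rather than read from a TileDB array, and the "contig:start-end" format passes coordinates through unchanged, without any 0-based or 1-based conversion. It is not a port of the C++ code linked above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of TileDB-VCF.
public class BedArrayRegionsSketch {

  /** Hypothetical row as it might be read from a bed array. */
  public static final class BedRow {
    public final String contig;
    public final long start;
    public final long end;

    public BedRow(String contig, long start, long end) {
      this.contig = contig;
      this.start = start;
      this.end = end;
    }
  }

  /** Groups rows by contig (first-seen order) and formats them as "contig:start-end". */
  public static Map<String, List<String>> contigRegionStrings(List<BedRow> rows) {
    Map<String, List<BedRow>> byContig = new LinkedHashMap<>();
    for (BedRow r : rows) {
      byContig.computeIfAbsent(r.contig, k -> new ArrayList<>()).add(r);
    }

    Map<String, List<String>> out = new LinkedHashMap<>();
    for (Map.Entry<String, List<BedRow>> e : byContig.entrySet()) {
      List<BedRow> contigRows = e.getValue();
      // Order the regions of each contig by start position.
      contigRows.sort(Comparator.comparingLong((BedRow r) -> r.start));
      List<String> regions = new ArrayList<>();
      for (BedRow r : contigRows) {
        // Assumed region format; coordinates are passed through as-is.
        regions.add(r.contig + ":" + r.start + "-" + r.end);
      }
      out.put(e.getKey(), regions);
    }
    return out;
  }
}
```

A map of this shape could then feed the same splitting loop used by computeRegionPartitionsFromBedFile, so only the source of the regions would change.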
CI is green!
This PR follows #646 [sc-38682] and enables setting a TileDB bed array in both Spark2 and Spark3. New tests have been added to verify the new feature.