TileDB-Inc / TileDB-VCF

Efficient variant-call data storage and retrieval library using the TileDB storage library.
https://tiledb-inc.github.io/TileDB-VCF/
MIT License

[Spark] Add option to use a bed Array #648

Closed DimitrisStaratzis closed 2 months ago

DimitrisStaratzis commented 6 months ago

This PR follows #646 [sc-38682] and enables setting a TileDB BED array in both Spark2 and Spark3. New tests have been added to verify the feature.

shortcut-integration[bot] commented 6 months ago

This pull request has been linked to Shortcut Story #38682: Wrap set_vcf_reader_bed_array() in TileDB-VCF-Spark.

DimitrisStaratzis commented 5 months ago

After a closer look, some work is still needed to enable the BED array with the new partition method. Specifically, the new partition method relies on the following:

List<List<String>> computeRegionPartitionsFromBedFile(int desiredNumRangePartitions) {
  Optional<URI> bedURI = options.getBedURI();

  log.info("Init VCFReader for partition calculation");
  String uriString = options.getDatasetURI().get().toString();

  Optional<String> credentialsCsv =
      options
          .getCredentialsProvider()
          .map(CredentialProviderUtils::buildConfigMap)
          .flatMap(VCFDataSourceOptions::getConfigCSV);

  Optional<String> configCsv =
      VCFDataSourceOptions.combineCsvOptions(options.getConfigCSV(), credentialsCsv);

  String[] samples = new String[] {};
  VCFReader vcfReader = new VCFReader(uriString, samples, options.getSampleURI(), configCsv);

  VCFBedFile bedFile = new VCFBedFile(vcfReader, bedURI.get().toString());

  Map<String, List<String>> mapOfRegions = bedFile.getContigRegionStrings();
  List<List<String>> res = new LinkedList<>(mapOfRegions.values());

  // Sort the region list by size of regions in contig, largest first
  res.sort(Comparator.comparingInt(List<String>::size).reversed());

  // Keep splitting the largest region list until we have the desired minimum number of range
  // partitions; we stop once the largest list has fewer than 10 regions
  while (res.size() < desiredNumRangePartitions && res.get(0).size() >= 10) {

    List<String> top = res.remove(0);

    List<String> first = new LinkedList<>(top.subList(0, top.size() / 2));
    List<String> second = new LinkedList<>(top.subList(top.size() / 2, top.size()));
    res.add(first);
    res.add(second);

    // Sort the region list by size of regions in contig
    res.sort(Comparator.comparingInt(List::size));
    Collections.reverse(res);
  }

  bedFile.close();
  vcfReader.close();

  return res;
}
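
To make the splitting step easier to follow in isolation, here is a minimal sketch of the same loop with the VCFReader and VCFBedFile dependencies removed. The class name RegionPartitioner and the method name partition are hypothetical and not part of TileDB-VCF. The sketch also uses ArrayList instead of LinkedList and adds an empty-input guard that the method above does not have.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of TileDB-VCF: isolates the splitting loop.
final class RegionPartitioner {

  /** Halves the largest region list until enough partitions exist. */
  static List<List<String>> partition(
      Collection<List<String>> regionsPerContig, int desiredNumRangePartitions) {
    // Copy the input so the caller's lists are never modified.
    List<List<String>> res = new ArrayList<>();
    for (List<String> regions : regionsPerContig) {
      res.add(new ArrayList<>(regions));
    }

    // Largest list first, so index 0 is always the next split candidate.
    Comparator<List<String>> largestFirst =
        Comparator.comparingInt((List<String> l) -> l.size()).reversed();
    res.sort(largestFirst);

    // The isEmpty() check is an addition in this sketch; it avoids calling
    // res.get(0) when there are no regions at all.
    while (!res.isEmpty()
        && res.size() < desiredNumRangePartitions
        && res.get(0).size() >= 10) {
      List<String> top = res.remove(0);
      int mid = top.size() / 2;
      res.add(new ArrayList<>(top.subList(0, mid)));
      res.add(new ArrayList<>(top.subList(mid, top.size())));
      res.sort(largestFirst);
    }
    return res;
  }

  public static void main(String[] args) {
    // Made-up input: 40 regions on chr1 and 5 regions on chr2.
    List<String> chr1 = new ArrayList<>();
    for (int i = 0; i < 40; i++) {
      chr1.add("chr1:" + (i * 100 + 1) + "-" + (i * 100 + 100));
    }
    List<String> chr2 = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
      chr2.add("chr2:" + (i * 100 + 1) + "-" + (i * 100 + 100));
    }

    for (List<String> p : partition(Arrays.asList(chr1, chr2), 4)) {
      System.out.println(p.size() + " regions");
    }
  }
}
```

With 40 regions on one contig, 5 on another, and 4 desired partitions, the loop splits 40 into 20 + 20, then one 20 into 10 + 10, and stops because four partitions exist. Note that the result is not balanced; the loop only guarantees a minimum partition count when the lists are large enough to split.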

So I need to implement:

List<List<String>> computeRegionPartitionsFromBedArray(int desiredNumRangePartitions) {}

Basically, what I need to do here is replace bedFile.getContigRegionStrings() with something like bedArray.getContigRegionStrings().

George provided this C++ example, which I will use as a guide to do the same in Java: https://github.com/TileDB-Inc/TileDB-VCF/blob/5bcc79b07935ac540c56bf6ed9ee0f5d60bf247e/libtiledbvcf/src/vcf/region.cc#L141-L190
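
As a rough illustration of what a bedArray.getContigRegionStrings() equivalent might return, the sketch below groups in-memory BED records by contig into region strings. Everything in it is an assumption: BedArrayRegions and BedRecord are hypothetical names, the records stand in for rows that would really be read from the TileDB BED array, and the "contig:start-end" format passes coordinates through unchanged. The actual coordinate convention should be checked against the linked C++ code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the TileDB-VCF API.
final class BedArrayRegions {

  /** Hypothetical row of a BED array: contig plus start and end coordinates. */
  static final class BedRecord {
    final String contig;
    final long start;
    final long end;

    BedRecord(String contig, long start, long end) {
      this.contig = contig;
      this.start = start;
      this.end = end;
    }
  }

  /** Groups records by contig, keeping contigs in first-seen order. */
  static Map<String, List<String>> contigRegionStrings(List<BedRecord> records) {
    Map<String, List<String>> byContig = new LinkedHashMap<>();
    for (BedRecord r : records) {
      // Assumed region format; no coordinate conversion is applied here.
      byContig
          .computeIfAbsent(r.contig, k -> new ArrayList<>())
          .add(r.contig + ":" + r.start + "-" + r.end);
    }
    return byContig;
  }

  public static void main(String[] args) {
    // Made-up rows for illustration only.
    List<BedRecord> rows =
        Arrays.asList(
            new BedRecord("chr1", 100, 200),
            new BedRecord("chr2", 50, 150),
            new BedRecord("chr1", 300, 400));
    System.out.println(contigRegionStrings(rows));
  }
}
```

The values of the returned map have the same shape as mapOfRegions.values() in computeRegionPartitionsFromBedFile, so the existing sort-and-split loop could in principle be reused unchanged once the map is built from the array.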

DimitrisStaratzis commented 2 months ago

CI is green!