fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
308 stars 67 forks source link

Scala API for accessing FASTA reference sequences #579

Open tfenne opened 4 years ago

tfenne commented 4 years ago

Anywhere in fgbio we need to access FASTA for reference genomes or similar we end up using HTSJDK's ReferenceSequenceFile and related classes (ReferenceSequenceFileFactory, ReferenceSequence). These classes get the job done, and offer a fair amount of functionality, but they are old and their design is a little clunky, especially coming from scala.

We have previously generated scala APIs for concepts in HTSJDK for both SAM/BAM and for VCF. We'd now like to do that for reference sequences. I'm 99.9% sure the right way to do this is to wrap the HTSJDK classes and try to never expose them in the scala API (similar to the VCF wrapper, and less like the SAM/BAM wrapper).

Some things that would be nice in the design:

It may also be that it's time for us to think about having a scala implementation/wrapper of SequenceDictionary as that's often used heavily in conjunction with reference sequences, and is also super clunky.

FWIW I also have code kicking around somewhere for various activities that might be useful to think about and/or pull into fgbio as use cases for this, including:

I would suggest starting this ticket by:

@nh13 Anything to add?

sstadick commented 4 years ago

Analysis

Overview of existing implementations:

HTSJDK

rust-bio

pyfaidx

bio-julia

How HTSJDK implementations used in fgbio

Usage of HTSJDK impl of ReferenceSequenceFile in the wild

Proposed API

After reviewing the above usages and implementations, a wrapper similar to pyfaidx / bio-juila seems most usable.

Examples

val input: PathToFasta = ....
val refSeq = NeReferenceSequeceFile(input) // Support the same overloading as Factory method, also add something like a 'build index if index does not exist'
val chr1 = refSeq("chr1") // if not indexed, fail
val names = refSeq.keys
for (name, seq <- refSeq) println(s">$name \n$seq") // works even if not indexed, just make an iterator like `ReferenceSequenceIterator`
chr1.slice(40, 50)
chr1.hardMask(_.isLower)
chr1.hardMask(5) // hard mask from 5 to end
chr1.softMask(0, 5) // soft mask from 0 to 5

This hides the SequenceDictionary in the NewReferenceSequenceFile and provides a unified API for indexed, unindexed, and range extraction. A NewReferenceSequence will also be wrapped to act like a string, with stringOps methods, as well as a few specific methods like description, etc.

As to clarity around when you are accessing metadata vs pulling something into memory, I'm assuming this is referring to metadata available in the index file that is 'free'? I would propose the NewReferenceSequence have a second apply method to indicate you want a lazy record ex refSeq("chr1", true) that would return only metadata without actually seeking the record. Are there good examples of where this might be used?

tfenne commented 4 years ago

Still thinking about this in general ... but I think it would be nice if the masking options and any other mutational operation were passed in the factory calls. If we do that then everything returned from that point on can be immutable, and be totally safe to share across threads and use in parallel collection methods. V.s. the HTSJDK model is to return mutable Array[Byte] which you have to mess with if you want e.g. everything upper-case. Which makes it dangerous to share across threads.

tfenne commented 4 years ago

A few other thoughts in no particular order:

  1. It will likely be useful to have val chr1 = refSeq(0) also that looks up by sequence index (in addition to by name)
  2. ReferenceSequenceFile in HTSJDK, as well as out internal SAM and VCF APIs all use 1-based inclusive coordinates. We should do likewise here and that might impact the method names we use for sub-sequence access
  3. Do we want a chr1.force or similar that would drag all of chr1 into memory? E.g. I might write code and know I'm going to access enough of a chromosome to make it worth loading into memory for sub-sequence access
  4. I make frequency use of the ability to load the .dict or sequence dictionary that is by convention alongside the fasta in many cases. It contains some redundant info with the .fai and some unique information. We could either have an option to find and load it given a reference, or always load it if it's there and include the information in the NewReferenceSequence object(s)
  5. One thing that is in the dict that would be nice to have good use of is a list of aliases (see example below). It would be nice a) to be able to do refSeq("MT") and get back chrM and b) do refSeq("chrM").names or similar and get back Seq("chrM", "MT", "J01415.2", "NC_012920.1")
@SQ SN:chrM LN:16569    M5:c68f52674c9fb33aef52dcf399755519 AS:hg38 SP:Homo sapiens UR:http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz    AN:MT,J01415.2,NC_012920.1
@SQ SN:chr1 LN:248956422    M5:2648ae1bacce4ec4b6cf337dcae37816 AS:hg38 SP:Homo sapiens UR:http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz    AN:1,CM000663.2,NC_000001.11
@SQ SN:chr2 LN:242193529    M5:4bb4f82880a14111eb7327169ffb729b AS:hg38 SP:Homo sapiens UR:http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz    AN:2,CM000664.2,NC_000002.12
sstadick commented 4 years ago

Excellent points. I think I'm on board with all of them:

  1. The only thing I don't love about this is that presumably every time I ask for refSeq("chr1") the transform would run. But maybe this isn't that common of a usage pattern, and if it's called out it can be avoided.
  2. Yes, great add
  3. .subseq instead of .slice
  4. I feel like there are two conflicting use cases: when no index is present it should just iterate over over things and give you the sequence. When an index is present, you should be able to ask for a readAt to avoid pulling everything at once. .force get's tricky because then the NewReferenceSequence would have to be mutable. So maybe there should be anther way of creating the NewReferenceSequence.
    val chr1SubSeq = refSeq("chr1", 30, 1000)
    val chr1UpTo100 = refSeq("chr1", Some(1), Some(100))
    val chr1From100ToEnd = refSeq("chr1", Some(100))
    val chr = refSeq("chr1") // gets all of chr1, like .force

mock implementation:


class RefFile() {

  val map: Map[String, String] = Map(
    "chr1" -> "ACTGaCTGAGCTaGA",
    "chr2" -> "GGGGGGGGGGGGGGG"
  )

  def apply(name: String): String = map(name)
  def apply(name: String, start: Int) = map(name).substring(start)
  def apply(name: String, start: Int, stop: Int) = map(name).substring(start, stop)
  def apply(index: Int): String = map(map.keys.toList(index))
  def apply(index: Int, start: Int) = map(map.keys.toList(index)).substring(start)
  def apply(index: Int, start: Int, stop: Int) = map(map.keys.toList(index)).substring(start, stop)
}

object RefFile {
  def apply(): RefFile = new RefFile()
}

object Main extends App {
  val refFile = RefFile()
  println(refFile("chr1", 5)) // CTGAGCTaGA
  println(refFile("chr2")) // GGGGGGGGGGGGGGG
  println(refFile("chr1", 5, 10)) // CTGAG
  println(refFile(0, 5, 10)) // CTGAG
}
  1. Yes, I like the idea of doing it with an option / explicitly, but am open to making it more magic
  2. Yes again, totally agree
mjhipp commented 3 years ago

Sequence Dictionary converters should not be used in loops

This is not really a bug, so I did not want to open a new issue, more of a warning.

We recently updated to the new Fgbio, which changed some method signatures to return the new fasta.SequenceDictionary in place of the previous SAMSequenceDictionary. To quickly make our code compatible with the new update, we simply threw a patch of .asSam over all of the previous uses and thought nothing of it. Later, we noticed our integration tests were taking much longer than before. It turned out a single step that previously took a few seconds was now taking over 45 minutes. After some JVM profiling, we traced it back to the .asSam conversions that were used inside a loop.

I wrote a small script that illustrates the differences when simply accessing a dictionary in a loop. I used the hs38DH reference sequence for this illustration.

Script used ```scala @clp( description = """ |Loop that accesses a Dict for benchmarking. """, group = ClpGroups.VcfOrBcf ) class DictLoop( @arg(flag = 'n', doc = "Number of loop iterations.") val count: Int, @arg(flag = 'r', doc = "Path to the reference fasta file.") val ref: PathToFasta, ) extends FgBioTool with LazyLogging { override def execute(): Unit = { val refSeq = ReferenceSequenceFileFactory.getReferenceSequenceFile(ref) val samDict = refSeq.getSequenceDictionary val fastaDict = samDict.fromSam refSeq.safelyClose() logger.info("-----------------------------------------------------------------------------") logger.info(s"Accessing SAMSequenceDictionary $count times WITHOUT conversion step.") var start = System.nanoTime() forloop(from = 0, until = count) { _ => samDict.getSequence(0) } var end = System.nanoTime() var elapsed = (end - start) * 1.0 / 1000 / 1000 logger.info(s"Elapsed time: $elapsed ms.") logger.info("-----------------------------------------------------------------------------") logger.info(s"Accessing fasta.SequenceDictionary $count times WITHOUT conversion step.") start = System.nanoTime() forloop(from = 0, until = count) { _ => fastaDict(0) } end = System.nanoTime() elapsed = (end - start) * 1.0 / 1000 / 1000 logger.info(s"Elapsed time: $elapsed ms.") logger.info("-----------------------------------------------------------------------------") logger.info(s"Accessing SAMSequenceDictionary $count times with asSam conversion step.") start = System.nanoTime() forloop(from = 0, until = count) { _ => fastaDict.asSam.getSequence(0) } end = System.nanoTime() elapsed = (end - start) * 1.0 / 1000 / 1000 logger.info(s"Elapsed time: $elapsed ms.") logger.info("-----------------------------------------------------------------------------") logger.info(s"Accessing fasta.SequenceDictionary $count times with fromSam conversion step.") start = System.nanoTime() forloop(from = 0, until = count) { _ => samDict.fromSam(0) } end = System.nanoTime() elapsed = (end - start) * 1.0 / 1000 / 1000 logger.info(s"Elapsed time: $elapsed ms.") logger.info("-----------------------------------------------------------------------------") } } ```

Results:


[2021/01/04 11:51:20 | DictLoop | Info] Accessing SAMSequenceDictionary 100 times WITHOUT conversion step.
[2021/01/04 11:51:20 | DictLoop | Info] Elapsed time: 1.032039 ms.
[2021/01/04 11:51:20 | DictLoop | Info] --------------
[2021/01/04 11:51:20 | DictLoop | Info] Accessing fasta.SequenceDictionary 100 times WITHOUT conversion step.
[2021/01/04 11:51:20 | DictLoop | Info] Elapsed time: 0.300041 ms.
[2021/01/04 11:51:20 | DictLoop | Info] --------------
[2021/01/04 11:51:20 | DictLoop | Info] Accessing SAMSequenceDictionary 100 times with asSam conversion step.
[2021/01/04 11:51:22 | DictLoop | Info] Elapsed time: 1682.434943 ms.
[2021/01/04 11:51:22 | DictLoop | Info] --------------
[2021/01/04 11:51:22 | DictLoop | Info] Accessing fasta.SequenceDictionary 100 times with fromSam conversion step.
[2021/01/04 11:51:23 | DictLoop | Info] Elapsed time: 1038.450767 ms.

[2021/01/04 11:52:11 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:52:11 | DictLoop | Info] Accessing SAMSequenceDictionary 100 times WITHOUT conversion step.
[2021/01/04 11:52:11 | DictLoop | Info] Elapsed time: 1.0727090000000001 ms.
[2021/01/04 11:52:11 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:52:11 | DictLoop | Info] Accessing fasta.SequenceDictionary 100 times WITHOUT conversion step.
[2021/01/04 11:52:11 | DictLoop | Info] Elapsed time: 0.304225 ms.
[2021/01/04 11:52:11 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:52:11 | DictLoop | Info] Accessing SAMSequenceDictionary 100 times with asSam conversion step.
[2021/01/04 11:52:12 | DictLoop | Info] Elapsed time: 1709.4891499999999 ms.
[2021/01/04 11:52:12 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:52:12 | DictLoop | Info] Accessing fasta.SequenceDictionary 100 times with fromSam conversion step.
[2021/01/04 11:52:13 | DictLoop | Info] Elapsed time: 1073.656193 ms.
[2021/01/04 11:52:13 | DictLoop | Info] -----------------------------------------------------------------------------

[2021/01/04 11:52:59 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:52:59 | DictLoop | Info] Accessing SAMSequenceDictionary 1000 times WITHOUT conversion step.
[2021/01/04 11:52:59 | DictLoop | Info] Elapsed time: 1.272233 ms.
[2021/01/04 11:52:59 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:52:59 | DictLoop | Info] Accessing fasta.SequenceDictionary 1000 times WITHOUT conversion step.
[2021/01/04 11:52:59 | DictLoop | Info] Elapsed time: 0.773036 ms.
[2021/01/04 11:52:59 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:52:59 | DictLoop | Info] Accessing SAMSequenceDictionary 1000 times with asSam conversion step.
[2021/01/04 11:53:14 | DictLoop | Info] Elapsed time: 15276.873759 ms.
[2021/01/04 11:53:14 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:53:14 | DictLoop | Info] Accessing fasta.SequenceDictionary 1000 times with fromSam conversion step.
[2021/01/04 11:53:24 | DictLoop | Info] Elapsed time: 10099.052911 ms.
[2021/01/04 11:53:24 | DictLoop | Info] -----------------------------------------------------------------------------

[2021/01/04 11:53:41 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:53:41 | DictLoop | Info] Accessing SAMSequenceDictionary 10000 times WITHOUT conversion step.
[2021/01/04 11:53:41 | DictLoop | Info] Elapsed time: 1.898992 ms.
[2021/01/04 11:53:41 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:53:41 | DictLoop | Info] Accessing fasta.SequenceDictionary 10000 times WITHOUT conversion step.
[2021/01/04 11:53:41 | DictLoop | Info] Elapsed time: 2.3582020000000004 ms.
[2021/01/04 11:53:41 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:53:41 | DictLoop | Info] Accessing SAMSequenceDictionary 10000 times with asSam conversion step.
[2021/01/04 11:56:37 | DictLoop | Info] Elapsed time: 175819.371773 ms.
[2021/01/04 11:56:37 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:56:37 | DictLoop | Info] Accessing fasta.SequenceDictionary 10000 times with fromSam conversion step.
[2021/01/04 11:58:29 | DictLoop | Info] Elapsed time: 111949.549237 ms.
[2021/01/04 11:58:29 | DictLoop | Info] -----------------------------------------------------------------------------

[2021/01/04 11:59:11 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:59:11 | DictLoop | Info] Accessing SAMSequenceDictionary 20000 times WITHOUT conversion step.
[2021/01/04 11:59:11 | DictLoop | Info] Elapsed time: 2.286664 ms.
[2021/01/04 11:59:11 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:59:11 | DictLoop | Info] Accessing fasta.SequenceDictionary 20000 times WITHOUT conversion step.
[2021/01/04 11:59:11 | DictLoop | Info] Elapsed time: 2.453585 ms.
[2021/01/04 11:59:11 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 11:59:11 | DictLoop | Info] Accessing SAMSequenceDictionary 20000 times with asSam conversion step.
[2021/01/04 12:05:16 | DictLoop | Info] Elapsed time: 365050.42465 ms.
[2021/01/04 12:05:16 | DictLoop | Info] -----------------------------------------------------------------------------
[2021/01/04 12:05:16 | DictLoop | Info] Accessing fasta.SequenceDictionary 20000 times with fromSam conversion step.
[2021/01/04 12:09:09 | DictLoop | Info] Elapsed time: 232528.374247 ms.
[2021/01/04 12:09:09 | DictLoop | Info] -----------------------------------------------------------------------------
nh13 commented 3 years ago

@mjhipp I think from your examples, it's clear that the conversion should be cached outside the loop. This may not be immediately obvious when moving to the new API, so I agree with your warning for folks moving over. I think the action item here is to add a big warning to the documentation of the implicit converters to state that the conversion is expensive! If really a problem, we could think about caching the conversion inside the converter class.

@tfenne what do you think?