cucapra / pollen

generating hardware accelerators for pangenomic graph queries

New Pollen Type for DNA Bases #28

Closed susan-garry closed 1 year ago

susan-garry commented 1 year ago

Currently, we represent sequences - strands of DNA bases - as a String (equivalently, a List of characters). However, since odgi only allows 5 possible base values (A, T, C, G, N), we can represent each base using only 3 bits (as opposed to the 7/8 bits required to represent a character). Additionally, representing a sequence as a String lets users insert arbitrary values into a sequence, potentially introducing subtle, difficult-to-detect bugs or corrupting the data. Instead of representing a DNA base as a character, we could represent it with a dedicated type. Then we can enforce the invariant that the only bases are ATCGN, shrink our memory footprint, and possibly speed up data transfer in certain computations (?).
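
For concreteness, here is a minimal sketch of what a dedicated base type might look like in a host language like Rust (purely illustrative; the name `Base` and the particular 3-bit encoding values are assumptions, not anything Pollen defines today):

```rust
/// Hypothetical 3-bit encoding for a DNA base; only five values are legal,
/// so the invariant "every element is one of ATCGN" holds by construction.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Base {
    A = 0b000,
    T = 0b001,
    C = 0b010,
    G = 0b011,
    N = 0b100,
}

impl Base {
    /// Parse a character, rejecting anything outside ATCGN.
    fn from_char(c: char) -> Option<Base> {
        match c.to_ascii_uppercase() {
            'A' => Some(Base::A),
            'T' => Some(Base::T),
            'C' => Some(Base::C),
            'G' => Some(Base::G),
            'N' => Some(Base::N),
            _ => None,
        }
    }
}
```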

Considering what this might look like, I did some back-of-the-napkin calculations and found that the chr8.pan.gfa graph stores 259,525,394 bases, which take up about 216 MiB in memory if we represent each character with 7 bits. This might not sound like a lot, but I understand that a typical FPGA can only store a few megabytes at a time, so shrinking that to about 92 MiB by using 3 bits per base could cut the time needed to compute crush for the entire graph by a factor of roughly 2.3 (in a perfect world where the runtime of crush scales directly with the size of a sequence).
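
Spelling out the arithmetic behind those numbers (a throwaway check; the base count is just the figure quoted above):

```rust
fn main() {
    let bases: u64 = 259_525_394; // bases in chr8.pan.gfa
    let mib = |bits: u64| bits as f64 / 8.0 / (1024.0 * 1024.0);
    println!("7-bit chars: {:.1} MiB", mib(bases * 7)); // ~216.6 MiB
    println!("3-bit bases: {:.1} MiB", mib(bases * 3)); // ~92.8 MiB
    println!("ratio: {:.2}x", 7.0 / 3.0);               // ~2.33x smaller, and at best that much faster
}
```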

If we introduced a special Base type, the only change to our model would be that node sequences are no longer strings:

type Segment = {
      sequence: List<Base>; //ACTNNNGAC, etc.
      edges: Set(Edge); //edges encode their direction + orientation
      steps: Set(Step); //steps that go through segment
    }

crush would become

for segment in Segments:
        seq = List<Base>(); // seq = String::new();
        in_n = false;
        for c in segment.sequence:
          if (c == 'N') {
            if (!in_n) {
              in_n = true;
              seq.append(c);  // seq.push(c);
            }
          } else {
              in_n = false;
              seq.append(c);  // seq.push(c);
          }
        emit { segment with sequence = seq};
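
For reference, here is a runnable Rust approximation of this loop, using the hypothetical `Base` enum sketched above (only a sketch of the intended semantics, not the actual DSL):

```rust
/// Collapse each run of consecutive N bases into a single N, leaving other bases untouched.
fn crush(segment_sequence: &[Base]) -> Vec<Base> {
    let mut seq = Vec::new();
    let mut in_n = false;
    for &c in segment_sequence {
        if c == Base::N {
            if !in_n {
                in_n = true;
                seq.push(c);
            }
        } else {
            in_n = false;
            seq.push(c);
        }
    }
    seq
}
```

The `emit { segment with sequence = seq }` step corresponds here to simply returning the crushed sequence.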

We could even take this a step further and give List<Base> a special name, e.g., Strand. The way I conceptualize this, Strand and List are not interchangeable (i.e., they are not subtypes of each other). Because Base is effectively a subtype of Char, I imagine Strand behaving like String (except that the values of its characters are constrained). For example, we would still use seq.push(c) rather than seq.append(c), i.e., the syntax for Strings rather than Lists. In this case, the node type would be as follows:

type Segment = {
      sequence: Strand; //ACTNNNGAC, etc.
      edges: Set(Edge); //edges encode their direction + orientation
      steps: Set(Step); //steps that go through segment
    }

crush:

for segment in Segments:
        seq = Strand(); // seq = String::new();
        in_n = false;
        for c in segment.sequence:
          if (c == 'N') {
            if (!in_n) {
              in_n = true;
              seq.push(c);
            }
          } else {
              in_n = false;
              seq.push(c);
          }
        emit { segment with sequence = seq};
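
One way to express the "Strand behaves like String, but constrained" idea in a host language is a thin wrapper around a list of bases (again a hypothetical sketch; `Strand` and its methods here are made up for illustration):

```rust
/// A sequence that can only ever contain legal bases, with a String-like API.
struct Strand(Vec<Base>);

impl Strand {
    fn new() -> Strand {
        Strand(Vec::new())
    }

    /// String-style `push`; the type of `c` guarantees the ATCGN invariant.
    fn push(&mut self, c: Base) {
        self.0.push(c);
    }
}
```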

Thoughts on the introduction of these new types? Do we prefer Base but not Strand, or Strand but different from how I described it? Neither?

sampsyo commented 1 year ago

This definitely seems worthwhile! As always when we talk about data types, it's possible to decompose the issue into two different perspectives:

  1. The abstract data model: Like you say, @susan-garry, it seems weird to be making people interact with arbitrary character strings when the data should only be one of 5 options.
  2. The concrete implementation: We can save 62.5% of the space for sequences if we use 3 bits per nucleotide instead of 8.

With that in mind, when we make a DSL, we should definitely be exposing a more constrained type like Strand instead of an arbitrary string to satisfy concern 1 above. Then, we have the freedom to choose the way we want to represent it. Namely, it could still very much make sense to use a string as the concrete representation in some cases: e.g., all languages come with super-optimized and super-capable string libraries that we can use as a kind of shortcut.
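
As a purely hypothetical illustration of that freedom: the DSL-level Strand could be specified against an abstract interface, and the compiler could pick a character-string backing when targeting a CPU or a 3-bit packed backing when targeting hardware. A minimal sketch, with names that are assumptions rather than anything in Pollen or odgi:

```rust
/// The abstract interface a DSL-level Strand exposes.
trait StrandRepr {
    fn push(&mut self, b: Base);
    fn len(&self) -> usize;
}

/// One possible backing store: an ordinary string, so we can reuse the host
/// language's well-optimized string machinery.
struct CharStrand(String);

impl StrandRepr for CharStrand {
    fn push(&mut self, b: Base) {
        self.0.push(match b {
            Base::A => 'A',
            Base::T => 'T',
            Base::C => 'C',
            Base::G => 'G',
            Base::N => 'N',
        });
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

// A second backing store could pack 3 bits per base (see the bit-extraction
// sketch below); code written against StrandRepr would not need to change.
```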

I also think it's worth asking why odgi doesn't already use a smaller sequence representation, at least for its in-memory data structure: https://github.com/pangenome/odgi/blob/bfae0b3c70a0f044d7ce8510b7a45f59d5138626/src/node.hpp#L29

The reason is probably because, on a CPU, accessing values smaller than a byte is really inconvenient. You end up having to load whole bytes and slice-and-dice them to extract out the unaligned chunks of 3 bits. Even if you save a lot of memory footprint, this extra slicing-and-dicing might outweigh that advantage (or it might not! It's just unclear). In hardware, however, that's absolutely not true: accessing 3 bits is exactly as (in)convenient as accessing 8 bits. And furthermore, every bit counts: a narrower datapath could save a lot of resources throughout a design. So it seems critical to avoid waste in this way.
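
To make the slice-and-dice concrete, here is roughly what reading the i-th base out of a densely packed byte buffer looks like on a CPU (assuming a simple LSB-first packing for illustration; this is not odgi's actual layout):

```rust
/// Extract the i-th 3-bit base from a packed buffer. Even this simple read
/// needs a multiply, a divide, a modulo, a (possibly two-byte) load, a shift, and a mask.
fn get_base(packed: &[u8], i: usize) -> u8 {
    let bit = i * 3;     // absolute bit offset of the field
    let byte = bit / 8;  // first byte the field touches
    let shift = bit % 8; // where the field starts within that byte
    // Load two bytes so the field is still available when it straddles a byte boundary.
    let lo = packed[byte] as u16;
    let hi = *packed.get(byte + 1).unwrap_or(&0) as u16;
    (((hi << 8 | lo) >> shift) & 0b111) as u8
}
```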

susan-garry commented 1 year ago

Thank you for clarifying the different issues at play with respect to the abstract model vs. the physical implementation; this makes a lot of sense!

I suppose that since Pollen generates a lot of data with unusual bit widths, at some point we may want to consider whether this hurts program performance on CPUs and address those (hypothetical) performance issues. Does Calyx by any chance already have a way to deal with this memory/speed tradeoff?

sampsyo commented 1 year ago

It doesn't really do much about this at all: all bit widths are equally (in)efficient. That is, because the target is hardware, not software, it has no special affinity for 8-bit-aligned sizes. One place we could consider changing this is in the interpreter… but I think the only time we should worry about it is if we also generate CPU code (not Calyx code) from the DSL.