cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License
24 stars 1 forks source link

`calyx_depth.py`: use the eDSL, stop using odgi's Python bindings #105

Open anshumanmohan opened 1 year ago

anshumanmohan commented 1 year ago

Currently, calyx_depth.py generates Calyx code

  1. Without the use of the Calyx eDSL (aka the builder library).
  2. With the use of the odgi Python bindings (it's not obvious to see, but it uses parse_data which in turn uses the Python bindings).

I propose to change both of these.

  1. Using the eDSL will lighten the code significantly.
  2. The odgi Python bindings have proved hard to use. Besides, the current code uses the bindings only to infer the dimensions of the graph, and we have in-house ways of doing that now.

These relatively minor gains aside, the true win is that, after this, we will truly be using odgi only as a source of inspiration and oracle-testing, not as part of our workflow.

We currently do something like:

My proposal is something more like:

The first piece of this is already there: I already generate the same graph.data using pollen_data_gen as one would have using the usual route. The third piece should work if the first two work.

The second piece decomposes into two unrelated parts:

sampsyo commented 1 year ago

Wonderful; thank you for summarizing this so clearly! Indeed, both of these would be great advances that would "modernize" our hardware generation.

One small operational note: the eDSL adoption can happen incrementally, if it's a lot of work. That is, it's not necessary to flip everything over all at once; it suffices to convert individual pieces to use the builder library one at a time since it interoperates with the "plain" AST library pretty smoothly.

anshumanmohan commented 1 year ago

Thanks Adrian! Dropping the "triage required" tag. This issue remains up for grabs.

susan-garry commented 11 months ago

This looks good to me! I agree that switching over the builder library isn't a top priority, but it would certainly be nice to at some point.

I don't necessarily think it is a bad thing to use odgi in our hardware generation if it turns out that it's faster to transform a .gfa graph to an .og graph and then use odgi to get its dimensions, rather than just obtaining the dimensions from the .gfa file directly, but since .gfa files seem to be sort of a "standard representation" for pangenomic graphs, it would be good to at least transition exine to taking in .gfa files as input rather than .og files, which it currently expects. I also think it would be a win just to avoid making pollen users install odgi.

sampsyo commented 11 months ago

Indeed. Sticking with .gfa files is the right thing to do here because (a) they are the "source of truth," (b) odgi is a difficult-to-install dependency (to put it lightly), and (c) performance at hardware-generation time is not very important.