cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License

Generate Calyx-friendly JSON #25

Closed anshumanmohan closed 1 year ago

anshumanmohan commented 1 year ago

These JSON files are generated from GFA files rather than odgi graphs. The generation is done purely in Python, without odgi commands or odgi's Python bindings.

Presently I match the .data files that Susan creates for odgi depth, but I will soon extend this to expose an interface that lets us generate more or less of the gfa. Before going too far, I'm wondering about minimizing these JSON files a little; see #24.
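
To make the "purely in Python" claim concrete, here is a minimal sketch of reading a GFA file's S, L, and P lines and dumping them as JSON with only the standard library. It is not the PR's code and does not use mygfa: the function name gfa_to_json, the EXAMPLE graph, and the output keys ("segments", "links", "paths") are all hypothetical, and the layout makes no attempt to match the .data files used for odgi depth.

```python
import json


def gfa_to_json(gfa_text: str) -> str:
    """Hypothetical helper: dump the S, L and P lines of a GFA string as JSON."""
    segments, links, paths = {}, [], {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # S <name> <sequence>
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            links.append({
                "from": fields[1], "from_orient": fields[2],
                "to": fields[3], "to_orient": fields[4],
            })
        elif fields[0] == "P":
            # P <name> <comma-separated oriented steps> <overlaps>
            paths[fields[1]] = fields[2].split(",")
        # Header (H) lines and anything else are ignored in this sketch.
    doc = {"segments": segments, "links": links, "paths": paths}
    return json.dumps(doc, indent=2, sort_keys=True)


# A made-up two-segment graph with one link and one path.
EXAMPLE = "H\tVN:Z:1.0\nS\t1\tAAT\nS\t2\tGC\nL\t1\t+\t2\t+\t0M\nP\tx\t1+,2+\t*\n"

if __name__ == "__main__":
    print(gfa_to_json(EXAMPLE))
```

Keeping the parser this small is what makes it practical to emit "more or less of the gfa" later: each line type maps to one key that can be included or dropped.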

Some next steps:

sampsyo commented 1 year ago

I see this is now a draft—care to comment about whether this is ready for a re-review or if there are other outstanding tasks?

anshumanmohan commented 1 year ago

Sorry about the silence! This fell off my radar what with all the other changes elsewhere. In the commits since your last review, I have

  1. Incorporated your comments.
  2. Brought the code up to speed with changes in mygfa, the package-import style, etc.
  3. Dovetailed testing into slow-odgi's testing and the interface into slow-odgi's CLI. These are temporary, and perhaps a mistake. This should stand alone and just borrow mygfa from slow-odgi. Better yet, mygfa should be lifted above these two separate packages. Issue forthcoming.

anshumanmohan commented 1 year ago

make test-mkjson will now run the depth-specific json-generator, with exine depth as its oracle.

I now have the ability to pass in the command-line flags n, e, and p that are used to determine the max nodes, max steps per node, and max paths respectively. There is work to be done, though:

  1. The exine json-generator adjusts the widths of fields as needed; I just stick to 4. Must fix.
  2. Fixing this will allow me to pass in larger parameters for these three flags and therefore compute JSONs for the four larger graphs. At present we don't do anything reasonable with them: exine, our oracle, complains that the parameters need to be bigger, and we ignore this, so the expect files are empty.
  3. Opening up testing to all the graphs may well reveal other issues that have not yet come up with the smaller four graphs.
  4. The final step will be to infer these parameters automatically and tightly, so I won't have to run the smaller graphs with parameters that the bigger graphs need.
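
As a rough illustration of items 1 and 4, the sketch below infers the three parameters tightly from a GFA string and computes the smallest field width that can hold a given maximum. It is an assumption-laden sketch, not exine's or this PR's logic: infer_params and bits_needed are hypothetical names, "width" is assumed to mean a bit width, and "steps per node" is assumed to count how many path steps visit a segment across all paths.

```python
def infer_params(gfa_text: str) -> dict:
    """Hypothetical helper: tight values for max nodes (n), max steps per
    node (e), and max paths (p), computed from the graph itself."""
    steps_per_node = {}
    n_paths = 0
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # Register the segment even if no path ever visits it.
            steps_per_node.setdefault(fields[1], 0)
        elif fields[0] == "P":
            n_paths += 1
            for step in fields[2].split(","):
                name = step[:-1]  # drop the trailing orientation, + or -
                steps_per_node[name] = steps_per_node.get(name, 0) + 1
    return {
        "n": len(steps_per_node),
        "e": max(steps_per_node.values(), default=0),
        "p": n_paths,
    }


def bits_needed(max_value: int) -> int:
    """Smallest bit width that can represent max_value (at least 1 bit)."""
    return max(1, max_value.bit_length())


# A made-up graph: segment 1 is visited three times across two paths.
EXAMPLE = "S\t1\tA\nS\t2\tC\nS\t3\tG\nP\tx\t1+,2+\t*\nP\ty\t1+,3+,1-\t*\n"

if __name__ == "__main__":
    params = infer_params(EXAMPLE)
    print(params, {k: bits_needed(v) for k, v in params.items()})
```

Under these assumptions, a fixed width of 4 tops out at a value of 15, which is consistent with the larger graphs needing wider fields.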

anshumanmohan commented 1 year ago

If you're interested in testing the "simple" JSON dump:

  1. Toggle the commented lines 158/159 of __main__ so that simple_json becomes the target function.
  2. Test with slow_odgi mkjson test/k.gfa and the like, not with turnt. There isn't a reasonable oracle for the simple dump.

anshumanmohan commented 1 year ago

Done with the parameter-adjusting stuff! We now mimic exine exactly: if the user provides the -a flag, as I currently do in turnt, all the parameters are inferred automatically and tightly. However, the user is free to also supply some other value(s), and any user-supplied values always take precedence.

For example, here's how you can go hard on the number of paths for no reason.

[envs.mkjson_oracle]
binary = true
command = "exine depth -d {filename} -a {filename} -p 500"
output.json = "-"

[envs.mkjson_test]
binary = true
command = "slow_odgi mkjson {filename} -p 500"
output.json = "-"

Anyway, outlandish examples aside, we now diff out correctly against all the fetch-ed GFA files!
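
The precedence rule described above (user-supplied values always win, then automatic inference if -a is given) can be sketched with argparse as follows. This is not the slow_odgi CLI: the parser, the resolve_params helper, the boolean treatment of -a, and the fallback defaults are all hypothetical and only illustrate the ordering.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: -a enables inference; -n/-e/-p are overrides."""
    parser = argparse.ArgumentParser(prog="mkjson-sketch")
    parser.add_argument("-a", dest="auto", action="store_true",
                        help="infer any parameter the user did not supply")
    for flag in ("n", "e", "p"):
        # default=None lets us tell "not supplied" apart from any real value.
        parser.add_argument(f"-{flag}", type=int, default=None)
    return parser


def resolve_params(args, inferred: dict, defaults: dict) -> dict:
    """User-supplied value first, then the inferred one (only with -a),
    then a fixed default."""
    resolved = {}
    for key in ("n", "e", "p"):
        user_value = getattr(args, key)
        if user_value is not None:
            resolved[key] = user_value
        elif args.auto:
            resolved[key] = inferred[key]
        else:
            resolved[key] = defaults[key]
    return resolved


if __name__ == "__main__":
    args = build_parser().parse_args(["-a", "-p", "500"])
    inferred = {"n": 3, "e": 3, "p": 2}      # made-up inferred values
    defaults = {"n": 16, "e": 16, "p": 16}   # made-up fallback defaults
    print(resolve_params(args, inferred, defaults))
```

With -a and -p 500, n and e come from inference while p stays at the user's 500, mirroring the "go hard on the number of paths" example.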