cucapra / pollen

generating hardware accelerators for pangenomic graph queries
MIT License

slow-odgi: more thoughtful generation of segment names #54

Closed anshumanmohan closed 1 year ago

anshumanmohan commented 1 year ago

It doesn't seem like segment names are guaranteed to be numbers in the GFA format. They are in all the examples we're familiar with, but it seems to me like the format defines them to be arbitrary strings (which is why we store them as strs and not ints). Therefore:

Longer term, we probably want to generate a "fresh" name by looking at all the existing names and generating one that is actually guaranteed to be unused. Or perhaps we want to use some kind of internal numeric ID space, which we then map back to strings. Anyway, plenty of options here!

_Originally posted by @sampsyo in https://github.com/cucapra/pollen/pull/51#discussion_r1170651606_
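
For concreteness, here is a rough sketch of the two options floated above. Neither is taken from slow-odgi: `fresh_segment_name` and `SegmentIds` are hypothetical names, and the only assumption carried over from the discussion is that segment names are held as Python strs.

```python
from itertools import count


def fresh_segment_name(existing, prefix=""):
    """Return a segment name that does not occur in `existing`.

    Hypothetical helper, not part of slow-odgi. It starts counting one past
    the largest all-digit name, so the result looks natural in graphs whose
    names happen to be numeric, and it checks membership explicitly so the
    result is unused even when arbitrary string names are present.
    """
    used = set(existing)
    # Only plain ASCII digit strings are treated as numbers; everything
    # else ("a", "chr1", "") is just an opaque name.
    numeric = [int(n) for n in used if n.isascii() and n.isdigit()]
    start = max(numeric, default=0) + 1
    for i in count(start):
        candidate = f"{prefix}{i}"
        if candidate not in used:
            return candidate


class SegmentIds:
    """Internal numeric ID space that maps back to the original strings.

    Hypothetical sketch of the second option: IDs are dense ints assigned
    in first-seen order, and `name` recovers the original string.
    """

    def __init__(self):
        self._id_of = {}    # name -> int ID
        self._name_of = []  # int ID -> name

    def intern(self, name):
        # Assign the next free ID the first time a name is seen.
        if name not in self._id_of:
            self._id_of[name] = len(self._name_of)
            self._name_of.append(name)
        return self._id_of[name]

    def name(self, seg_id):
        return self._name_of[seg_id]
```

The first option leaves the data model alone and only touches the places that create segments. The second lets internal code work purely with ints, at the cost of a translation step when reading and writing GFA.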

anshumanmohan commented 1 year ago

Today I discovered that odgi build dies if you try to name a segment non-numerically.

Graph:

S   a   AAAA
S   2   TTTT
S   3   GGGG
S   4   CCCC

Error:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  stol
Aborted (core dumped)
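
A quick pre-check along these lines can flag such graphs before they reach odgi. This is only a sketch: `non_numeric_segments` is a hypothetical helper, and the all-ASCII-digits test is an assumed, conservative stand-in for whatever odgi's `stol` call accepts, not something derived from odgi's source. It splits on any whitespace so that it works on both tab-delimited GFA and the space-rendered listing above.

```python
def non_numeric_segments(gfa_text):
    """Return the names on S lines that are not plain runs of ASCII digits.

    Hypothetical helper. The digit test is an assumption about which names
    are safe to pass to odgi, chosen to be conservative.
    """
    bad = []
    for line in gfa_text.splitlines():
        fields = line.split()
        # An S (segment) line has at least a record type and a name.
        if len(fields) >= 2 and fields[0] == "S":
            name = fields[1]
            if not (name.isascii() and name.isdigit()):
                bad.append(name)
    return bad


graph = "S\ta\tAAAA\nS\t2\tTTTT\nS\t3\tGGGG\nS\t4\tCCCC\n"
print(non_numeric_segments(graph))  # ['a']
```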

sampsyo commented 1 year ago

Wow, how about that! Good to know. So maybe we shouldn't bother doing anything about this?

anshumanmohan commented 1 year ago

Yeah I think so. At least for now, while odgi is the oracle and odgi functionality is the target, I think this issue is moot. If/when we decide to graduate from odgi-style commands and allow pangenome-ey commands more generally, we could revisit this.