Closed laserson closed 8 years ago
Agree. Should we consider incorporating BioScala into ADAM?
https://github.com/bioscala/bioscala
It's BSD licensed.
I don't know of anyone using ADAM for proteomic or RNA secondary structure work, so I'm inclined to not make another change for now.
BioScala does look like an attractive package if we were to go that route, but it doesn't look like it is actively being developed. CCing @antonkulaga, who I see has committed there a lot and might know more about the status of the BioScala project.
I'm not familiar with that project, but I'm not opposed. Also, what about the issue of persisting Alphabet information?
> Also, what about the issue of persisting Alphabet information?
Sorry, I didn't follow here. What do you mean by that?
Sorry, just thinking aloud. Wondering whether it'd be useful to model Alphabets at the Avro level. But that's probably unnecessary complication...
@laserson we used to have a `Base` enum at the Avro level, but it wound up not being terribly useful, so we scrapped it in https://github.com/bigdatagenomics/bdg-formats/pull/46. If I could TL;DR my experiences with `enum`-based alphabets, the problem is that strings are reasonably efficient and a fair bit easier to work with.
Also, I would say that it would probably be sufficient to just break the "reverse complement-able" section of alphabets out into another trait. E.g., you'd have `Alphabet` and `ComplementableAlphabet`, or something of the like.
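For illustration, a minimal sketch of what that split might look like. Everything here is hypothetical: apart from the two trait names suggested above, the members (`symbols`, `isValid`, `complement`, `reverseComplement`) and the example objects are assumptions made for the sketch, not ADAM's actual API.

```scala
// Hypothetical sketch; member names and objects are illustrative, not ADAM's API.

// A plain alphabet only knows which symbols are valid.
trait Alphabet {
  def symbols: Set[Char]

  // True if every character of the sequence belongs to this alphabet.
  def isValid(sequence: String): Boolean =
    sequence.forall(c => symbols.contains(c))
}

// Only alphabets with a meaningful complement mix in this trait.
trait ComplementableAlphabet extends Alphabet {
  def complement(symbol: Char): Char

  // Reverse the sequence, then complement each symbol.
  def reverseComplement(sequence: String): String = {
    require(isValid(sequence), s"invalid symbol in sequence: $sequence")
    sequence.reverse.map(c => complement(c))
  }
}

// DNA has a complement, so it extends the richer trait.
object DNAAlphabet extends ComplementableAlphabet {
  private val pairs = Map('A' -> 'T', 'T' -> 'A', 'C' -> 'G', 'G' -> 'C')
  val symbols: Set[Char] = pairs.keySet
  def complement(symbol: Char): Char = pairs(symbol)
}

// Proteins have no complement, so only the base trait applies.
object ProteinAlphabet extends Alphabet {
  val symbols: Set[Char] = "ACDEFGHIKLMNPQRSTVWY".toSet
}
```

With this shape, `DNAAlphabet.reverseComplement("AACG")` gives `"CGTT"`, while `ProteinAlphabet.reverseComplement(...)` simply does not compile, so the "complement doesn't make sense here" case is caught by the type system rather than at runtime.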
Do we want to keep this open or close it? IMO, this is not a problem for now. I think we should close it and reopen it later if we decide to work with protein sequences/etc.
Closing as won't fix.
Not sure if this really matters, but just wanted to address that the `Symbol` class in the new `Alphabet` machinery includes a "complement" field, which doesn't make sense for proteins or other things (e.g., secondary structure). Perhaps necessary to subclass `Alphabet`? Thoughts?