bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Apache License 2.0
996 stars 309 forks source link

`Symbol` case class is nucleotide-centric #672

Closed laserson closed 8 years ago

laserson commented 9 years ago

Not sure if this really matters, but just wanted to address that the Symbol class in the new Alphabet machinery includes a "complement" field, which doesn't make sense for proteins or other things (e.g., secondary structure). Perhaps necessary to subclass Alphabet? Thoughts?

massie commented 9 years ago

Agree. Should we consider incorporating BioScala into ADAM?

https://github.com/bioscala/bioscala

It's BSD licensed.

fnothaft commented 9 years ago

I don't know of anyone using ADAM for proteomic or RNA secondary structure work, so I'm inclined to not make another change for now.

BioScala does look like an attractive package if we were to go that route, but it doesn't look like it is actively being developed. CCing @antonkulaga who I see has committed there a lot, who might know more about the status of the BioScala project.

laserson commented 9 years ago

I'm not familiar with that project, but I'm not opposed. Also, what about the issue of persisting Alphabet information?

fnothaft commented 9 years ago

Also, what about the issue of persisting Alphabet information?

Sorry, I didn't follow here. What do you mean by that?

laserson commented 9 years ago

Sorry, just thinking aloud. Wondering whether it'd be useful to model Alphabets at the Avro level. But that's probably unnecessary complication...

fnothaft commented 9 years ago

@laserson we used to have a Base enum in the Avro level, but it wound up not being terribly useful so we scrapped it in https://github.com/bigdatagenomics/bdg-formats/pull/46. If I could TL;DR my experiences with enum based alphabets, the problem is that strings are reasonably efficient and a fair bit easier to work with.

fnothaft commented 9 years ago

Also, I would say that it would probably be sufficient to just break the "reverse complement-able" section of alphabets out into another trait. E.g., you'd have Alphabet and ComplementableAlphabet or something of the like.

fnothaft commented 9 years ago

Do we want to keep this open or close it? IMO, this is not a problem for now. I think we should close it and reopen it later if we decide to work with protein sequences/etc.

fnothaft commented 8 years ago

Closing as won't fix.