biocore / scikit-bio-rfcs

Request For Comments (RFCs) for scikit-bio.

Propose Interval types #11

Open mortonjt opened 8 years ago

mortonjt commented 8 years ago

After talking to @tanaes, @RNAer and @josenavas, it is becoming clear that strand information should not be an optional metadata field for certain types of biological objects, such as DNA sequences.

We are thinking that introducing an OrderedInterval type that inherits the properties of Interval, but strictly enforces having a + or - strand, could resolve this issue. This sort of type would be critical, since an ambiguous field for strand/directional information would be a blocker for many applications.
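To make the proposal concrete, here is a minimal sketch of what such a subclass might look like. The class bodies, attribute names, and constructor signatures are assumptions for illustration only; they are not an existing scikit-bio API.

```python
class Interval:
    """Hypothetical unoriented interval annotation (illustrative only)."""

    def __init__(self, start, end, metadata=None):
        if start > end:
            raise ValueError("start must not exceed end")
        self.start = start
        self.end = end
        self.metadata = dict(metadata or {})


class OrderedInterval(Interval):
    """Hypothetical interval that must carry an explicit strand."""

    # A tuple, not the string "+-", so that "" and "+-" are rejected.
    VALID_STRANDS = ("+", "-")

    def __init__(self, start, end, strand, metadata=None):
        if strand not in self.VALID_STRANDS:
            raise ValueError(f"strand must be '+' or '-', got {strand!r}")
        super().__init__(start, end, metadata)
        self.strand = strand


# The gene name is a made-up placeholder.
gene = OrderedInterval(10, 250, "-", {"gene": "hypothetical_gene"})
```

With this shape, downstream code can read `gene.strand` directly instead of hoping that a particular key is present in `metadata`.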

ebolyen commented 8 years ago

I'm not sure I understand the name OrderedInterval. Also it strikes me as a little strange that DNA and friends don't have a concept of sense, but their metadata might.

I think I am okay with the general idea of subclassing for specific Sequence types though. What happens if I have some metadata that is independent of sense? (does that exist?)

tanaes commented 8 years ago

Here's the thinking:

Biological sequences are generally inherently ordered: 5'->3' for nucleic acids, N to C for proteins.

Some interval-indexed metadata will be inherently oriented with reference to that order. The most common example is likely to be sense/antisense coding regions for double-stranded DNA, but you could also imagine something like protein/protein interaction regions that are oriented relative to the direction of the peptide bonds. But other metadata won't be: identification of chromosomes in a genome, for instance, or secondary structure motifs in a protein.

As I think about developing on top of these structures, I really want that orientation to be enforced for certain types of metadata. For example, if I have a method that deals with displaying annotations on a genome, I'd like to know for sure where I can find that orientation information for coding regions, rather than depending on the user to supply a particular keyword in the metadata.

Really I don't think the subclassing should be directly dependent on sequence type, though -- you can imagine both oriented (or ordered) and non-oriented metadata for just about any sequence type. In GFF format, this is a required field that can take +, -, or . for non-oriented annotations. Here, I think either making non-oriented annotations the top-level Interval object, or allowing the orientation keyword of OrderedInterval to take None, would be sensible approaches.
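As a rough sketch of the second option (a single type whose orientation may be None), the GFF strand column could be mapped as below. The `Annotation` class and `annotation_from_gff_line` helper are hypothetical names, and only the three strand values mentioned above are handled.

```python
from dataclasses import dataclass
from typing import Optional

# GFF strand column -> orientation; "." marks a non-oriented annotation.
GFF_STRAND_TO_ORIENTATION = {"+": "+", "-": "-", ".": None}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation whose orientation may be None."""

    start: int
    end: int
    orientation: Optional[str]

    @property
    def is_oriented(self):
        return self.orientation is not None


def annotation_from_gff_line(line):
    """Build an Annotation from one tab-separated, nine-column GFF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    strand = fields[6]  # seventh column holds the strand
    if strand not in GFF_STRAND_TO_ORIENTATION:
        raise ValueError(f"unsupported strand value: {strand!r}")
    # Columns four and five hold the start and end coordinates.
    return Annotation(int(fields[3]), int(fields[4]),
                      GFF_STRAND_TO_ORIENTATION[strand])


# Made-up example records.
cds = annotation_from_gff_line("chr1\t.\tCDS\t100\t400\t.\t-\t0\tID=cds1")
region = annotation_from_gff_line("chr1\t.\tregion\t1\t5000\t.\t.\t.\tID=chr1")
```

The first option would instead return a plain Interval for `.` and an OrderedInterval otherwise, so that callers can check the type rather than test for None.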

rob-knight commented 8 years ago

Note that trans splicing exists, so a single coding sequence can be drawn from both strands of a chromosome (this is what finally led me to give up on my C Genbank parser and rewrite it from scratch in Python).
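One way to accommodate this, sketched under the assumption that strand is stored per segment rather than per feature (the `SplicedFeature` name and its fields are hypothetical, not part of any proposal in this thread):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SplicedFeature:
    """Hypothetical feature made of segments, each with its own strand."""

    name: str
    segments: List[Tuple[int, int, str]]  # (start, end, strand)

    def strands(self):
        """Return the set of strands used by this feature's segments."""
        return {strand for _, _, strand in self.segments}

    def is_single_stranded(self):
        """True when every segment lies on the same strand."""
        return len(self.strands()) == 1


# Made-up coordinates: one segment on each strand.
trans_spliced = SplicedFeature(
    "hypothetical_cds", [(100, 200, "+"), (900, 1000, "-")]
)
ordinary = SplicedFeature("hypothetical_gene", [(10, 50, "+"), (80, 120, "+")])
```

Under this layout, a feature-level strand attribute would be undefined for `trans_spliced`, which is the case a single enforced strand per annotation cannot represent.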

(Sent as an email reply to Jon Sanders's comment of Jun 9, 2016, above.)