daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License
284 stars 78 forks source link

Lightweight Gene class for gffutils #21

Open yarden opened 11 years ago

yarden commented 11 years ago

A feature that I'd find very useful is parsing of gffutils GFF records into a (lightweight) Gene-like object. Something like:

# recs is a set of GFF records
gene_obj = make_gene_obj(recs)
# Access the mRNAs, their exons
first_mRNA = gene_obj.mRNAs[0]
print "Exons of mRNA: ", first_mRNA
for exon_num, exon in enumerate(first_mRNA.exons):
  print "Exon %d is: " %(exon_num), exon 

The reason this is useful is because:

  1. You don't have to keep thinking of parent-child queries for canonical structures like a gene; you can think of things as objects, which is very intuitive. A gene has a list of mRNAs as one of its attributes, and each mRNA has a list of exons (or introns, or CDSS) as its attributes.
  2. The object representation would internally order the features in a predictable way, kind of like GFFWriter does for serialization. For example, if you iterate over the exons belonging to an mRNA, as in:
for exon in first_mRNA.exons:
  # access exon
  ...

then it'd be helpful to know that you're guaranteed to get the exons sorted by their coordinate, so that you're iterating over the mRNA's exons from the start of the mRNA to finish. I believe that if you simply do db.children(mRNA_id) you're not guaranteed to get it in order. Similarly for getting the mRNA records of the gene. It's helpful to know that they're sorted by some canonical order, like longest to shortest, so that gene_obj.mRNAs[0] is always the longest mRNA.

I think the ideal implementation would try to keep the objects as similar to GFF features as possible. For example, you should be able to do: gene_obj.mRNAs[0].start, gene_obj.mRNAs[0].stop, etc. The mRNA object could even inherit from the GFF feature class.

The way I'm thinking of accessing the objects would be through an iterator that returns gene objects, making them on the fly, something like:

for gene_obj in db.get_genes():
  # access gene
  ...

where get_genes() internally does something like:

def get_genes(self, ...):
   # Get records belonging to each gene
   for gene_recs in db.iter_by_parent_childs(featuretype='gene'):
     # make the gene's records into an object in canonical order
     # where mRNAs appear in attributes sorted by length
     # and where exons appear in attributes sorted by their
     # start coordinate
     gene_obj = records_to_gene(gene_recs)
     yield gene_obj

What do you think? Is this something you think is worth putting into gffutils? If so, do you have ideas on what would be the most efficient way to do it? The naive sketch of get_genes() might be fast enough, but I'm not sure. Thanks, --Yarden

daler commented 11 years ago

This sounds like a good idea, and pretty simple to implement. Sort of a special-case scenario for when your GFF file has a known, gene-centric format (which is the format you and I work with most, but I'm trying to keep gffutils general).

I believe that if you simply do db.children(mRNA_id) you're not guaranteed to get it in order

Right, but this is actually on purpose -- since there's additional sorting overhead, it's disabled by default for speed. If you want them in order, then you can always use db.children(mRNA_id, order_by='start').

For this gene-specific case, I think it's fine to save all the children as lists in a gene model which will make manipulation as an object more intuitive. As you mentioned, most of the work is done by your GFFWriter class - just put the various mRNAs and exons in the right places in an object rather than printing their string representation.

One unresolved issue though is where to put the get_genes function. I don't think it should be a method on FeatureDB, since pretty much all the other methods return iterators of Feature objects and it could be confusing. Maybe a separate module? genemodels.py, maybe?