Open yarden opened 11 years ago
This sounds like a good idea, and pretty simple to implement. Sort of a special-case scenario for when your GFF file has a known, gene-centric format (which is the format you and I work with most, but I'm trying to keep gffutils general).
I believe that if you simply do db.children(mRNA_id) you're not guaranteed to get it in order
Right, but this is actually on purpose -- since there's additional sorting overhead, it's disabled by default for speed. If you want them in order, then you can always use db.children(mRNA_id, order_by='start')
.
For this gene-specific case, I think it's fine to save all the children as lists in a gene model which will make manipulation as an object more intuitive. As you mentioned, most of the work is done by your GFFWriter
class - just put the various mRNAs and exons in the right places in an object rather than printing their string representation.
One unresolved issue though is where to put the get_genes
function. I don't think it should be a method on FeatureDB
, since pretty much all the other methods return iterators of Feature
objects and it could be confusing. Maybe a separate module? genemodels.py
, maybe?
A feature that I'd find very useful is parsing of gffutils GFF records into a (lightweight) Gene-like object. Something like:
The reason this is useful is because:
GFFWriter
does for serialization. For example, if you iterate over the exons belonging to an mRNA, as in:then it'd be helpful to know that you're guaranteed to get the exons sorted by their coordinate, so that you're iterating over the mRNA's exons from the start of the mRNA to finish. I believe that if you simply do
db.children(mRNA_id)
you're not guaranteed to get it in order. Similarly for getting the mRNA records of the gene. It's helpful to know that they're sorted by some canonical order, like longest to shortest, so thatgene_obj.mRNAs[0]
is always the longest mRNA.I think the ideal implementation would try to keep the objects as similar to GFF features as possible. For example, you should be able to do:
gene_obj.mRNAs[0].start
,gene_obj.mRNAs[0].stop
, etc. ThemRNA
object could even inherit from the GFF feature class.The way I'm thinking of accessing the objects would be through an iterator that returns gene objects, making them on the fly, something like:
where
get_genes()
internally does something like:What do you think? Is this something you think is worth putting into gffutils? If so, do you have ideas on what would be the most efficient way to do it? The naive sketch of
get_genes()
might be fast enough, but I'm not sure. Thanks, --Yarden