daler opened this issue 11 years ago
I've been thinking about two related GFF functionalities, which I believe reflect very common usage patterns for other users too, but unfortunately no tool that I know of has these features. I'm interested in: (1) an efficient library for parsing GFF files under the hood from Python, which ideally is dependency-free (more on this below), and (2) a command line utility for quickly "grep"-ing, slicing and viewing GFF files.
A bit of background on what I'm trying to achieve and my current issues with GFF3: I generally think that GFF3 is not a good format, and so tools like #1 and #2 are needed to make it workable, whether you're parsing GFF files in your code or just using it to store/browse annotations. A primary problem with GFF3 is that even though it's a hierarchical format, there isn't a single identifier running through each gene tree, and this, combined with the fact that the entries are unordered, makes it a pain to work with. Rather than doing multiple passes through the file to collect all the relevant records, I've usually defaulted to an inelegant and inefficient approach of loading all the genes into memory, which is suboptimal and obviously does not scale. (As an example, I include a script of mine below that uses the current library to add introns to a GFF. With the right GFF parsing library, this code could be much improved.)
Related to this, GFF is missing a random access indexing format, and so I wrote my own extremely primitive code (https://github.com/yarden/MISO/blob/fastmiso/misopy/index_gff.py) to at least make it so you can easily fetch an entire gene "unit" (all of the gene's records) by the ID= of the gene. Since the gene entries of GFF are independent, and since the problem I was working on is embarrassingly parallelizable (each gene can be processed independently on a cluster system), I needed a way to instantly retrieve a parsed gene unit in a way that wouldn't cause deadlock for multiple threads. I'm generally unhappy with the code and would like to find a library that replaces my current clunky solution. A library that uses C (or Cython) but has a Python wrapper sounds like the best solution to me. I'm hoping that this library could be used in my project (https://github.com/yarden/MISO) instead of the GFF parser.
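The random-access idea described here can be sketched without any dependencies: record the byte offset of each top-level gene line while scanning once, then seek back to it on demand. The function names, and the assumption that a gene "unit" is a contiguous block introduced by a gene-typed line, are mine, purely for illustration:

```python
import re

def build_gene_offset_index(gff_path):
    """Map each top-level gene ID to the byte offset of its 'gene' record.

    Assumes each gene unit's records form one contiguous block that
    starts with a line whose feature type (column 3) is 'gene'.
    """
    index = {}
    offset = 0
    with open(gff_path, "rb") as fh:
        for raw in fh:
            fields = raw.decode("utf-8", "replace").rstrip().split("\t")
            if len(fields) == 9 and fields[2] == "gene":
                m = re.search(r"ID=([^;]+)", fields[8])
                if m:
                    index[m.group(1)] = offset
            offset += len(raw)
    return index

def fetch_gene_unit(gff_path, index, gene_id):
    """Seek to the gene's offset and yield lines until the next gene record."""
    with open(gff_path, "rb") as fh:
        fh.seek(index[gene_id])
        first = True
        for raw in fh:
            line = raw.decode("utf-8", "replace")
            fields = line.rstrip().split("\t")
            if len(fields) == 9 and fields[2] == "gene" and not first:
                break
            first = False
            yield line
```

Because the index is just a dictionary of offsets, each cluster job can load it (or a shard of it) independently and read from the same file with no locking at all, which is the deadlock-free property described above.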
Related to the library, I'd like to find a tool that allows slicing/viewing/retrieval of GFF3 files. I really liked the way you framed the core tasks on your gffutils README, and I was imagining the same set of features, except in a form also accessible from the command line. If I combine your requirements and mine, I think the core features would be something like:
Feature 1: A "grep" like feature for gene units:
# Return all records of 'mygene': the gene entry itself, the mRNAs,
# the children of the mRNAs, and the children of the exons if available
gffutils --fetch-gene mygene mygff.gff3
# Return all child records of 'mygeneA', 'mygeneB', 'mygeneC'
gffutils --fetch-gene mygeneA,mygeneB,mygeneC mygff.gff3
Ideally, there would be a 'limit' feature (akin to the one in the BCBio parser) that allows you to "clip" the hierarchy at some level. For example:
# Return all mRNA child records of 'mygene'; don't return anything "below" mRNA in the hierarchy
gffutils --fetch-gene mygene mygff.gff3 --limit mRNA
# Return all child records of 'mygene' including mRNAs and exons, but nothing below
# (e.g. don't return CDS or UTR annotations if those are given as children of exons)
gffutils --fetch-gene mygene mygff.gff3 --limit exon
However, this can also be achieved by a feature analogous to grep -v that excludes entries:
# Get all children records of a gene except CDS, 3p_UTR, and 5p_UTR entries
gffutils --fetch-gene mygene | gffutils --exclude-types cdss,3p_utr,5p_utr
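An exclude-types filter like this needs no database or index at all; it can be a pure streaming pass. A minimal sketch (the function name and exact semantics are my own, not an existing gffutils interface):

```python
def exclude_types(lines, excluded):
    """Stream GFF lines, dropping records whose feature type (column 3)
    is in `excluded` (case-insensitive); comment lines pass through."""
    excluded = {t.lower() for t in excluded}
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2].lower() in excluded:
            continue
        yield line
```

Wired to sys.stdin/sys.stdout, this would make the second half of the pipe above work on plain GFF text with constant memory.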
Feature 2: Retrieve all exons common to each gene's mRNAs
gffutils --get-common-exons mygff.gff3
(as you described in your README.)
Feature 3: This one is less essential in my view, but it'd be nice to retrieve entries by region coordinates:
# Get all entries on chr5:500-50000
gffutils --fetch-region chr5:500-50000 mygff.gff3
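Without an index, region retrieval reduces to an interval-overlap scan over the file; a rough sketch of the idea, using the chrom:start-end syntax from the example above (names are hypothetical):

```python
def parse_region(region):
    """Parse a 'chrom:start-end' string into its parts."""
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)

def fetch_region(lines, region):
    """Yield GFF records overlapping the region (1-based, inclusive),
    i.e. record.start <= region.end and record.end >= region.start."""
    chrom, start, end = parse_region(region)
    for line in lines:
        if line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) >= 5 and f[0] == chrom and int(f[3]) <= end and int(f[4]) >= start:
            yield line
```

An indexed version would replace the linear scan with a bin or interval lookup, but the overlap test stays the same.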
In any case, combining features #1 and #2, you can easily now do things per gene or set of genes, which I think would be very helpful:
# Retrieve the common exons only for mygeneA and mygeneB
gffutils --fetch-gene mygeneA,mygeneB | gffutils --get-common-exons -
Feature 4: Trivial but useful: a kind of "sanitize" or clean-up utility (like your current "clean_gff") that checks the GFF for incomplete entries and fixes things like start > end. It would allow convenient piping like this:
# Sanitize a malformed GFF (that has start > end) and then retrieve entries for "mygene" from it
gffutils --clean bad.gff3 | gffutils --fetch-gene mygene -
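The sanitize step could also be a simple streaming pass. A minimal sketch, assuming only the two fixes mentioned above (drop incomplete entries, swap coordinates when start > end); a real clean-up would check many more invariants:

```python
def clean_gff(lines):
    """Drop records with missing columns and swap coordinates
    when start > end; comment lines pass through."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # incomplete entry: drop it
        start, end = int(fields[3]), int(fields[4])
        if start > end:
            fields[3], fields[4] = fields[4], fields[3]
        yield "\t".join(fields) + "\n"
```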
In summary, the ideal library would let one programmatically do everything needed to achieve Features #1-#4. The only thing that (for my use case at least) would be indispensable in a library, but unrelated to the command line, is the random access index for GFF. I'd love to be able to run distributed code using the library and get random access to very large GFF files this way. The way I do it now, using index_gff.py, is again pretty inelegant. As mentioned, it cannot result in deadlock on a distributed system, but the disadvantage is that it creates a proliferation of files and probably additional I/O overhead. However, the indexing cost is minimal (~10 mins per large GFF) and I think any user would be happy to pay this small startup cost if it later allows random access by gene and/or coordinate regions.
Also, I definitely agree with your idea that universal GFF <-> GTF <-> ... etc conversions would be very useful. One approach is to pick the format that is internally best (I think that might be GTF, since it might be the most "explicit") and internally convert all inputs to that format. For example, if a GFF file is passed, it can be internally converted to GTF, represented that way, and then optionally serialized back as GFF or GTF.
Best, --Yarden
P.S. @chapmanb's BCBio GFF parser (http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/) looks really nice from what I can tell, but the dependency on BioPython makes it unusable for me as a library, since I'd like to package this with MISO and users are already complaining about the various Python dependencies. (In fact, most people have trouble installing scipy, and I'm working to remove that requirement, which is unfortunate since scipy is valuable.) I also don't use BioPython for anything else at the moment, so I wouldn't want to tack it on as a dependency just for GFF parsing.
##
## Add introns to GFF file
##
import os
import time

import misopy
import misopy.gff_utils as gff_utils
import misopy.Gene as gene_utils
import misopy.utils as utils  # assumed location of trim_gff_ext()

def add_introns_to_gff(gff_filename, output_dir):
    """
    Add 'intron' entries to GFF.
    """
    output_basename = \
        utils.trim_gff_ext(os.path.basename(gff_filename))
    output_filename = \
        os.path.join(output_dir,
                     "%s.with_introns.gff" %(output_basename))
    print "Adding introns to GFF..."
    print " - Input: %s" %(gff_filename)
    print " - Output: %s" %(output_filename)
    gff_out = gff_utils.Writer(open(output_filename, "w"))
    gff_db = gff_utils.GFFDatabase(from_filename=gff_filename,
                                   reverse_recs=True)
    t1 = time.time()
    genes = gene_utils.load_genes_from_gff(gff_filename)
    for gene_id in genes:
        gene_info = genes[gene_id]
        gene_tree = gene_info["hierarchy"]
        gene_obj = gene_info["gene_object"]
        gene_rec = gene_tree[gene_id]["gene"]
        # Write the GFF record
        gff_out.write(gene_rec)
        # Write out the mRNAs, their exons, and then
        # input the introns
        for mRNA_id in gene_tree[gene_id]["mRNAs"]:
            curr_mRNA = gene_tree[gene_id]["mRNAs"][mRNA_id]
            gff_out.write(curr_mRNA["record"])
            # Write out the exons
            curr_exons = gene_tree[gene_id]["mRNAs"][mRNA_id]["exons"]
            for exon in curr_exons:
                gff_out.write(curr_exons[exon]["record"])
        # Now output the introns
        for isoform in gene_obj.isoforms:
            intron_coords = []
            for first_exon, second_exon in zip(isoform.parts,
                                               isoform.parts[1:]):
                # Intron start coordinate is the coordinate right after
                # the end of the first exon, intron end coordinate is the
                # coordinate just before the beginning of the second exon
                intron_start = first_exon.end + 1
                intron_end = second_exon.start - 1
                if intron_start >= intron_end:
                    continue
                intron_coords.append((intron_start, intron_end))
                # Create record for this intron, using its coordinates as ID
                intron_id = "%s:%d-%d:%s" %(gene_obj.chrom,
                                            intron_start,
                                            intron_end,
                                            gene_obj.strand)
                intron_rec = \
                    gff_utils.GFF(gene_obj.chrom, gene_rec.source, "intron",
                                  intron_start, intron_end,
                                  attributes={"ID": [intron_id],
                                              "Parent": [isoform.label]})
                gff_out.write(intron_rec)
    t2 = time.time()
    print "Addition took %.2f minutes." %((t2 - t1)/60.)
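The coordinate arithmetic at the heart of the script above is independent of MISO; stripped down to a standalone helper (my naming, purely illustrative), it amounts to:

```python
def intron_intervals(exons):
    """Given a transcript's exon (start, end) pairs (1-based, inclusive),
    return the intervening intron intervals.

    Each intron starts just after one exon ends and ends just before the
    next exon starts; as in the script above, gaps where start >= end
    (adjacent exons and single-base gaps) are skipped.
    """
    exons = sorted(exons)
    introns = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        start, end = left_end + 1, right_start - 1
        if start < end:
            introns.append((start, end))
    return introns
```

With a library that can hand back a gene's exons per isoform directly from an index, the rest of the script collapses to record construction and writing.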
One related question for you with respect to using gffutils :) I wrote above that a key issue for me was finding an indexed representation of GFF where distinct cluster jobs, on different nodes, can access the entries of different genes without deadlock. The simple-minded solution I had was just to split the GFF into an indexed representation where each gene is a separate file. gffutils's sqlite database is a much better solution, and nicer to work with than my pickled representation. Do you know if there will be issues if distinct jobs/threads try to access the same .db sqlite database? Although none will be writing to it, only reading it, I'm not sure if that will be an issue.
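For what it's worth, SQLite itself permits any number of simultaneous readers (reads take only shared locks), so read-only access from many jobs should be safe; contention only arises once a writer is involved. A toy demonstration using an illustrative schema (not gffutils' actual table layout):

```python
import os
import sqlite3
import tempfile

# Build a toy database standing in for a gffutils-style .db file.
path = os.path.join(tempfile.mkdtemp(), "ann.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")
con.execute("INSERT INTO features VALUES ('g1', 'gene')")
con.commit()
con.close()

# Several connections (standing in for cluster jobs) open the same file
# read-only via a URI; concurrent readers never block each other.
readers = [sqlite3.connect("file:%s?mode=ro" % path, uri=True)
           for _ in range(4)]
rows = [r.execute("SELECT featuretype FROM features WHERE id = 'g1'").fetchone()
        for r in readers]
for r in readers:
    r.close()
```

Opening with mode=ro also guards against a job accidentally taking a write lock.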
Anyway, I've been playing with gffutils now and think it's really great. If you'd like code contributions, let me know what the best way to proceed is (I can fork it and submit patches that are of potential interest, etc.) Thanks, --Yarden
Yarden and Ryan: I'd be happy to contribute code and help with this. I agree with all your points about GFF3 format peculiarities, and that a good first step is sorting/indexing a file so you can extract sections and reason about gene groupings.
My GFF parser exposes Biopython objects, but the internals use a simple dictionary to represent GFF lines, so this could be abstracted away if the dependency is a problem but the code is of interest. As an aside, Biopython is a much more lightweight dependency than SciPy: it has no hard dependencies on other packages.
From my point of view it would be great to have a single underlying Python parser and general representation. This could be supplemented with add-ons like indexing and Biopython object creation. Let me know how I can help, Brad
Yarden & Brad:
It's great to get you guys involved.
I'll try to go through your comments sequentially...
Yarden (API):
The piping CLI API looks very cool. Most of the functionality you're looking for is already implemented, at least in un-piped workflows. The challenge I think is in determining what is piped to other commands.
For example, this command:
gffutils --fetch-gene mygene my.gff3
would presumably create a db if one doesn't already exist, and return the results to stdout as a subset of the original GFF. Upon piping though,
gffutils --fetch-gene mygene | gffutils --exclude-types cdss,3p_utr,5p_utr
the second command would (I think) have to convert stdout to another temporary db, and then operate on that. Though I suppose whether or not a full db is created will depend on the command. So while the piping business would be very cool, I think it will take some thought to get right. I may open another issue to discuss this specific topic to keep things organized.
Yarden (multiple threads): I've tried multiple writes on sqlite3 dbs, and I definitely get deadlocks. I have successfully used multiple reads, but haven't really hit it that hard to push its limits (8 or fewer processes seem to work fine).
Yarden/Brad (dependencies): I agree with Brad that BioPython is not that bad of a dependency to require. Nothing near SciPy annoyances. However I do think it would be nice to have a no-dependency parsing core, with BioPython and other parts as sort of optional add-ons as Brad suggests.
Yarden/Brad (contributions): Yes please. Just fork and submit pull requests. Incremental, standalone pull requests are easiest to review, and including tests makes it that much easier for me (and hopefully makes a more robust library).
I'll email you each a document that may help orient you to the code and the strategies I ended up using so that you're not doing a cold read of the source.
In the end, I think a hybrid of Brad's parser and the sqlite db backend would make for a nice self-contained library.
Hi Brad and Ryan,
I think it'll be great to work on this together, and as Brad said, I'm happy to contribute code/tests/examples etc. I'll respond to some of Ryan's points below:
API:
I was using the *.gff3 in the command as an example. I'd be happy with it requiring an indexed db to do anything, rather than worrying about creating temporary DBs on the fly. One nice convention would be that if you give it my.gff3, it'll look for my.gff3.db (similar to BAM's convention of having a .bam and a .bam.bai). For the specific case of this command:
gffutils --fetch-gene mygene my.gff3
The most intuitive behavior for this in my view would be printing plain GFF output to stdout containing just the records for mygene, which the user can save / parse / reuse. You can imagine an optional flag --as-db mygene.db that instead dumps it to a mygene.db file where it is indexed, or a pipe that passes it on to the indexer (e.g. gffutils --fetch-gene mygene my.gff3 | gffutils --create-db - mygene.db). But in general, if you're pulling just one or a handful of genes from the GFF, the output won't be that large, so it doesn't seem necessary to index it on the fly. If gffutils didn't create the db on the fly but left it up to me to save the output of --fetch-gene and index it myself with a separate gffutils command, I would find that very reasonable. Basically, for viewing/slicing a GFF, I think there's no need to proliferate DBs under the hood, but I'm curious to know what you guys think.
I agree that relying on a db complicates the piping that I mentioned, though I think there are features that don't require a db at all, at a (typically minor) cost in speed. For example, gffutils --exclude-types cdss,3p_utr,5p_utr - can read a plain GFF from stdin and exclude the types in a streaming operation, with no need for a db. Anyway, I agree piping is not essential / core functionality, and I find it very reasonable to require users to pre-index the GFF before using most features of gffutils.
I second Ryan's comment that a combination of Brad's parser and sqlite db would be ideal. I like sqlite a lot for its portability, lack of dependencies (just needs Python) and the fact that other people using other languages can parse/query the database (which is not true of pickle or shelve.)
dependencies:
You guys are right that BioPython is not nearly as bad as scipy in terms of installation, but like Ryan suggested, the gff parsing functionality is so modular that I think it'd be ideal to keep it as a separate entity. Many of my projects would import gffutils and use it, but few of them might use BioPython, for example. It'd be nice to be able to just do pip install gffutils or easy_install gffutils and then include it as a dependency in setup.py so that other users can get it automatically. This of course could be done with BioPython too, but the modularity of gff parsing makes it easily separable, and a BioPython-friendly API could be included as an add-on/option like you both suggested.
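Declaring gffutils as a dependency in a downstream project's setup.py would follow the usual setuptools pattern; a sketch (the project name and package layout here are invented for illustration):

```python
# setup.py of a hypothetical downstream project that depends on gffutils;
# install_requires makes pip/easy_install pull it in automatically.
from setuptools import setup

setup(
    name="myproject",
    version="0.1",
    packages=["myproject"],
    install_requires=["gffutils"],
)
```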
As of 73d7df0a5471a641d7bb4506e3ffc9386bbe45c1 I added the script gffutils-cli. It should be usable as-is, though some pieces are still missing; so far, the subcommands fetch, children, parents, create, and search are working.
It's set up to work like git/samtools/bedtools etc with subcommands.
Creation:
gffutils-cli create $GFF --output $DB
Fetching (essentially using the db simply as an indexed GFF):
gffutils-cli fetch $DB gene1,gene2
Up or down the hierarchy:
gffutils-cli children $DB gene1
gffutils-cli parents $DB exon5 --exclude mRNA --exclude-self
Full-text search on the attributes field; providing a featuretype dramatically speeds things up over grep:
gffutils-cli search $DB "Name=long_gene_name" --featuretype gene
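The speedup from providing a featuretype is what you'd expect from SQL: an indexed type column narrows the candidate rows before the expensive LIKE scan over the attributes text, whereas grep must scan every line. A toy illustration with a made-up schema (not gffutils' actual table layout):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE features (id TEXT, featuretype TEXT, attributes TEXT)")
con.execute("CREATE INDEX idx_type ON features (featuretype)")
con.executemany("INSERT INTO features VALUES (?, ?, ?)", [
    ("g1", "gene", "ID=g1;Name=long_gene_name"),
    ("e1", "exon", "ID=e1;Parent=m1"),
])
# The indexed featuretype predicate prunes rows before the LIKE match.
rows = con.execute(
    "SELECT id FROM features WHERE featuretype = ? AND attributes LIKE ?",
    ("gene", "%Name=long_gene_name%"),
).fetchall()
```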
Still working on the subcommands clean, region (pending integration of the UCSC binning stuff I wrote yesterday), common, and the --limit arg to parents and children. I also still haven't worked out what to do about piping. But hopefully it's a start, and I think it's organized cleanly enough to easily add to and upgrade over time.
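For reference, the "UCSC binning stuff" mentioned above presumably refers to the genome browser's standard binning scheme; the classic binFromRange calculation, transcribed here as a sketch:

```python
def region_to_bin(start, end):
    """Classic UCSC binning: map a 0-based half-open interval to the
    smallest bin fully containing it.  Storing each feature with its bin
    lets a region query scan a handful of bins instead of a whole table."""
    bin_offsets = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
    start_bin = start >> 17          # finest bins are 128 kb wide
    end_bin = (end - 1) >> 17
    for offset in bin_offsets:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= 3              # each level up is 8x coarser
        end_bin >>= 3
    raise ValueError("interval out of range: %d-%d" % (start, end))
```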
(side note: I am now a fan of the argh package (http://argh.readthedocs.org/en/latest/))
@chapmanb pointed out that this possible but quite ambitious Biopython GSoC 2013 idea of mine has some overlap (though what I had in mind was much more general than just GFF, covering other annotation-rich sequence formats as well): http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010473.html http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010476.html
What sort of functionality should be exposed to the CLI?
For example, a universal GTF<->GFF<->refFlat conversion script would be useful.