daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License

Optimization of database creation #20

Closed. yarden closed this issue 7 years ago.

yarden commented 11 years ago

Is it possible to optimize database creation? It'd be interesting to see where the bottlenecks are. Would Cythonizing certain parts of the code help with this, or is the bottleneck purely at the sqlite interface? For example, creating a database for this GFF (Ensembl mouse genes) takes ~15 minutes:

http://genes.mit.edu/yarden/Mus_musculus.NCBIM37.65.gff
$ gffutils-cli create Mus_musculus.NCBIM37.65.gff

The bottleneck is there even when dbs are created in memory, though it's of course smaller in that case. This is related to another issue we discussed: how to create derivative GFF files from existing ones, i.e. how to iterate through a GFF and write out a new version of it. For example, if you wanted to iterate through the above GFF and simply add an optional key/value pair to the attributes of some records. This is a "streaming" operation that can be done line-by-line and doesn't necessarily need a db. The overhead of creating the db makes gffutils impractical for these kinds of simple operations on large-ish GFF files.
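A minimal sketch of this kind of streaming edit, for concreteness (plain Python, no gffutils; assumes GFF3-style key=value attributes in column 9, and the key/value names here are just placeholders):

import sys

def add_attribute(infile, outfile, key="my_tag", value="1"):
    # Stream a GFF3 file and append key=value to every feature's attributes.
    with open(infile) as fin, open(outfile, "w") as fout:
        for line in fin:
            # Pass comments/directives and blank lines through untouched.
            if line.startswith("#") or not line.strip():
                fout.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            # Column 9 holds the attributes string.
            fields[8] = fields[8].rstrip(";") + ";%s=%s" % (key, value)
            fout.write("\t".join(fields) + "\n")

if __name__ == "__main__":
    add_attribute(sys.argv[1], sys.argv[2])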

There are more sophisticated operations (i.e. non-streaming ones) where a naive in-memory solution is still considerably faster because of the database creation bottleneck. For example, the naive GFF parser used in misopy.gff_utils (see load_genes_from_gff in https://github.com/yarden/MISO/blob/fastmiso/misopy/Gene.py) simply iterates through the GFF file multiple times to collect all the gene entries into a simple Gene object, with mRNAs represented as lists of exons. This kind of gene-centric in-memory solution is less expressive than gffutils (it does not handle arbitrary nesting, and basically ignores non-genes), but for simple operations like "Add attribute X to all genes" or "Reformat the IDs of this GFF" it's considerably faster. It's not actually faster once the db is created; gffutils retrieval is excellent, but the overhead of creating the db trumps the computation time for many of these operations.

In summary, I'm wondering if there's a way to bridge the gap between the fast but hacky solutions and the sophisticated gffutils solution that comes with an overhead. I think this is an important issue because many of the operations done on GFFs (at least the ones I do) don't require many hierarchical SQL queries.

daler commented 11 years ago

Yeah, there's a LOT of optimization that can be done, especially in the refactor branch!

If you do %run -p scriptname.py in IPython, you can profile the code. In a lot of my tests, the _split_keyvals() function is the rate-limiting step. The problem is that this function does a lot of the dialect detection -- essentially trading speed for flexibility in input format.

Restructuring _split_keyvals(), or re-writing in Cython, would be a good start to speeding things up.
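For reference, a similar profile can be collected outside IPython with cProfile; a minimal sketch (the GFF filename is the one from the first comment, and is just a placeholder):

import cProfile
import pstats

import gffutils

# Profile an in-memory db creation and dump the stats to a file.
cProfile.run(
    "gffutils.create_db('Mus_musculus.NCBIM37.65.gff', ':memory:')",
    "create_db.prof")

# Show the 20 most expensive calls by cumulative time.
pstats.Stats("create_db.prof").sort_stats("cumulative").print_stats(20)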

More directly related to the db import, there are actually 2 code paths in _GFFDBCreator.populate_from_lines() where I was doing some profiling (I left the comments in there with some results).

So: yes, lots of optimization to be done, it just hasn't been a priority in the refactor branch.

daler commented 11 years ago

After some profiling work, I found out that OrderedDicts are pretty slow. Here's a script to profile different implementations: https://gist.github.com/daler/5588154. It tries to mimic attributes of a GFF file being parsed into a dictionary.

Short answer: changing over from the current DefaultOrderedDict to a naive ordered dict gives a 2.5 to 5.7x speed increase (!). Details below:

Flavors: the dict implementations compared (including the current DefaultOrderedDict and a naive ordered dict) are in the gist linked above.

Tests

5 keys, 3 vals per key

nlines = 200000
nkeys = 5
vals_per_key = 3

# makes this dictionary 200K times:
{0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2], 3: [0, 1, 2], 4: [0, 1, 2]}

1 key, 1 val per key

nlines = 200000
nkeys = 1
vals_per_key = 1

# makes this dictionary 200K times:
{0: [0]}
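A stripped-down version of that comparison, for anyone who wants to reproduce the ordered-dict overhead locally (this only mimics the attribute-building loop, not real parsing; note that on modern CPython plain dicts preserve insertion order anyway, so the gap may be smaller than reported above):

import timeit
from collections import OrderedDict

nlines, nkeys, vals_per_key = 200000, 5, 3

def build(factory):
    # Build one small attributes-like dict per "line".
    for _ in range(nlines):
        d = factory()
        for k in range(nkeys):
            d[k] = [v for v in range(vals_per_key)]

print("OrderedDict:", timeit.timeit(lambda: build(OrderedDict), number=1))
print("plain dict: ", timeit.timeit(lambda: build(dict), number=1))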
yarden commented 11 years ago

Interesting! I wonder if some sort of arrangement with named tuples (or a totally Cythonized data structure instead) would be best. The dictionaries belonging to particular records in GFFs are always very small (i.e. few keys), so the overhead of a hash is not always justified.
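To make the named-tuple idea concrete, a hedged sketch: a fixed-field container for the handful of GTF attribute keys a pipeline cares about. The field names are illustrative; real GFF/GTF attributes vary per record, which is exactly what the dialect handling has to cope with:

from collections import namedtuple

# Hypothetical fixed set of attribute keys.
GTFAttributes = namedtuple("GTFAttributes",
                           ["gene_id", "transcript_id", "exon_number"])

attrs = GTFAttributes(gene_id="ENSG00000223972.4",
                      transcript_id="ENST00000456328.2",
                      exon_number="2")

print(attrs.gene_id)  # tuple-backed access, no per-instance dict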

Is it correct to assume that the core code still uses defaultdicts? With the new commits, I'm not seeing a difference in speed for operations like database creation, as in this snippet from helpers.py:

    # Return sanitized GFF database
    sanitized_db = gffutils.create_db(sanitized_iterator(), ":memory:",
                                      verbose=False)

In my analysis that's mostly where the bottleneck is. --Yarden

daler commented 11 years ago

Right, I haven't changed anything yet in response to this new info. I'll look into named tuples -- I didn't think of that, thanks. Good point about the hash overhead . . . time for some more benchmarks.

daler commented 10 years ago

Hi Yarden - I've made some optimizations resulting in better performance than even the original master branch. Using http://genes.mit.edu/yarden/Mus_musculus.NCBIM37.65.gff:

refactor branch (as of 0e584d2384f572cd2980b44ffc0973354ab92ec1)

$ time gffutils-cli create Mus_musculus.NCBIM37.65.gff
real    3m45.298s
user    2m40.716s
sys 0m12.088s

master branch

$ time gffutils-cli create Mus_musculus.NCBIM37.65.gff
real    4m55.100s
user    2m14.044s
sys 0m12.908s

Most of this came from converting the attributes of each feature into a plain ol' dictionary. But I wanted to maintain the information about the order of attributes within each feature so that GFF -> db -> GFF gets you exactly the same GFF that you put into the db in the first place. That order information is now stored once for the entire db in the dialect.

Can you test and see if it's a reasonable speed?
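One quick way to check the GFF -> db -> GFF round trip described above (a sketch using FeatureDB.all_features() from the current API; printing a Feature reproduces its GFF line, with attribute order taken from the db-wide dialect):

import gffutils

db = gffutils.create_db("Mus_musculus.NCBIM37.65.gff", ":memory:")

with open("roundtrip.gff", "w") as out:
    for feature in db.all_features():
        out.write(str(feature) + "\n")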

olgabot commented 10 years ago

Related, is it possible to parallelize the code? I'm making a FeatureDB from Gencode v19 and it's taking quite a long time. It would be great if the code was parallelizable and the database creation could be farmed out to a cluster.

daler commented 10 years ago

Hi Olga -

sqlite doesn't allow concurrent writes, so each cluster node would have to create its own db, and then all of these databases would need to be merged back together. Since the GFF/GTF format allows features to be in any order, it's possible to have pathological cases where two exons for a transcript are on the first and last lines of the file, respectively. I think this means that each db would need to be read in its entirety to ensure all parent/child relations are accounted for.

I don't have a good feel for how this would compare to going through the file on a single core, or what the optimal number of cluster nodes would be. My hunch is that creating dbs in parallel, and then using sqlite3 to sort and merge may be faster than a single CPU. I'll have to work on tests for this.

I mostly use GFF, and I'm happy with the current speed (@yarden, thoughts on this?). I have not spent as much time optimizing GTF, so I think the first step will be to see if I can get GTF on a single node to work at a reasonable speed.

daler commented 10 years ago

I downloaded the Gencode v19 GTF and tried creating a FeatureDB. It was unacceptably slow, and I killed it after 2 hrs. I made some changes to gffutils, and it now runs in about 12 mins (see below).

For future reference, there are two reasons for the speedup. First, I added temporary indexes just prior to inferring gene and transcript extents for GTF files in _GTFDBCreator._update_relations():

CREATE INDEX relationsparent ON relations (parent);
CREATE INDEX relationschild ON relations (child);

This sped up the nested subqueries for inferring gene extent. But I noticed that importing those inferred features was taking a long time, and seemed like it was hitting a lot of ID collisions.

Looking more closely, the Gencode v19 GTF file already has genes and transcripts inferred. So gffutils was spending a lot of effort to infer the extents of genes and transcripts. Then it would re-import them . . . only to find that they already existed in the database, which meant more effort figuring out how to merge them!

So the second reason it's faster now is that I added an infer_gene_extent kwarg to gffutils.create_db(), and a corresponding --disable-infer argument to the gffutils-cli command-line tool.

With these new changes (as of 4cb0cb810887cb24d81f941bfb2d5fcc6947f4a2), it takes about 12 min to create a database out of the 2.6 million line Gencode v19 GTF file:

$ time gffutils-cli create gencode.v19.annotation.gtf --disable-infer
# real  11m38.776s
# user  9m44.689s
# sys   1m13.821s

Would you try this out to see if it works for you? Does this seem like a reasonable amount of time?
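For reference, the Python-API equivalent of that command would presumably look like the sketch below (kwarg name taken from the comment above; later comments in this thread use disable_infer_genes/disable_infer_transcripts, which replaced this kwarg in newer gffutils versions):

import gffutils

# Skip re-inferring genes/transcripts that Gencode GTFs already contain
# (the Python counterpart of the --disable-infer CLI flag).
db = gffutils.create_db(
    "gencode.v19.annotation.gtf",
    "gencode.v19.annotation.gtf.db",
    infer_gene_extent=False)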

olgabot commented 10 years ago

Thank you for the response! I have been trying to make a Gencode db for a week now, but it keeps taking a really long time. My current run has been going for 51 hours on the cluster...

Without a transform, those improvements are quite fast. At about 6 minutes in on the cluster, it's halfway done (~1.3mil/2.6mil features), close to your 12 min benchmark:

import gffutils

v19gff = '/home/obotvinnik/scratch/genomes/hg19/gencode/v19/gencode.v19.annotation.gtf'
v19db_filename = v19gff + '.test_no_fancy_id.db'

v19db = gffutils.create_db(v19gff, v19db_filename, merge_strategy='merge', 
                        id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'}, force=True, verbose=True)

But I'd like to transform the IDs of exons, CDS, stop/start codons (basically everything except gene and transcript) to be feature:chrom:start-stop, e.g. CDS:chr1:555555-666666, rather than numbered, and this is still slow (8 min in, 47k/2.6mil features). Do you have any suggestions?

import gffutils

v19gff = '/home/obotvinnik/scratch/genomes/hg19/gencode/v19/gencode.v19.annotation.gtf'
v19db_filename = v19gff + '.test.db'

def transform(f):
    features = ['gene_name', 'exon_number', 'gene_id', 'transcript_id', 'exon_id']
    if set([f.featuretype]) & set(('gene', 'transcript')):
        return f
    else:
        exon_location = '{}:{}:{}-{}:{}'.format(f.featuretype, f.seqid, f.start, f.stop, f.strand)
        exon_id = exon_location
        f.attributes['fancy_id'] = [exon_id]
    return f

v19db = gffutils.create_db(v19gff, v19db_filename, merge_strategy='merge', 
                        id_spec={'gene': 'gene_id', 'transcript': 'transcript_id',
                                 'exon':'fancy_id', 'CDS':'fancy_id',
                                 'start_codon':'fancy_id', 'stop_codon':'fancy_id',
                                 'UTR':'fancy_id'}, transform=transform, force=True, verbose=True)
olgabot commented 10 years ago

Read your post more carefully and am now trying with infer_gene_extent=False, with the transform. I'll let you know how it goes.

daler commented 10 years ago

Hmm, I bet it will still be slow. It's going to try and merge together all the attributes of features with the same ID, of which there will be many using this transform. In the first 10k lines, it looks like there are quite a few features that will have many duplicate IDs:

cut -f 1,3-5 ../gencode.v19.annotation.gtf \
  | head -n 10000 \
  | sort -k1,1 -nk2,2 -nk3,3 \
  | uniq -c \
  | sort -nrk1 \
  | head -n 10

#     23 chr1   exon    1256376 1256473
#     21 chr1   exon    1559154 1559325
#     20 chr1   exon    1248889 1248972
#     19 chr1   exon    2066701 2066786
#     19 chr1   exon    1563869 1564102
#     19 chr1   exon    1560175 1560281
#     19 chr1   exon    1255836 1255909
#     19 chr1   exon    1250900 1250998
#     19 chr1   exon    1250784 1250818
#     18 chr1   exon    2080311 2080363

Or more precisely, to get an idea of the number of name collisions:

cut -f 1,3-5 gencode.v19.annotation.gtf \
  | grep -vE "gene|transcript" \
  | head -n 10000 \
  | sort -k1,1 -nk2,2 -nk3,3 \
  | uniq -c \
  | awk 'BEGIN {count=1}{if ($1 > 1) count +=$1-1;} END {print count}'
# 5344

If I got this right, more than half (!) of all non-gene, non-transcript features will have name collisions that will need to be merged. I'm not sure there's a way to speed this up.

Anyway, the result should be a nice clean db with unique feature IDs for each genomic region, and each region will have a list of transcript_id parents . . . but it will take some time.

olgabot commented 10 years ago

Yes, there are a lot of ID collisions when it comes to exons in hg19. But that's exactly what I want - I want each exon to be part of multiple transcripts of a single gene.

I'm trying it out with 'exon': 'exon_id' in the id_spec instead of using the transform, and it seems to be taking just as long as the transform. So yes, it's the merging step that's taking the longest.

yarden commented 10 years ago

This is a minor point, but it seems very inefficient/unnecessary to create two tiny set() objects each time transform gets called.

olgabot commented 10 years ago

Ah yes good point. I'll change that in the code.


daler commented 10 years ago

The "merge" merge strategy starts here: https://github.com/daler/gffutils/blob/master/gffutils/create.py#L175

Briefly, this method checks each incoming feature against the feature already in the db with the same ID (and against previously-observed duplicates), and merges attributes when everything outside the attributes field matches.

This "previously-observed duplicates" step was added in 030737ba3a54132fc693885e72d28dc4762d3c70. It makes duplicate-finding much more robust, but I suspect it's slowing things down. Time for some tests to see if it's possible to disable in some cases . . .

daler commented 10 years ago

OK, I figured out what's causing the slowdown -- in this Gencode GTF, exons can have the same chrom/start/stop/strand but have different sources, like these two exons:

chr1    HAVANA  exon    12613   12721   .   +   .   transcript_status "KNOWN"; gene_status "KNOWN"; havana_gene "OTTHUMG00000000961.2"; level "2"; transcript_type "processed_transcript"; tag "basic"; gene_id "ENSG00000223972.4"; exon_id "ENSE00003582793.1"; exon_number "2"; fancy_id "exon:chr1:12613-12721:+"; havana_transcript "OTTHUMT00000362751.1"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; transcript_name "DDX11L1-002"; gene_name "DDX11L1";
chr1    ENSEMBL exon    12613   12721   .   +   .   gene_status "KNOWN"; exon_number "2"; level "3"; transcript_type "transcribed_unprocessed_pseudogene"; gene_id "ENSG00000223972.4"; exon_id "ENSE00003608237.1"; transcript_id "ENST00000515242.2"; fancy_id "exon:chr1:12613-12721:+"; havana_gene "OTTHUMG00000000961.2"; transcript_name "DDX11L1-201"; gene_type "pseudogene"; transcript_status "KNOWN"; gene_name "DDX11L1";

Currently, merging in gffutils requires all fields except the attributes field to be identical (chrom, source, start, stop, etc). Here they are not, since the first source is HAVANA and the second is ENSEMBL. So gffutils treats them as different features that evaluate to the same ID spec. It then gives the second feature an auto-incremented primary key for the db, so you can still access both features separately. This adds bloat to the duplicates table, slows things down, and in the end does not give you the db that I think you're going for, because the second feature will have an ID of exon:chr1:12613-12721:+_1.

One solution is to modify your transform func to set all sources to be the same. I tried this, and it speeds things up a little . . . but then I realized there are other cases where exons have the same chrom/start/stop/strand but different frames. If you don't care about frame, you can set those all to "." as well in your transform func. Or you can add frame to the fancy ID. Both of these fix the slowdown, creating a merged db in ~17 mins.
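A hedged sketch of that first workaround, a transform that blanks out the fields blocking the merge (only appropriate if per-source and per-frame information is not needed):

def normalize_transform(f):
    # Same fancy_id scheme as above, but also blank out source and frame
    # so otherwise-identical exons merge cleanly.
    if f.featuretype not in ('gene', 'transcript'):
        f.source = '.'
        f.frame = '.'
        fancy_id = '{}:{}:{}-{}:{}'.format(
            f.featuretype, f.seqid, f.start, f.stop, f.strand)
        f.attributes['fancy_id'] = [fancy_id]
    return f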

Alternatively, I can add a mechanism for specifying which non-attribute fields should be ignored when doing a merge. Maybe a force_merge_fields kwarg to create_db(). In this case, you'd specify force_merge_fields=["source", "frame"] to indicate that features with differing sources or frames should still be merged.

This leads to a design question: how should the merged fields be reported? My initial thought is a comma-separated list, like "HAVANA,ENSEMBL". But then how to merge exons with different frames? Comma-separated frames aren't supported by the GFF/GTF spec, and may break downstream code. Any suggestions?

olgabot commented 10 years ago

Hmm, I hadn't even noticed the different source cases! However, it seems as though this exon is part of different transcripts, because the HAVANA version has transcript id ENST00000456328.2 and the ENSEMBL version has ENST00000515242.2. Since the frame is the same and the ENSEMBL exon ID is the same (ENSE00003582793.1), ultimately this is the same locus, just from different sources. So I think a list like sources = ['HAVANA', 'ENSEMBL'] is just fine.

I care a lot about frame so I'd rather not remove it. Can frame be a list like the values in attributes? Though somehow those values would need to be tied to which transcripts produce one frame or another, so it seems a better idea to include frame in the ID, e.g. CDS:chr1:12613-12721:+:. (since frame is only specified for CDS features).

Also, is it possible for CDS features to be children of exons?

daler commented 10 years ago

I'll work on merging non-attribute fields.

For the reasons you point out, I agree that it's best to add frame to the custom ID in this case.

As for your question,

Also, is it possible for CDS features to be children of exons?

Probably . . . would you define this based on coordinates alone? Or I suppose the custom IDs, where exon:chr1:100-200:+:. is the parent of CDS:chr1:100-200:+:2? This would be sort of a retrofit on the database after it's created. All you need to do is add parent, child, and level to the relations table. I can add a helper method to do this.
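Until such a helper exists, the retrofit could be sketched directly with sqlite3 (column names taken from the relations table indexed earlier in this thread; the parent/child IDs here are hypothetical and would in practice be computed by matching exon and CDS coordinates):

import sqlite3

conn = sqlite3.connect("gencode.v19.annotation.gtf.db")
conn.execute(
    "INSERT INTO relations (parent, child, level) VALUES (?, ?, ?)",
    ("exon:chr1:100-200:+:.", "CDS:chr1:100-200:+:2", 1))
conn.commit()
conn.close()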

I'll create separate issues for these.

olgabot commented 10 years ago

That would be great, thanks! I'll add the frame to the ID as well.

re: CDS - there are a couple of issues:

  1. CDSs can be shorter than the exons (e.g. where an exon includes UTR), so going off of coordinates alone won't work, even though the exon IDs will be the same.
  2. Which brings me to... In ENSEMBL, exons have different ENSE ids if they are of different frames. Ideally, the same exon locus exon:chr1:100-200:+:. would be a parent to both CDS:chr1:100-200:+:2 and CDS:chr1:100-150:+:0. So somehow the parent exon locus would have to match up with the different ENSE ids of the CDSs.
daler commented 10 years ago

I created issues #27 and #28 for this; can you take a look and comment?

olgabot commented 10 years ago

This is still taking a while for me... On our cluster (1 node, 1 processor, 16 GB RAM), I'm only ~1 million features in after about a day (currently Fri Jan 3, ~21:01):

import gffutils

v19gff = '/home/obotvinnik/scratch/genomes/hg19/gencode/v19/gencode.v19.annotation.gtf'
v19db_filename = v19gff + '.db'

gene_transcript = set(('gene', 'transcript'))

def transform(f):
    if f.featuretype in gene_transcript:
        return f
    else:
        exon_location = '{}:{}:{}-{}:{}'.format(f.featuretype, f.seqid, f.start, f.stop, f.strand)
        exon_id = exon_location
        if f.featuretype == 'CDS':
            exon_id += ':' + f.frame
        f.attributes['fancy_id'] = [exon_id]
        return f

In [2]: ! date
Thu Jan  2 21:37:58 PST 2014

In [3]: %time v19db = gffutils.create_db(v19gff, v19db_filename, merge_strategy='merge', id_spec={'gene': 'gene_id', 'transcript': 'transcript_id','exon':'fancy_id', 'CDS':'fancy_id','start_codon':'fancy_id', 'stop_codon':'fancy_id','UTR':'fancy_id'}, transform=transform, force=True, verbose=True, infer_gene_extent=False, force_merge_fields=['source'])
Populating features table and first-order relations: 949000 features

Any suggestions?

(btw haven't tried out the add relations stuff yet - still trying to get this to work)

daler commented 10 years ago

Hmm, that is strange. It's taking me about 20 mins on my desktop or on a cluster node. I checked the latter in case it was a file system problem, but it doesn't seem to be. What version of sqlite are you running? The db uses WAL journaling mode if you're running >3.7.0 (http://www.sqlite.org/wal.html) which apparently can have drastic speed improvements, though I haven't quantified that.
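To check which journal mode a given database file is actually using (and the underlying sqlite library version), a quick sketch:

import sqlite3

print(sqlite3.sqlite_version)  # WAL requires sqlite >= 3.7.0

conn = sqlite3.connect("gencode.v19.annotation.gtf.db")
print(conn.execute("PRAGMA journal_mode").fetchone())  # e.g. ('wal',)
conn.close()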

I can send you the db if you'd like -- I copied your code into the script here and ran it: https://gist.github.com/daler/8259098

olgabot commented 10 years ago

Thanks for the offer - could you send it to obotvinn@ucsd.edu?

I'd still like to figure this out for future cases. The version of sqlite we have is 3.6.20, so I'll try upgrading to 3.7 (and hopefully not break anything....) and see if that improves it.

In [3]: import sqlite3

In [4]: sqlite3.version
Out[4]: '2.6.0'

In [5]: sqlite3.sqlite_version
Out[5]: '3.6.20'
daler commented 10 years ago

Hi Olga - As of 1ac2d0c893bd4e063627364c39f34d37c173b4c1, the journal_mode is now MEMORY (see #33) instead of the WAL that was giving you trouble on sqlite <3.7. If you're still having issues with large files on older sqlite3 versions, could you see if this now works for you?

olgabot commented 10 years ago

Just wanted to follow up - my project is sinusoidal in its requirement for gffutils and it's now in a "peak" again. This seems to work! The Gencode V19 db took 26 min to create with the previous code, so looks like it's working well!

daler commented 10 years ago

Hah, sinusoidal. Glad it's working now, and thanks for the follow-up.

mlovci commented 9 years ago

Hi @daler and @olgabot

I tried the script here: https://gist.github.com/daler/8259098 on gencode v19 and it took 6 days and ended in an sqlite3 error. Unfortunate.

I changed some parameters until it started working better (biology style!) and found that this combination of parameters for create_db works in about 30 minutes on my system:

gffutils.create_db("gencode.v19.annotation.gtf", "gencode.v19.annotation.db",
                   keep_order=False, sort_attribute_values=False,
                   id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'},
                   verbose=True, merge_strategy='merge', infer_gene_extent=False)

Hopefully this helps someone trying to index their own database. I'm not sure whether it's the transform or the extra fields in id_spec that are causing the slowdown or even something else.

Best, Mike

daler commented 9 years ago

@mlovci - yikes, 6 days is unacceptable. In order for me to troubleshoot this, can you provide some more info? Specifically:

1) Did you use the exact file from

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz?

If not, can you send a link to the exact file you used?

2) What OS are you using?

3) What version of sqlite do you have installed? Use

import sqlite3
print sqlite3.version
print sqlite3.sqlite_version

Thanks.

mlovci commented 9 years ago

1) Did you use the exact file from

yes

2) What OS are you using?

Ubuntu; Intel(R) Xeon(R) CPU @ 2.40GHz; 32 GB RAM

3) What version of sqlite do you have installed? Use

In [1]: import sqlite3

In [2]: print sqlite3.version
2.6.0

In [3]: print sqlite3.sqlite_version
3.8.4.1

This isn't urgent, the new parameters are working for me, but thanks for checking.

daler commented 9 years ago

Thanks. I also forgot to ask, what version of gffutils are you using and how did you install it?

It's great that you got it working using the new parameters (and thanks for reporting it), but I really should figure out exactly why it was taking so long for you in the first place.

mlovci commented 9 years ago
>>> gffutils.__version__
'0.8.2'

I installed with plain old pip install gffutils first, then when that didn't work quickly I did:

git clone https://github.com/daler/gffutils.git
pip uninstall gffutils
cd gffutils
pip install .
daler commented 9 years ago

OK. I'll look into this some more and report back.

pavo commented 8 years ago

Any luck with this? I am trying to graduate from the examples to the human genome (e.g. ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz) and some ChIP-seq data downloaded from the SRA. Do you have any advice on a dataset or database to try? I don't mind waiting several days for this, but I hoped I could get started on something in the meantime, especially since it might crash and I would have to start over again. The original 1.4GB gtf file pretty rapidly gets converted to a 2.3GB db file, but then the db file stays that size and the single CPU keeps chugging away at 99% utilization for hours (the 30 other CPUs stay at 0%). Memory utilization on my system peaks at about 0.1%. I have no indication of whether the process is going well, whether it has crashed, or, for that matter, how long it might take.

daler commented 8 years ago

I was never able to reproduce the error. It could have been a hardware-related issue, since sqlite3 can be slow on NFS filesystems. I will run the 84 Ensembl GTF file and report back here.

daler commented 8 years ago

Also, you mention ChIP-seq data -- is that related to this db creation issue in some way?

daler commented 8 years ago

Are you using the disable_infer_genes and disable_infer_transcripts kwargs? The following code:

# using:
# http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr.gtf.gz
import gffutils
gffutils.create_db(
    'Homo_sapiens.GRCh38.84.chr.gtf.gz',
    'Homo_sapiens.GRCh38.84.chr.gtf.db',
    disable_infer_genes=True,
    disable_infer_transcripts=True)

took under half an hour:

real    24m56.864s
user    18m54.308s
sys     0m39.158s

Can you provide the exact commands you're using?

pavo commented 8 years ago

I did this:

import gffutils
fn = "Homo_sapiens.GRCh38.84.chr.gtf"  # this was gunzipped into the working directory
db = gffutils.create_db(fn, dbfn='test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True)

It looks like my processes finished overnight. I did not set disable_infer_genes=True and disable_infer_transcripts=True as you did in your code. I naively thought that I would want the gene information and transcript information, so I had kept these on. Is it ok to use the databases I created overnight?

Any idea why, with my code, the database size ramped up quickly to 2.3GB and then basically remained the same size while the CPU worked all night? It sounds like inferring genes and transcripts is what took such a long time, so I was wondering what that computation is (a sort operation?). The final database size this morning is 2.4GB, not much different from the 2.3GB it ramped up to in about 20-30 minutes. I thought that the create_db() function was simply packaging the GTF file's column data into an SQL database, but clearly there is more computation going on!

daler commented 8 years ago

You want disable_infer_transcripts=True and disable_infer_genes=True for this file. Ensembl's GTFs are not true GTFs since they contain lines for genes and transcripts already. If you're using a recent version of gffutils, you should get a warning saying that there were genes/transcripts in your GTF already and that it could take a while if you don't disable inferring genes/transcripts.

Putting the GTF into the features table and relations table is just the first step. When gffutils detects a GTF file, it does a lot of work to infer transcripts and genes. After it infers each gene or transcript, it sees that there is already a transcript or gene in the database with the same ID. That's because this file contains genes and transcripts, so they've already been imported by this point. Then gffutils does more work to merge the information from the gene or transcript it just inferred with the one that already exists in the database.

So the extra time is due to merge operations caused by name collisions, which in turn were caused from inferring genes and transcripts that already exist in the GTF file. I'd re-run using those disable_infer_* args. It shouldn't take long, and if it takes longer than an hour let me know.

iriet commented 7 years ago

Hi @daler,

I am trying to use gffutils to work with the GRCh38 RefSeq annotations, and the database creation is taking quite some time. I am hoping you could throw us a few tips on options we should be flipping on/off. The goal of our use case is to parse exon coordinates for alternative splice variants, referenced to their genomic locations. We would like to keep the affiliations of alternate variants with their primaries, and coordinate annotations for 5'/3' UTRs too, ideally.

The gff file was obtained from here:

ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz

Our attempts ran for a couple of hours on my student's machine before we aborted the process. Then, I also tried it for an hour before aborting, on my (newer) 2.9 GHz Intel Core i7 laptop. The latter is running macOS Sierra 10.12.5 with gffutils version 0.8.7.1 which was installed using conda version 4.3.18. The python is 3.6.0 and the sqlite is version 3.13.0.

thanks for any help, Takeshi

daler commented 7 years ago

It should take <15 mins. Are you using disable_infer_genes=True and disable_infer_transcripts=True as described above?

daler commented 7 years ago

Wait, sorry, I thought this was related to the above comments. The file you link to is a GFF, so there is no gene or transcript inference by default. It appears the issue with this file is the duplicated IDs.

Databases need a unique ID for each feature. The ID field in this file is not unique -- for example here are the features for ID=cds1:

NC_000001.11    Gnomon  CDS 182709  182746  .   +   0   ID=cds1;Parent=rna37;Dbxref=GeneID:102725121,Genbank:XP_011542110.1;Name=XP_011542110.1;gbkey=CDS;gene=LOC102725121;product=uncharacterized protein LOC102725121 isoform X1;protein_id=XP_011542110.1
NC_000001.11    Gnomon  CDS 183114  183240  .   +   1   ID=cds1;Parent=rna37;Dbxref=GeneID:102725121,Genbank:XP_011542110.1;Name=XP_011542110.1;gbkey=CDS;gene=LOC102725121;product=uncharacterized protein LOC102725121 isoform X1;protein_id=XP_011542110.1
NC_000001.11    Gnomon  CDS 183922  184158  .   +   0   ID=cds1;Parent=rna37;Dbxref=GeneID:102725121,Genbank:XP_011542110.1;Name=XP_011542110.1;gbkey=CDS;gene=LOC102725121;product=uncharacterized protein LOC102725121 isoform X1;protein_id=XP_011542110.1

If you use merge_strategy="merge", then gffutils assumes the lines refer to the same feature and so does a lot of work to merge the attributes in a nice way. Looking at this file though, the CDSs should definitely be considered different features.

You'll need to decide how you want to be able to refer to CDSs. If you don't really have an opinion on that, you can try the merge_strategy="create_unique" argument when creating the database which should speed things up considerably. The features above will then be called cds1, cds1.1, and cds1.2. Alternatively you can write a transform function to do arbitrary manipulation of the features before they get into the db, for example to create your own custom ID field based on the other attributes.

Give merge_strategy="create_unique" a try to see if it helps the speed issue. It still should run in <15 mins.
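For concreteness, a sketch of the create_unique suggestion (the output filename is a placeholder; a custom-ID transform would follow the same pattern as the fancy_id examples earlier in this thread):

import gffutils

db = gffutils.create_db(
    "GRCh38_latest_genomic.gff",
    "GRCh38_latest_genomic.db",
    # Duplicate IDs become cds1, cds1.1, cds1.2, ... instead of being merged.
    merge_strategy="create_unique",
    verbose=True)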

iriet commented 7 years ago

The merge_strategy="create_unique" worked well for what we want. Thank you kindly for the pointer.

Best Takeshi
