Closed yarden closed 7 years ago
Yeah, there's a LOT of optimization that can be done, especially in the refactor branch!
If you do `run -p scriptname.py` in IPython, you can profile the code. In many of my tests, the `_split_keyvals()` function is the rate-limiting step. The problem here is that this function does a lot of the dialect detection -- essentially trading speed for flexibility in input format. Restructuring `_split_keyvals()`, or rewriting it in Cython, would be a good start to speeding things up.
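To make the tradeoff concrete, here is a minimal sketch (mine, not the gffutils implementation) of the kind of per-field dialect detection a `_split_keyvals()`-style function has to do on every line:

```python
# Sketch of attribute-string parsing with per-field dialect detection.
# Not the gffutils implementation -- just illustrates why flexibility
# about input format costs time on every single line.
def split_keyvals(attr_string):
    d = {}
    for field in (f.strip() for f in attr_string.strip().split(";")):
        if not field:
            continue
        if "=" in field:
            # GFF3 dialect: key=val1,val2
            key, val = field.split("=", 1)
        else:
            # GTF dialect: key "val"
            key, _, val = field.partition(" ")
            val = val.strip('"')
        d.setdefault(key.strip(), []).extend(val.split(","))
    return d
```

Real dialect handling is much hairier (quoting, multi-valued keys, trailing semicolons, repeated keys), which is why this function dominates profiles.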
More directly related to the db import, there are actually 2 code paths in `_GFFDBCreator.populate_from_lines()` where I was doing some profiling (I left the comments in there with some results).
So: yes, lots of optimization to be done, it just hasn't been a priority in the refactor branch.
After some profiling work, I found out that OrderedDicts are pretty slow. Here's a script to profile different implementations: https://gist.github.com/daler/5588154. It tries to mimic attributes of a GFF file being parsed into a dictionary.
Short answer: changing over from the current `DefaultOrderedDict` to a naive ordered dict gains a 2.5 to 5.7x speed increase (!). Details below. The flavors profiled were:

- plain `dict`, using `setdefault` to create an empty list if the key doesn't exist
- `collections.defaultdict`. Somewhat faster than `dict`.
- `collections.OrderedDict`. Sloooooow.
- `gffutils.helpers.DefaultOrderedList`. Also slow.

Each was timed with `run -t -N5 dictprofile.py <flavor>`.
. nlines = 200000
nkeys = 5
vals_per_key = 3
# makes this dictionary 200K times:
{0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2], 3: [0, 1, 2], 4: [0, 1, 2]}
nlines = 200000
nkeys = 1
vals_per_key = 1
# makes this dictionary 200K times:
{0: [0]}
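A stripped-down sketch of the gist's idea (smaller `nlines` so it runs quickly; `DefaultOrderedDict` omitted since it lives in gffutils). Note that on modern CPython, where plain dicts preserve insertion order, the gap is smaller than it was at the time:

```python
import timeit
from collections import OrderedDict

nlines, nkeys, vals_per_key = 20000, 5, 3

def build_one(factory):
    # Build one attributes-like dict: {0: [0, 1, 2], 1: [0, 1, 2], ...}
    d = factory()
    for k in range(nkeys):
        for v in range(vals_per_key):
            d.setdefault(k, []).append(v)
    return d

def profile(factory):
    # Total time to build the dict nlines times with this flavor
    return timeit.timeit(lambda: build_one(factory), number=nlines)

print("dict:        %.3fs" % profile(dict))
print("OrderedDict: %.3fs" % profile(OrderedDict))
```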
Interesting! I wonder if some sort of arrangement with named tuples (or a totally Cythonized data structure instead) would be best. The dictionaries belonging to particular records in GFFs are always very small (i.e. few keys), so the overhead of a hash is not always justified.
Is it correct to assume that the core code still uses defaultdicts? With the new commits, I'm not seeing a difference in speed for operations like database creation, as in this snippet from `helpers.py`:
# Return sanitized GFF database
sanitized_db = \
    gffutils.create_db(sanitized_iterator(), ":memory:",
                       verbose=False)
In my analysis that's mostly where the bottleneck is. --Yarden
Right, I haven't changed anything yet in response to this new info. I'll look into named tuples -- I didn't think of that, thanks. Good point about the hash overhead . . . time for some more benchmarks.
Hi Yarden - I've made some optimizations resulting in better performance than even the original master branch. Using http://genes.mit.edu/yarden/Mus_musculus.NCBIM37.65.gff:
`refactor` branch (as of 0e584d2384f572cd2980b44ffc0973354ab92ec1):
$ time gffutils-cli create Mus_musculus.NCBIM37.65.gff
real 3m45.298s
user 2m40.716s
sys 0m12.088s
`master` branch:
$ time gffutils-cli create Mus_musculus.NCBIM37.65.gff
real 4m55.100s
user 2m14.044s
sys 0m12.908s
Most of this came from converting the attributes of each feature into a plain ol' dictionary. But I wanted to maintain the information about the order of attributes within each feature so that GFF -> db -> GFF gets you exactly the same GFF that you put into the db in the first place. That order information is now stored once for the entire db in the `dialect`.
Can you test and see if it's a reasonable speed?
Related, is it possible to parallelize the code? I'm making a `FeatureDB` from Gencode v19 and it's taking quite a long time. It would be great if the code were parallelizable and the database creation could be farmed out to a cluster.
Hi Olga -
sqlite doesn't allow concurrent writes, so each cluster node would have to create their own db, and then all these databases would need to be merged back together. Since the GFF/GTF format allows features to be in any order, it's possible to have pathological cases where two exons for a transcript are on the first line and last line of the file respectively. I think this means that each db would need to be read in its entirety in order to ensure all parent/child relations are accounted for.
I don't have a good feel for how this would compare to going through the file on a single core, or what the optimal number of cluster nodes would be. My hunch is that creating dbs in parallel, and then using sqlite3 to sort and merge may be faster than a single CPU. I'll have to work on tests for this.
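The merge-back step could be sketched with sqlite's `ATTACH`; this is a hypothetical workflow, and the `features` schema here is illustrative, not the actual gffutils schema:

```python
import sqlite3

# Hypothetical merge-back sketch: each cluster node writes its own db
# with a "features" table (schema simplified), then the parts are
# combined into one db with ATTACH + INSERT ... SELECT.
def make_part(path, rows):
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE features (id TEXT, seqid TEXT, start INT, end INT)")
    con.executemany("INSERT INTO features VALUES (?, ?, ?, ?)", rows)
    con.commit()
    con.close()

def merge_parts(out_path, part_paths):
    con = sqlite3.connect(out_path)
    con.execute("CREATE TABLE IF NOT EXISTS features (id TEXT, seqid TEXT, start INT, end INT)")
    for p in part_paths:
        con.execute("ATTACH DATABASE ? AS part", (p,))
        con.execute("INSERT INTO features SELECT * FROM part.features")
        con.commit()  # must commit before DETACH
        con.execute("DETACH DATABASE part")
    con.close()
    return out_path
```

This only covers the features table; as noted above, parent/child relations spanning parts would still need a full pass to reconcile.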
I mostly use GFF, and I'm happy with the current speed (@yarden, thoughts on this?). I have not spent as much time optimizing GTF, so I think the first step will be to see if I can get GTF on a single node to work at a reasonable speed.
I downloaded Gencode v19 and tried creating a `FeatureDB`. It was unacceptably slow, and I killed it after 2 hrs. I made some changes to `gffutils`, and it runs in about 12 mins now (see below).
For future reference, there are two reasons for the speedup. First, I added temporary indexes just prior to inferring gene and transcript extents for GTF files in `_GTFDBCreator._update_relations()`:
CREATE INDEX relationsparent ON relations (parent);
CREATE INDEX relationschild ON relations (child);
This sped up the nested subqueries for inferring gene extent. But I noticed that importing those inferred features was taking a long time, and seemed like it was hitting a lot of ID collisions.
Looking more closely, the Gencode v19 GTF file already has genes and transcripts inferred. So `gffutils` was spending a lot of effort to infer the extents of genes and transcripts. Then it would re-import them . . . only to find that they already existed in the database, which meant more effort figuring out how to merge them!
So the second reason it's faster now is that I added an `infer_gene_extent` kwarg to `gffutils.create_db()`, and a corresponding `--disable-infer` argument to the `gffutils-cli` command-line tool.
With these new changes (as of 4cb0cb810887cb24d81f941bfb2d5fcc6947f4a2), it takes about 12 min to create a database out of the 2.6 million line Gencode v19 GTF file:
$ time gffutils-cli create gencode.v19.annotation.gtf --disable-infer
# real 11m38.776s
# user 9m44.689s
# sys 1m13.821s
Would you try this out to see if it works for you? Does this seem like a reasonable amount of time?
Thank you for the response! I had been trying to make a Gencode db for a week now, but it kept taking a really long time. My current run is at 51 hours on the cluster.
Without a transform, those improvements are quite fast. At about 6 minutes in on the cluster, it's halfway done (~1.3mil/2.6mil features), close to your 12 min benchmark:
import gffutils
v19gff = '/home/obotvinnik/scratch/genomes/hg19/gencode/v19/gencode.v19.annotation.gtf'
v19db_filename = v19gff + '.test_no_fancy_id.db'
v19db = gffutils.create_db(v19gff, v19db_filename, merge_strategy='merge',
                           id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'},
                           force=True, verbose=True)
But I'd like to transform the IDs of exons, CDS, stop/start codons (basically everything except gene and transcript) to be `feature:chrom:start-stop`, e.g. `CDS:chr1:555555-666666`, rather than numbered, and this is still slow (8 min, 47k/2.6mil features). Do you have any suggestions?
import gffutils
v19gff = '/home/obotvinnik/scratch/genomes/hg19/gencode/v19/gencode.v19.annotation.gtf'
v19db_filename = v19gff + '.test.db'
def transform(f):
    features = ['gene_name', 'exon_number', 'gene_id', 'transcript_id', 'exon_id']
    if set([f.featuretype]) & set(('gene', 'transcript')):
        return f
    else:
        exon_location = '{}:{}:{}-{}:{}'.format(f.featuretype, f.seqid, f.start, f.stop, f.strand)
        exon_id = exon_location
        f.attributes['fancy_id'] = [exon_id]
        return f
v19db = gffutils.create_db(v19gff, v19db_filename, merge_strategy='merge',
                           id_spec={'gene': 'gene_id', 'transcript': 'transcript_id',
                                    'exon': 'fancy_id', 'CDS': 'fancy_id',
                                    'start_codon': 'fancy_id', 'stop_codon': 'fancy_id',
                                    'UTR': 'fancy_id'},
                           transform=transform, force=True, verbose=True)
Read your post more carefully and am now trying with `infer_gene_extent=False`, with the transform. I'll let you know how it goes.
Hmm, I bet it will still be slow. It's going to try and merge together all the attributes of features with the same ID, of which there will be many using this transform. In the first 10k lines, it looks like there are quite a few features that will have many duplicate IDs:
cut -f 1,3-5 ../gencode.v19.annotation.gtf \
| head -n 10000 \
| sort -k1,1 -nk2,2 -nk3,3 \
| uniq -c \
| sort -nrk1 \
| head -n 10
# 23 chr1 exon 1256376 1256473
# 21 chr1 exon 1559154 1559325
# 20 chr1 exon 1248889 1248972
# 19 chr1 exon 2066701 2066786
# 19 chr1 exon 1563869 1564102
# 19 chr1 exon 1560175 1560281
# 19 chr1 exon 1255836 1255909
# 19 chr1 exon 1250900 1250998
# 19 chr1 exon 1250784 1250818
# 18 chr1 exon 2080311 2080363
Or more precisely, to get an idea of the number of name collisions:
cut -f 1,3-5 gencode.v19.annotation.gtf \
| grep -vE "gene|transcript" \
| head -n 10000 \
| sort -k1,1 -nk2,2 -nk3,3 \
| uniq -c \
| awk 'BEGIN {count=1}{if ($1 > 1) count +=$1-1;} END {print count}'
# 5344
If I got this right, more than half (!) of all non-gene, non-transcript features will have name collisions that will need to be merged. I'm not sure there's a way to speed this up.
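The same estimate can be made in Python; this sketch mirrors the shell pipeline above, counting lines whose coordinate-based key would collide:

```python
from collections import Counter

# Count how many non-gene/non-transcript GTF lines share a
# (chrom, featuretype, start, end) key -- i.e., how many merges a
# coordinate-based ID like fancy_id would trigger.
def count_collisions(lines):
    keys = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[2] in ("gene", "transcript"):
            continue
        keys[(fields[0], fields[2], fields[3], fields[4])] += 1
    # every line beyond the first with a given key is a collision
    return sum(n - 1 for n in keys.values() if n > 1)
```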
Anyway, the result should be a nice clean db with unique feature IDs for each genomic region, and each region will have a list of transcript_id parents . . . but it will take some time.
Yes, there are a lot of ID collisions when it comes to exons in hg19. But that's exactly what I want - I want each exon to be part of multiple transcripts of a single gene.
I'm trying it out with `'exon': 'exon_id'` in the `id_spec` instead of using the transform, and it seems to be taking just as long as the transform. So yes, it's the merging step that's taking the longest.
This is a minor point but it seems very inefficient/unnecessary to create two tiny set() elements each time transform gets called.
Ah yes good point. I'll change that in the code.
The "merge" merge strategy starts here: https://github.com/daler/gffutils/blob/master/gffutils/create.py#L175
Briefly, this method looks up previously-observed duplicates via `_DBCreator._candidate_merges()`.
This "previously-observed duplicates" step was added in 030737ba3a54132fc693885e72d28dc4762d3c70. It makes duplicate-finding much more robust, but I suspect it's slowing things down. Time for some tests to see if it's possible to disable in some cases . . .
OK, I figured out what's causing the slowdown -- in this Gencode GTF, exons can have the same chrom/start/stop/strand but have different sources, like these two exons:
chr1 HAVANA exon 12613 12721 . + . transcript_status "KNOWN"; gene_status "KNOWN"; havana_gene "OTTHUMG00000000961.2"; level "2"; transcript_type "processed_transcript"; tag "basic"; gene_id "ENSG00000223972.4"; exon_id "ENSE00003582793.1"; exon_number "2"; fancy_id "exon:chr1:12613-12721:+"; havana_transcript "OTTHUMT00000362751.1"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; transcript_name "DDX11L1-002"; gene_name "DDX11L1";
chr1 ENSEMBL exon 12613 12721 . + . gene_status "KNOWN"; exon_number "2"; level "3"; transcript_type "transcribed_unprocessed_pseudogene"; gene_id "ENSG00000223972.4"; exon_id "ENSE00003608237.1"; transcript_id "ENST00000515242.2"; fancy_id "exon:chr1:12613-12721:+"; havana_gene "OTTHUMG00000000961.2"; transcript_name "DDX11L1-201"; gene_type "pseudogene"; transcript_status "KNOWN"; gene_name "DDX11L1";
Currently, merging in `gffutils` requires all fields except the attributes field to be identical (chrom, source, start, stop, etc.). Here they are not, since the first source is HAVANA and the second is ENSEMBL. So `gffutils` treats them as different features that evaluate to the same ID spec. It then gives the second feature an auto-incremented primary key for the db, so you can still access both features separately. This adds bloat to the duplicates table, slows things down, and in the end does not give you the db that I think you're going for, because the second feature will have an ID of `exon:chr1:12613-12721:+_1`.
One solution is to modify your transform func to set all sources to be the same. I tried this, and it speeds things up a little . . . but then realized there are other cases where exons have the same chrom/start/stop/strand, but different frames. If you don't care about frame, you can set those to all be "." as well in your transform func. Or you can add frame to the fancy ID. Both of these fix the slowdown, creating a merged db in ~17 mins.
Alternatively, I can add a mechanism for specifying which non-attributes fields should be ignored when doing a merge. Maybe a `force_merge_fields` kwarg to `create_db()`. In this case, you'd specify `force_merge_fields=["source", "frame"]` to indicate that features with differing sources or frames should still be merged.
This leads to a design question: how should the merged fields be reported? My initial thought is a comma-separated list, like "HAVANA,ENSEMBL". But then how to merge exons with different frames? Comma-separated frames aren't supported by the GFF/GTF spec, and may break downstream code. Any suggestions?
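For the source field, the comma-separated idea could be as simple as an order-preserving merge like this (a sketch, not gffutils code; as noted above, it doesn't resolve the frame question):

```python
def merge_field(values):
    # Merge differing field values into a comma-separated list,
    # preserving first-seen order and dropping duplicates,
    # e.g. ["HAVANA", "ENSEMBL"] -> "HAVANA,ENSEMBL".
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return ",".join(seen)
```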
Hmm, I hadn't even noticed the different source cases! However, it seems as though this exon is part of different transcripts: for the HAVANA version the transcript id is `ENST00000456328.2`, and for the ENSEMBL version it is `ENST00000515242.2`. Since the frame is the same, this is ultimately the same locus, just from different sources. So I think a list like `sources = ['HAVANA', 'ENSEMBL']` is just fine.

I care a lot about frame, so I'd rather not remove it. Can `frame` be a list like the values in `attributes`? Though somehow those values would need to be tied to which transcripts produce one frame or another, so it seems a better idea to include `frame` in the ID, e.g. `CDS:chr1:12613-12721:+:.` (since `frame` is only specified for `CDS` features).
Also, is it possible for CDS features to be children of exons?
I'll work on merging non-attribute fields.
For the reasons you point out, I agree that it's best to add `frame` to the custom ID in this case.
As for your question,
Also, is it possible for CDS features to be children of exons?
Probably . . . would you define this based on coordinates alone? Or I suppose the custom IDs, where `exon:chr1:100-200:+:.` is the parent of `CDS:chr1:100-200:+:2`? This would be sort of a retrofit on the database after it's created. All you need to do is add parent, child, and level to the `relations` table. I can add a helper method to do this.
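As a rough illustration of that retrofit (simplified schema and a hypothetical helper name, not the actual gffutils tables):

```python
import sqlite3

# Hypothetical post-creation retrofit: add exon -> CDS rows to the
# relations table whenever a CDS falls within an exon on the same
# seqid and strand. Schema here is simplified for illustration.
def add_exon_cds_relations(con):
    con.execute("""
        INSERT INTO relations (parent, child, level)
        SELECT e.id, c.id, 1
        FROM features e
        JOIN features c
          ON e.seqid = c.seqid
         AND e.strand = c.strand
         AND e.start <= c.start
         AND e.end >= c.end
        WHERE e.featuretype = 'exon'
          AND c.featuretype = 'CDS'
    """)
    con.commit()
```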
I'll create separate issues for these.
That would be great, thanks! I'll add the `frame` to the ID as well.
re: CDS - there are a couple of issues. The same exon locus can map to multiple `ENSE` ids if the CDSs are of different frames. Ideally, the same exon locus `exon:chr1:100-200:+:.` would be a parent to both `CDS:chr1:100-200:+:2` and `CDS:chr1:100-150:+:0`. So somehow the parent exon locus would have to match up with the different `ENSE` ids of the CDSs.

I created issues #27 and #28 for this; can you take a look and comment?
This is still taking a while for me. On our cluster (1 node, 1 processor, 16G RAM), I'm only 1mil features in after about a day (currently ~21:00, Fri Jan 3):
import gffutils

v19gff = '/home/obotvinnik/scratch/genomes/hg19/gencode/v19/gencode.v19.annotation.gtf'
v19db_filename = v19gff + '.db'

gene_transcript = set(('gene', 'transcript'))

def transform(f):
    if f.featuretype in gene_transcript:
        return f
    else:
        exon_location = '{}:{}:{}-{}:{}'.format(f.featuretype, f.seqid, f.start, f.stop, f.strand)
        exon_id = exon_location
        if f.featuretype == 'CDS':
            exon_id += ':' + f.frame
        f.attributes['fancy_id'] = [exon_id]
        return f
In [2]: ! date
Thu Jan 2 21:37:58 PST 2014
In [3]:
In [3]: %time v19db = gffutils.create_db(v19gff, v19db_filename, merge_strategy='merge', id_spec={'gene': 'gene_id', 'transcript': 'transcript_id','exon':'fancy_id', 'CDS':'fancy_id','start_codon':'fancy_id', 'stop_codon':'fancy_id','UTR':'fancy_id'}, transform=transform, force=True, verbose=True, infer_gene_extent=False, force_merge_fields=['source'])
Populating features table and first-order relations: 949000 features
Any suggestions?
(btw haven't tried out the add relations stuff yet - still trying to get this to work)
Hmm, that is strange. It's taking me about 20 mins on my desktop or on a cluster node. I checked the latter in case it was a file system problem, but it doesn't seem to be. What version of sqlite are you running? The db uses WAL journaling mode if you're running >3.7.0 (http://www.sqlite.org/wal.html) which apparently can have drastic speed improvements, though I haven't quantified that.
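For reference, switching the journal mode is a one-line pragma in plain sqlite3 (nothing gffutils-specific). WAL needs sqlite >= 3.7.0 and an on-disk database; on older versions the pragma just leaves the current mode in place:

```python
import sqlite3
import tempfile
import os

# WAL only applies to on-disk databases; :memory: dbs report "memory".
path = os.path.join(tempfile.mkdtemp(), "test.db")
con = sqlite3.connect(path)

# The pragma returns the mode actually in effect afterwards.
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print("journal_mode is now:", mode)  # "wal" on sqlite >= 3.7.0
```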
I can send you the db if you'd like -- I copied your code into the script here and ran it: https://gist.github.com/daler/8259098
Thanks for the offer - could you send it to obotvinn@ucsd.edu?
I'd still like to figure this out for future cases. The version of sqlite we have is 3.6.20, so I'll try upgrading to 3.7 (and hopefully not break anything....) and see if that improves it.
In [3]: import sqlite3
In [4]: sqlite3.version
Out[4]: '2.6.0'
In [5]: sqlite3.sqlite_version
Out[5]: '3.6.20'
Hi Olga - As of 1ac2d0c893bd4e063627364c39f34d37c173b4c1, the journal_mode is now MEMORY (see #33) instead of the WAL that was giving you trouble on sqlite <3.7. If you're still having issues with large files on older sqlite3 versions, could you see if this now works for you?
Just wanted to follow up - my project is sinusoidal in its requirement for `gffutils` and it's now in a "peak" again. This seems to work! The Gencode v19 db took 26 min to create with the previous code, so it looks like it's working well!
Hah, sinusoidal. Glad it's working now, and thanks for the follow-up.
Hi @daler and @olgabot
I tried the script here: https://gist.github.com/daler/8259098 on gencode v19 and it took 6 days and ended in an sqlite3 error. Unfortunate.
I changed some parameters until it started working better (biology style!) and found that this combination of parameters for `create_db` works in about 30 minutes on my system:
gffutils.create_db("gencode.v19.annotation.gtf", "gencode.v19.annotation.db",
                   keep_order=False, sort_attribute_values=False,
                   id_spec={'gene': 'gene_id', 'transcript': 'transcript_id'},
                   verbose=True, merge_strategy='merge', infer_gene_extent=False)
Hopefully this helps someone trying to index their own database. I'm not sure whether it's the transform or the extra fields in id_spec that are causing the slowdown or even something else.
Best, Mike
@mlovci - yikes, 6 days is unacceptable. In order for me to troubleshoot this can you provide some more info? Specifically,
1) Did you use the exact file from
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz?
If not, can you send a link to the exact file you used?
2) What OS are you using?
3) What version of sqlite do you have installed? Use
import sqlite3
print sqlite3.version
print sqlite3.sqlite_version
Thanks.
1) Did you use the exact file from
yes
2) What OS are you using?
Ubuntu Intel(R) Xeon(R) CPU @ 2.40GHz 32 GB RAM
3) What version of sqlite do you have installed? Use
In [1]: import sqlite3
In [2]: print sqlite3.version
2.6.0

In [3]: print sqlite3.sqlite_version
3.8.4.1
This isn't urgent, the new parameters are working for me, but thanks for checking.
Thanks. I also forgot to ask: what version of `gffutils` are you using, and how did you install it?
It's great that you got it working using the new parameters (and thanks for reporting it), but I really should figure out exactly why it was taking so long for you in the first place.
>>> gffutils.__version__
'0.8.2'
I installed with plain old `pip install gffutils` first, then when that didn't work quickly I did:
git clone https://github.com/daler/gffutils.git
pip uninstall gffutils
cd gffutils
pip install .
OK. I'll look into this some more and report back.
Any luck with this? I am trying to graduate from the examples to the human genome (e.g. ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz) and some ChIP-seq data downloaded from the SRA. Do you have any advice on a dataset or database to try? I don't mind waiting several days for this, but hoped I could get started on something in the meantime, especially since it might crash and I would have to start over again. The original 1.4GB GTF file pretty rapidly gets converted to a 2.3GB db file, but then the db file stays that size and the single CPU keeps chugging away at 99% utilization for hours (the 30 other CPUs stay at 0%). Memory utilization on my system peaks at about 0.1%. I have no indication of whether the process is going well, whether it has crashed, or how long it might take.
I was never able to reproduce the error. It could have been a hardware-related issue, since sqlite3 can be slow on NFS filesystems. I will run the 84 Ensembl GTF file and report back here.
Also, you mention ChIP-seq data -- is that related to this db creation issue in some way?
Are you using the `disable_infer_genes` and `disable_infer_transcripts` kwargs? The following code:
# using:
# http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.84.chr.gtf.gz
import gffutils
gffutils.create_db(
'Homo_sapiens.GRCh38.84.chr.gtf.gz',
'Homo_sapiens.GRCh38.84.chr.gtf.db',
disable_infer_genes=True,
disable_infer_transcripts=True)
took under half an hour:
real 24m56.864s
user 18m54.308s
sys 0m39.158s
Can you provide the exact commands you're using?
I did this:
import gffutils
fn = "Homo_sapiens.GRCh38.84.chr.gtf" # this was gunzipped into the working directory
db = gffutils.create_db(fn, dbfn='test.db', force=True, keep_order=True, merge_strategy='merge', sort_attribute_values=True)
It looks like my processes finished overnight. I did not set `disable_infer_genes=True` and `disable_infer_transcripts=True` as you did in your code. I naively thought that I would want the gene information and transcript information, so I had kept these on. Is it OK to use the databases I created overnight?
Any idea why with my code the database size ramped up quickly to a 2.3GB db then basically remained the same size while the CPU worked all night? It sounds like inferring genes and transcripts is what took such a long time so was wondering what that computation is (a sort operation?). The final database size this morning is 2.4GB, not much different from the 2.3GB it ramped up to in about 20-30 minutes. I thought that the create_db() function was simply packaging the GTF file's column data into a sql database but clearly there is more computation going on!
You want `disable_infer_transcripts=True` and `disable_infer_genes=True` for this file. Ensembl's GTFs are not true GTFs since they contain lines for genes and transcripts already. If you're using a recent version of `gffutils`, you should get a warning saying that there were genes/transcripts in your GTF already and that it could take a while if you don't disable inferring genes/transcripts.
Putting the GTF into the features table and relations table is just the first step. When `gffutils` detects a GTF file, it does a lot of work to infer transcripts and genes. After it infers each gene or transcript, it sees that there is already a transcript or gene in the database with the same ID -- because this file contains genes and transcripts, they've already been imported by this point. Then gffutils does more work to merge the information from the gene or transcript it just inferred with the one that already exists in the database.
So the extra time is due to merge operations caused by name collisions, which in turn were caused by inferring genes and transcripts that already exist in the GTF file. I'd re-run using those `disable_infer_*` args. It shouldn't take long, and if it takes longer than an hour let me know.
Hi @daler,
I am trying to use gffutils to work with the GRCh38 RefSeq annotations, and the database creation is taking quite some time. I am hoping you could throw us a few tips on options we should be flipping on/off. The goal of our use case is to parse exon coordinates for alternative splice variants, referenced to their genomic locations. We would like to keep the affiliations of alternate variants with their primaries, and coordinate annotations for 5'/3' UTRs too, ideally.
The gff file was obtained from here:
ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
Our attempts ran for a couple of hours on my student's machine before we aborted the process. Then, I also tried it for an hour before aborting, on my (newer) 2.9 GHz Intel Core i7 laptop. The latter is running macOS Sierra 10.12.5 with gffutils version 0.8.7.1 which was installed using conda version 4.3.18. The python is 3.6.0 and the sqlite is version 3.13.0.
thanks for any help, Takeshi
It should take <15 mins. Are you using `disable_infer_genes=True` and `disable_infer_transcripts=True` as described above?
Wait, sorry, I thought this was related to the above comments. The file you link to is a GFF, so there is no gene or transcript inference by default. It appears the issue with this file is the duplicated IDs.
Databases need a unique ID for each feature. The `ID` field in this file is not unique -- for example, here are the features for `ID=cds1`:
NC_000001.11 Gnomon CDS 182709 182746 . + 0 ID=cds1;Parent=rna37;Dbxref=GeneID:102725121,Genbank:XP_011542110.1;Name=XP_011542110.1;gbkey=CDS;gene=LOC102725121;product=uncharacterized protein LOC102725121 isoform X1;protein_id=XP_011542110.1
NC_000001.11 Gnomon CDS 183114 183240 . + 1 ID=cds1;Parent=rna37;Dbxref=GeneID:102725121,Genbank:XP_011542110.1;Name=XP_011542110.1;gbkey=CDS;gene=LOC102725121;product=uncharacterized protein LOC102725121 isoform X1;protein_id=XP_011542110.1
NC_000001.11 Gnomon CDS 183922 184158 . + 0 ID=cds1;Parent=rna37;Dbxref=GeneID:102725121,Genbank:XP_011542110.1;Name=XP_011542110.1;gbkey=CDS;gene=LOC102725121;product=uncharacterized protein LOC102725121 isoform X1;protein_id=XP_011542110.1
If you use `merge_strategy="merge"`, then gffutils assumes the lines refer to the same feature and so does a lot of work to merge the attributes in a nice way. Looking at this file, though, the CDSs should definitely be considered different features.
You'll need to decide how you want to be able to refer to CDSs. If you don't really have an opinion on that, you can try the `merge_strategy="create_unique"` argument when creating the database, which should speed things up considerably. The features above will then be called `cds1`, `cds1.1`, and `cds1.2`. Alternatively, you can write a transform function to do arbitrary manipulation of the features before they get into the db, for example to create your own custom ID field based on the other attributes.
Give `merge_strategy="create_unique"` a try to see if it helps the speed issue. It should still run in <15 mins.
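The `create_unique` naming described above (`cds1`, `cds1.1`, `cds1.2`, . . .) can be sketched roughly like this (illustrative, not the actual gffutils code):

```python
def unique_id(existing, candidate):
    # First occurrence keeps its ID; later duplicates get .1, .2, ...
    # `existing` is the set of IDs already assigned in the db.
    if candidate not in existing:
        existing.add(candidate)
        return candidate
    n = 1
    while "%s.%d" % (candidate, n) in existing:
        n += 1
    new_id = "%s.%d" % (candidate, n)
    existing.add(new_id)
    return new_id
```

The key point speed-wise: this only appends a suffix, so there is no attribute merging at all.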
The merge_strategy="create_unique" worked well for what we want. Thank you kindly for the pointer.
Best Takeshi
Is it possible to optimize database creation? It'd be interesting to see where the bottlenecks are. Would Cythonizing certain parts of the code help with this, or is the bottleneck purely at the sqlite interface? For example, creation of a database for this GFF (Ensembl mouse genes) takes ~15 minutes:
The bottleneck is there even when dbs are created in memory, though it's of course smaller in that case. This is related to another issue we discussed: how to create derivative GFF files from existing ones, i.e. how to iterate through a GFF and make a new version of it. For example, you might want to iterate through the above GFF and simply add an optional key, value pair to the attributes of some records. This is a kind of "streaming" operation that can be done line-by-line and doesn't necessarily need a db. The overhead of creating the db makes gffutils impractical for these kinds of simple operations on large-ish GFF files.
There are more sophisticated operations (i.e. non-streaming ones) where a naive in-memory solution is still considerably faster because of the database creation bottleneck. For example, the naive GFF parser used in `misopy.gff_utils` (see `load_genes_from_gff` in https://github.com/yarden/MISO/blob/fastmiso/misopy/Gene.py) simply iterates through the GFF file multiple times to collect all the gene entries into a simple `Gene` object, with mRNAs represented as lists of exons. This kind of gene-centric in-memory solution is less expressive than gffutils (it does not handle arbitrary nesting, and basically ignores non-genes), but for simple operations like "Add attribute X to all genes" or "Reformat the IDs of this GFF" it's considerably faster. It's not actually faster once the db is created -- gffutils retrieval is excellent -- but the overhead of creating the db trumps the computation time for many of these operations.

In summary, I'm wondering if there's a way to bridge the gap between the fast but hacky solutions and the sophisticated gffutils solution that comes with an overhead. I think this is an important issue because many of the operations done on GFFs (at least that I do) don't require many hierarchical SQL queries.
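As an illustration of the streaming case described above, adding a key/value pair to every record can be a pure line-by-line pass with no db at all (a sketch assuming GFF3-style `key=val;` attributes in column 9):

```python
def add_attribute(lines, key, value):
    # Stream GFF3 lines, appending key=value to the attributes column
    # of each feature line; comments and blank lines pass through.
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        attrs = fields[8].rstrip(";")
        fields[8] = "%s;%s=%s" % (attrs, key, value) if attrs else "%s=%s" % (key, value)
        yield "\t".join(fields)
```

Memory use is constant regardless of file size, which is exactly the property the db-creation path gives up.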