gffutils preparation of genome annotations for cufflinks is very slow

schelhorn commented 9 years ago

I am currently profiling a couple of RNA-Seq analyses, and by far the longest-running process is not mapping or the main cufflinks process (as one would assume) but preparation of the genomic reference database in fix_cufflinks_attributes. That function loads a couple of million genomic annotations from a reference GTF, which gffutils heavily post-processes into an in-memory sqlite database that bcbio uses later. And that takes hours.

The main problem is not only that database preparation is slow (probably slower than it would have to be due to a couple of very expensive joins in gffutils that happen during database construction), but that the database preparation is repeated every time bcbio runs an analysis (or is restarted on the same working directory), since the annotation database is in-memory.

While it is possible to speed up the process a little (using pragmas and indexes), it is still very wasteful to generate the annotation database anew each time a sample is analyzed. Since the annotation database depends only on the reference annotation, which is constant, I strongly suggest it should instead be distributed as a pre-generated file by bcbio or cloudbiolinux as part of the reference material. The pre-generated sqlite database could then be loaded into gffutils in fix_cufflinks_attributes. As far as I know, all facilities to generate, store and load such gffutils databases are already in place in gffutils itself, so there would be no overhead other than making the database once, adding it to the RNA-Seq reference package, and loading it in fix_cufflinks_attributes (or falling back to in-memory if it is not available).
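A minimal sketch of the load-or-fall-back idea, using only the standard-library sqlite3 module rather than gffutils itself. The function names, the `build` callable and the toy schema are hypothetical and are not taken from bcbio or gffutils.

```python
import os
import sqlite3


def open_annotation_db(db_path, build):
    """Open a pre-generated SQLite file if it exists, else build in memory.

    `build` is a hypothetical callable that fills an empty connection.
    Returns the connection and a label saying which path was taken.
    """
    if db_path and os.path.exists(db_path):
        # Reuse the file shipped with the reference data; nothing is rebuilt.
        return sqlite3.connect(db_path), "prebuilt"
    # Fallback: build from scratch in memory on every run.
    conn = sqlite3.connect(":memory:")
    build(conn)
    return conn, "in-memory"


def build_toy(conn):
    """Stand-in for the expensive annotation import (toy schema)."""
    conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")
    conn.execute("INSERT INTO features VALUES ('ENSG1', 'gene')")
    conn.commit()
```

With gffutils, the two branches would presumably correspond to opening an existing file with `gffutils.FeatureDB` and to calling `gffutils.create_db` with `":memory:"` as the database name.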

schelhorn commented 9 years ago

I proposed an improvement upstream in gffutils that sped up generation of the annotation database about 100-fold on my system using pragmas and indexes. Still, I recommend that the pre-generated database should be part of the reference resources instead.
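To illustrate what "pragmas and indexes" can look like in general, here is a sketch with the standard-library sqlite3 module. The specific pragmas, the table layout and the index are assumptions chosen for illustration; the thread does not say which ones the upstream gffutils change uses.

```python
import sqlite3


def bulk_load(conn, rows):
    """Load (parent, child) rows with relaxed durability, then index them."""
    # Trade crash safety for speed; acceptable for a database that can be rebuilt.
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA journal_mode = MEMORY")
    conn.execute("CREATE TABLE relations (parent TEXT, child TEXT)")
    with conn:  # a single transaction for all inserts
        conn.executemany("INSERT INTO relations VALUES (?, ?)", rows)
    # Index the join column before running self-joins on it.
    conn.execute("CREATE INDEX relations_parent ON relations (parent)")


def grandchildren(conn, gene_id):
    """Second-level children (e.g. exons of a gene) via a self-join."""
    query = """
        SELECT r2.child
        FROM relations AS r1
        JOIN relations AS r2 ON r2.parent = r1.child
        WHERE r1.parent = ?
        ORDER BY r2.child
    """
    return [row[0] for row in conn.execute(query, (gene_id,))]
```

Without the index, the self-join has to scan the whole table for every parent row, which is the kind of cost that grows quickly with millions of annotations.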

roryk commented 9 years ago

Hi @schelhorn,

I hope we meet in real life one day so I can buy you a coffee. There is a pre-generated database that comes with the reference resources; we were remaking it though by accident in this function. We can't always avoid making the database, though: the purpose of fix_cufflinks_attributes is to preserve the original gene and transcript ids when you assemble a new transcriptome, so we have to make a new database for each run if the transcriptome is assembled. If you click off assemble_transcriptome in the YAML file this function won't run if you don't care about assembling a new transcriptome.

Thanks for those upstream fixes. We noticed gffutils was super slow building a database for the release 79 annotations when we were working on hg19 support; I think your proposed change might fix that too.

schelhorn commented 9 years ago

@daler accepted the improvement proposal and the performance increase will probably be part of the next gffutils pypi release. Ceterum censeo...

roryk commented 9 years ago

Related, we've been playing with replacing Cufflinks with StringTie (http://ccb.jhu.edu/software/stringtie/) to speed this part up even more but haven't had a chance to check the reconstructions against each other yet.

schelhorn commented 9 years ago

> There is a pre-generated database that comes with the reference resources; we were remaking it though by accident in this function.

Excellent; considering cdd723b the transcriptome assembly should be much faster now - thanks. This issue is resolved.

> If you click off assemble_transcriptome in the YAML file this function won't run if you don't care about assembling a new transcriptome. (...) Related, we've been playing with replacing Cufflinks with StringTie.

Tapping StringTie is a great way to go once the high FPKM issue is resolved. However, I personally would prefer keeping both Cufflinks and StringTie as alternatives. So I suggest that the assemble_transcripts option in the YAML file should be converted from a boolean into a list that accepts cufflinks and stringtie values. In that manner, users can decide if they want to continue generating both assemblies as an intermediate solution until StringTie is sufficiently trusted.
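One way the boolean-to-list conversion could stay backward compatible is sketched below. `normalize_assemblers` is a hypothetical helper, not bcbio code, and mapping a legacy `true` to cufflinks only is an assumption.

```python
SUPPORTED_ASSEMBLERS = ("cufflinks", "stringtie")


def normalize_assemblers(value):
    """Turn a legacy boolean or a new-style list into a list of assemblers."""
    if value is None or value is False:
        return []
    if value is True:
        # Assumption: the old boolean meant "run Cufflinks".
        return ["cufflinks"]
    if isinstance(value, str):
        value = [value]
    names = [str(v).lower() for v in value]
    unknown = [n for n in names if n not in SUPPORTED_ASSEMBLERS]
    if unknown:
        raise ValueError("unsupported assembler(s): " + ", ".join(unknown))
    return names
```

Existing YAML files with `true` or `false` would keep working, while `[cufflinks, stringtie]` would request both assemblies.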

> I hope we meet in real life one day so I can buy you a coffee.

That could be arranged if you'd care to hop over to Dublin in July for HitSeq or ISMB. It would have to be Irish Coffee then, naturally.

roryk commented 9 years ago

Hi @schelhorn,

Totally agree with everything you said. I can't make it to ISMB this year unfortunately but I'm sure I'll have another chance to settle up the coffee debt. Thanks again!

daler commented 9 years ago

Hi @schelhorn and @roryk --

I'd be happy to help figure out how to speed things up on the gffutils side. Usually you have to choose the right db creation options, but as I mentioned in gffutils #48, in some cases the right options might not exist yet in gffutils.

For example, even given the fix in 5c7e7427a, a GTF file with transcript features but not gene features will still incorrectly enable the option to infer gene and transcript extents. Once gffutils has more granular control for this I can submit a PR to address this.
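A rough sketch of the more granular control described here: scan the GTF once and infer only the feature levels that are actually missing. The helper is hypothetical, and the keys `disable_infer_genes` and `disable_infer_transcripts` are assumed names for per-level switches; at the time of this comment such switches may not have existed in gffutils.

```python
def inference_options(gtf_lines):
    """Decide which extents need inferring from the feature types present."""
    seen = set()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            seen.add(fields[2])  # column 3 of a GTF is the feature type
    return {
        # Only skip inference for a level the file already provides.
        "disable_infer_genes": "gene" in seen,
        "disable_infer_transcripts": "transcript" in seen,
    }
```

For a file with transcript and exon lines but no gene lines, this keeps gene inference on and turns transcript inference off, which is the case the single combined option cannot express.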

Rory, you mentioned slow db creation with release 79 hg19, has this been resolved yet?

schelhorn commented 9 years ago

Followup: gffutils has been updated on PyPI:

> No prob, thanks for the input. BTW, v0.8.4rc1 is up on PyPI now. --@daler

and a new StringTie development version with fixed FPKM issue has been released:

> The devel branch here on github (which we are now testing for the imminent v1.0.4 release) should be ready for testing - and should have fixed these (and other) issues. --@gpertea

chapmanb commented 9 years ago

Sven-Eric -- thanks much. On the gffutils side, we're now pulling in the latest release candidate as conda libraries and also from the requirements file. I know @roryk is working on the StringTie side so will let him follow up on that. Thanks as always for all of the great work.

roryk commented 9 years ago

Thanks for the ping Sven-Eric. I've been waiting for the StringTie folks to make their release since they've been working on it a bunch and saying it's imminent; hopefully it will be soon.

pavo commented 8 years ago

I am also struggling with how slow it is to create databases using gffutils. Are there any precreated gffutils databases for the various Ensembl annotations (e.g. Homo_sapiens.GRCh38.84.gtf) available for download? Is it feasible for creation of the sqlite3 database in gffutils to be parallelized (i.e. in gffutils.create_db)? I have 32 cores on my workstation but only one is being utilized, so this is taking hours. Also, out of curiosity, how large should the gffutils db be compared to the gtf? I am wondering if the process is stalled and I have no way to know.

daler commented 8 years ago

This should definitely not take hours; it should be on the order of 10-20 mins on a single core. Are the issues you're having within the context of bcbio-nextgen? If not, we can continue over on https://github.com/daler/gffutils/issues/20#issuecomment-214069710.

Unfortunately sqlite3 doesn't support parallel writes and the nature of the database creation is such that parallelization won't gain much.

bcbio / bcbio-nextgen

gffutils preparation of genome annotations for cufflinks is very slow #849