daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License

FeatureDB.update on disk db very slow #227

Closed: YeHW closed this issue 5 months ago

YeHW commented 7 months ago

I'm trying to use FeatureDB.update and FeatureDB.create_introns to add intron features to the database. If the database is created in memory this is very fast, but if it is created on disk it appears to be very slow.

gffutils version: 0.12
Python version: 3.12.0

The gtf file I'm using is from refseq ftp:

curl -s https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gtf.gz \
  | gunzip -c \
  | grep -E "^#|^NC_000024.9|^NC_000013.10" \
  | gzip > NC_000013.10-NC_000024.9.gtf.gz

It's a subset of GCF_000001405.25_GRCh37.p13_genomic.gtf.gz.

Code:

import gffutils
from typing import Dict

def _create_db_kwargs() -> Dict:
    return {
        "verbose": True,
        "disable_infer_genes": True,
        "disable_infer_transcripts": True,
        "merge_strategy": "error",
    }

def _make_db(
    gtf: str, out: str, create_db_kwargs: Dict = _create_db_kwargs()
) -> None:
    db: gffutils.FeatureDB = gffutils.create_db(
        data=gtf, dbfn=out, **create_db_kwargs
    )

    # Creating intron features on its own is very fast
    db.create_introns()

    # Updating will be very slow if out is not ":memory:"
    db.update(db.create_introns(), **create_db_kwargs)

if __name__ == "__main__":
    # Very fast if db is created in memory
    _make_db(gtf="NC_000013.10-NC_000024.9.gtf.gz", out=":memory:")

    # Very slow if db is created on disk
    _make_db(gtf="NC_000013.10-NC_000024.9.gtf.gz", out="NC_000013.10-NC_000024.9.gtf.gz.db")

If the db is created on disk, I observed that it hangs at this step:

Populating features table and first-order relations: 0 features

What could cause this? Thanks in advance for any insights!

daler commented 5 months ago

OK, figured out what the issue is here. This is effectively updating the database with itself, so it's reading and writing simultaneously.

That is, db.create_introns() iterates through a query that incrementally yields exons from the db and creates introns from them; db.update() writes them immediately back to the database they're being read-and-generated from.
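
For illustration, here is the same pattern in plain sqlite3, independent of gffutils (the file, table, and column names below are made up): a generator lazily SELECTing rows from an on-disk database while its consumer INSERTs derived rows back into that same database.

import sqlite3

# Made-up on-disk database and schema; using ":memory:" instead sidesteps the issue.
con = sqlite3.connect("features.db")
con.execute("CREATE TABLE IF NOT EXISTS features (id TEXT, featuretype TEXT)")

def introns_from_exons(connection):
    # Lazily yields new rows derived from existing ones, the way create_introns()
    # incrementally yields introns built from exons; nothing is read until the
    # consumer starts iterating.
    query = "SELECT id FROM features WHERE featuretype = 'exon'"
    for (exon_id,) in connection.execute(query):
        yield (exon_id + "_intron", "intron")

# Consuming the generator here interleaves reads and writes on the same on-disk
# file, which is the shape of db.update(db.create_introns()).
con.executemany("INSERT INTO features VALUES (?, ?)", introns_from_exons(con))
con.commit()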

Three quick fixes:

1. Convert to a list of introns first

Consume the create_introns() generator before writing. That is, instead of

db.update(db.create_introns(), **kwargs)

use

db.update(list(db.create_introns()), **kwargs)

This will increase memory usage, but it works.

2. Use WAL

WAL (write-ahead logging) allows simultaneous reads and writes without blocking.

Warning: this does NOT work on networked filesystems like those typically used on HPC clusters!

db.set_pragmas({'journal_mode': 'WAL'})
db.update(db.create_introns(), **kwargs)
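
If you want to confirm the mode, or switch back to SQLite's default rollback journal once the update is done, something like this should work (a sketch, assuming FeatureDB.execute and set_pragmas pass straight through to the underlying sqlite3 connection):

# Should report 'wal' after the set_pragmas call above.
print(list(db.execute('PRAGMA journal_mode')))

# Switch back to the default rollback journal when finished.
db.set_pragmas({'journal_mode': 'DELETE'})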

3. Write to intermediate file

If memory is an issue and you're using a networked filesystem, then you can write out to a file first:

with open('tmp.gtf', 'w') as fout:
    for intron in db.create_introns():
        fout.write(str(intron) + '\n')

db.update(gffutils.DataIterator('tmp.gtf'), **kwargs)
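
If the leftover file bothers you, a variant using the standard library's tempfile module cleans up after itself (a sketch; db and kwargs are the same objects as in the snippets above):

import os
import tempfile

import gffutils

# Write the introns to a throwaway GTF, feed it back into the db, then delete it.
fd, tmp_path = tempfile.mkstemp(suffix=".gtf")
try:
    with os.fdopen(fd, "w") as fout:
        for intron in db.create_introns():
            fout.write(str(intron) + "\n")
    db.update(gffutils.DataIterator(tmp_path), **kwargs)
finally:
    os.remove(tmp_path)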

Summary

I'm not sure if anything in the code should be changed to address this. create_introns(), like most things throughout gffutils, is implemented as a generator to keep the memory footprint low, and I don't want create_introns() to suddenly return a list and use far more memory than any other method. I don't want to use WAL by default because gffutils tends to be used on HPC clusters, and those clusters tend to have networked filesystems. And I don't think I want to write to a temp file every time; that seems messy.

These different solutions would each be useful in different situations. So I think the best thing to do is to add some explanatory text to both update() and create_introns() and update the docs as well.

daler commented 5 months ago

Addressed in #231