**Closed** — YeHW closed this issue 5 months ago.
OK, figured out what the issue is here. This is effectively updating the database with itself, so it's reading and writing simultaneously. That is, `db.create_introns()` iterates through a query that incrementally yields exons from the db and creates introns from them; `db.update()` writes them immediately back to the same database they're being read and generated from.
Three quick fixes:

**Fix 1: consume the `create_introns()` generator before writing.** That is, instead of

```python
db.update(db.create_introns(), **kwargs)
```

use

```python
db.update(list(db.create_introns()), **kwargs)
```

This increases memory usage, but it works.
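To see why materializing the read side first avoids the conflict, here is a minimal sketch of the same pattern using plain `sqlite3` rather than gffutils itself; the table name and the `derive_introns` helper are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE features (name TEXT)")
con.executemany("INSERT INTO features VALUES (?)",
                [("exon1",), ("exon2",)])

def derive_introns(rows):
    # Stand-in for create_introns(): derive new rows from existing ones.
    for (name,) in rows:
        yield (name.replace("exon", "intron"),)

# Consume the SELECT completely before any INSERT touches the same
# table, mirroring db.update(list(db.create_introns()), **kwargs).
rows = list(con.execute("SELECT name FROM features"))
con.executemany("INSERT INTO features VALUES (?)", derive_introns(rows))
names = sorted(r[0] for r in con.execute("SELECT name FROM features"))
print(names)  # ['exon1', 'exon2', 'intron1', 'intron2']
```

The `list()` call is the whole trick: the read finishes before the write begins, at the cost of holding every derived row in memory at once.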
**Fix 2: enable write-ahead logging (WAL).** WAL allows simultaneous reads and writes without blocking. Warning: this does NOT work on a networked filesystem like those typically used on an HPC cluster!

```python
db.set_pragmas({'journal_mode': 'WAL'})
db.update(db.create_introns(), **kwargs)
```
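For reference, this is the underlying SQLite pragma; a standalone check with plain `sqlite3` (not gffutils) shows SQLite reporting the active journal mode back, which only switches to WAL on a file-backed database on a local filesystem:

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database (it does not apply to :memory:).
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
# SQLite answers the pragma with the journal mode actually in effect.
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # 'wal' on a local filesystem
con.close()
```

On a networked filesystem the shared-memory file WAL relies on may not work correctly, which is why it is not a safe default for HPC clusters.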
**Fix 3: spool to a temporary file.** If memory is an issue and you're using a networked filesystem, then you can write out to a file first:

```python
with open('tmp.gtf', 'w') as fout:
    for intron in db.create_introns():
        fout.write(str(intron) + '\n')
db.update(gffutils.DataIterator('tmp.gtf'), **kwargs)
```
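A variant of the same idea using `tempfile` so the scratch file gets a unique name and is cleaned up afterwards; the `create_introns` stub and its GTF-style lines below are invented stand-ins for the real generator:

```python
import os
import tempfile

def create_introns():
    # Hypothetical stand-in for db.create_introns().
    yield 'chr1\tsrc\tintron\t100\t200\t.\t+\t.\tgene_id "g1";'
    yield 'chr1\tsrc\tintron\t300\t400\t.\t+\t.\tgene_id "g1";'

# Spool the generator to disk so the read and the write never touch
# the database at the same time; the file could then be fed back in
# via gffutils.DataIterator(tmp_path).
fd, tmp_path = tempfile.mkstemp(suffix=".gtf")
with os.fdopen(fd, "w") as fout:
    for line in create_introns():
        fout.write(line + "\n")

with open(tmp_path) as fin:
    n_lines = sum(1 for _ in fin)
os.remove(tmp_path)
print(n_lines)  # 2
```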
I'm not sure if anything in the code should be changed to address this. `create_introns()`, like most things throughout gffutils, is implemented as a generator to keep the memory footprint low. I don't think I want to return a list such that `create_introns()` suddenly uses way more memory than any other method. I don't want to use WAL by default because gffutils tends to be used on HPC clusters, and those clusters tend to have networked filesystems. And I don't think I want to write to a temp file all the time; that seems messy.

These different solutions would each be useful in different situations. So I think the best thing to do is to add some explanatory text to both `update()` and `create_introns()` and update the docs as well.
Addressed in #231
I'm trying to use `FeatureDB.update` and `FeatureDB.create_introns` to add intron features to the database. If the database is created in memory, it is very fast, but if created on disk, it appears to be very slow.

gffutils version: 0.12
python version: 3.12.0
The GTF file I'm using is from the RefSeq FTP site; it's a subset of `GCF_000001405.25_GRCh37.p13_genomic.gtf.gz`.

Code:
If the db is created on disk, I observed that it hangs at this step:

```
Populating features table and first-order relations: 0 features
```

What could cause this? Thanks in advance for any insights!