Open hou098 opened 1 year ago
~~It might also be worth partitioning `otu.taxonomy` and (if possible) `otu.taxonomy_otu` by `taxonomy_source_id`. See https://www.postgresql.org/docs/15/ddl-partitioning.html~~
In postgres 10, there are quite a few limitations on partitioning, and it looks like upgrading to postgres 11 requires changing the underlying database files.
I've introduced `otu.taxonomy_otu_export` and partitioned that, which seems to work OK. See https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta2
See https://docs.sqlalchemy.org/en/13/dialects/postgresql.html?highlight=partition https://github.com/sqlalchemy/sqlalchemy/issues/5313#issuecomment-931332837 https://stackoverflow.com/questions/61545680/postgresql-partition-and-sqlalchemy https://gist.github.com/Multihuntr/8e613ac6fe86967e84ee0e0e921bdffb
The OTU+contextual download is slow. It gets about 200KB/s in production.
In the development environment, `top` shows that the Python process is probably the bottleneck, as it hovers close to 100% CPU utilisation during the download. This indicates that it's not waiting on IO, nor is it bound by the database server performance.
I tested the OTU+contextual download with compression switched off, but it made very little difference. It's likely that most of the problem is just the Python interpreter overhead. See timing below.
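As a rough cross-check of the `top` observation, CPU time can be compared with wall-clock time while draining an iterator. This is only a sketch using the standard library; `consume_with_timing`, `io_like_chunks` and `cpu_like_chunks` are hypothetical names and are not part of bpaotu. If CPU time is close to wall time, the consumer is CPU-bound; if it is much smaller, the process is mostly waiting.

```python
import time


def consume_with_timing(chunks):
    """Drain an iterable and return (item_count, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    count = 0
    for _ in chunks:
        count += 1
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return count, wall, cpu


def io_like_chunks(n, delay=0.05):
    """Hypothetical source that mostly waits (stand-in for a slow database)."""
    for _ in range(n):
        time.sleep(delay)
        yield b"x"


def cpu_like_chunks(n):
    """Hypothetical source that mostly computes (stand-in for row formatting)."""
    for _ in range(n):
        yield str(sum(j * j for j in range(20000))).encode()


if __name__ == "__main__":
    for name, source in (("io-like", io_like_chunks(5)), ("cpu-like", cpu_like_chunks(50))):
        count, wall, cpu = consume_with_timing(source)
        print(f"{name}: {count} chunks, wall={wall:.3f}s, cpu={cpu:.3f}s")
```

In a real test the iterable would be whatever the download view streams, which is an assumption about how the code is structured.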
While python performance is the ultimate bottleneck, the postgres database queries can also be a bottleneck, especially when multiple taxonomies are involved.
To address this, I've introduced the `otu.taxonomy_otu_export` table, partitioned on `taxonomy_source_id`. This slows down the ingest and increases storage space, but simplifies the runtime query.
See https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta2
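For illustration, PostgreSQL 10 declarative list partitioning uses a parent table declared with `PARTITION BY LIST` plus one child table per key value. The sketch below only builds the DDL strings. The column list and the helper name `partition_ddl` are assumptions for the example and do not reflect the actual bpaotu schema.

```python
def partition_ddl(schema, table, columns, key, values):
    """Return DDL statements for a LIST-partitioned table.

    One child partition is created per integer key value.
    """
    cols = ",\n    ".join(f"{name} {sqltype}" for name, sqltype in columns)
    parent = f"{schema}.{table}"
    statements = [
        f"CREATE TABLE {parent} (\n    {cols}\n) PARTITION BY LIST ({key});"
    ]
    for value in values:
        value = int(value)  # only integer partition keys are handled here
        statements.append(
            f"CREATE TABLE {parent}_{value} PARTITION OF {parent} "
            f"FOR VALUES IN ({value});"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical columns, chosen only to make the example concrete.
    example_columns = [
        ("taxonomy_source_id", "integer NOT NULL"),
        ("otu_id", "integer NOT NULL"),
        ("sample_id", "varchar NOT NULL"),
    ]
    for stmt in partition_ddl(
        "otu", "taxonomy_otu_export", example_columns, "taxonomy_source_id", [1, 2]
    ):
        print(stmt)
```

With this layout a query that filters on a single `taxonomy_source_id` only needs to scan the matching partition, which is the intended simplification of the runtime query.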
Simplifying the query seems to have had a reasonably beneficial effect. Without the partitioned table the query is
vs with the partitioned table the query is
Database performance on production
This has been enhanced by switching to a larger AWS instance in late 2022. Before this the database performance was terrible, and probably the bottleneck for OTU+contextual download. It now seems to be better and is probably no longer the bottleneck.
Pypy?
See https://docs.sqlalchemy.org/en/14/faq/performance.html
This will involve talking to bioplatforms.com, as they handle the production deployment.
Alternative idea: use md5 OTUs and provide a FASTA file lookup table in the download bundle
This actually slowed down the OTU+contextual download slightly.
This was tested using https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta1 which added a materialized view in an attempt to speed up the download. (That didn't speed up the multiple-taxonomy case very much, and used a lot of space, so I abandoned that in favour of using a partitioned table (taxonomy_otu_export))
With md5 OTUs in the materialized view:
vs long GATC-style OTUs in the materialized view
This difference might be due to random variation or some cache effects, but using the hashed OTUs is no quicker. This doesn't even include the generation of the separate FASTA lookup table file, which needs to be included in the zipfile download.
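To make the FASTA-sidecar idea concrete, here is a minimal sketch of hashing sequences to md5 identifiers and emitting a lookup file. It assumes the identifier is the md5 hex digest of the ASCII sequence string, which may not match how the prototype branch derives its hashes. `md5_otu` and `fasta_sidecar` are hypothetical names.

```python
import hashlib


def md5_otu(sequence):
    """Return a short identifier for a long GATC-style OTU sequence."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def fasta_sidecar(sequences):
    """Yield FASTA records mapping each md5 identifier back to its sequence.

    Duplicate sequences are written once.
    """
    seen = set()
    for seq in sequences:
        otu_id = md5_otu(seq)
        if otu_id in seen:
            continue
        seen.add(otu_id)
        yield f">{otu_id}\n{seq}\n"


if __name__ == "__main__":
    for record in fasta_sidecar(["GATCGATC", "ACGTACGT", "GATCGATC"]):
        print(record, end="")
```

Because `fasta_sidecar` is a generator, it could be streamed into the zip bundle in the same way as the other files, although generating it is extra work on top of the main download.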
The short md5-style OTUs were tested by regenerating the materialized view using
See https://github.com/BioplatformsAustralia/bpaotu/tree/hashed-otu-fasta-sidecar for a prototype that implements this FASTA-sidecar idea. Note that this pre-dates the use of the `otu.taxonomy_sample_otu` materialized view.
Profiling
Profiling the download is a bit tricky as you need to profile the iterators called by `zipstream.ZipFile` during the streaming download, but it can be done with some decorators and a wrapper generator (see below), which probably doesn't affect the profiling too much.
Again, this was done using https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta1 which uses a materialized view instead of the partitioned table `taxonomy_otu_export`, but this shouldn't affect the results too much in this case.
With one taxonomy, long GATC-style OTUs, and using the materialized view for OTU+Contextual download in https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta1 we see the following. Note the 2nd last line which basically shows that it's using close to 100% CPU.
Profiling wrapper code
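A minimal sketch of a wrapper generator of this kind, using only the standard library `cProfile` and `pstats` modules. This is not the code used in bpaotu. `profiled`, `report`, `slow_row` and `fake_rows` are hypothetical names, and `fake_rows` merely stands in for an iterator that would be handed to `zipstream.ZipFile`. The profiler is enabled only while the wrapped iterator is producing its next item, so time spent in the consumer between items is not counted.

```python
import cProfile
import io
import pstats


def profiled(iterable, profiler):
    """Yield items from iterable, profiling only the work done to produce each item."""
    iterator = iter(iterable)
    while True:
        profiler.enable()
        try:
            item = next(iterator)
        except StopIteration:
            return
        finally:
            # Always switch the profiler off before handing control back.
            profiler.disable()
        yield item


def report(profiler, limit=10):
    """Return the top entries by cumulative time as text."""
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


def slow_row(i):
    """Hypothetical per-row work, standing in for formatting a CSV line."""
    return str(sum(j * j for j in range(1000 + i))).encode()


def fake_rows(n):
    """Hypothetical stand-in for an iterator passed to the streaming zip."""
    for i in range(n):
        yield slow_row(i)


if __name__ == "__main__":
    profiler = cProfile.Profile()
    chunks = list(profiled(fake_rows(100), profiler))
    print(f"{len(chunks)} chunks")
    print(report(profiler))
```

Calling `enable()` and `disable()` around every item adds some overhead of its own, so absolute timings from a wrapper like this should be treated as approximate.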