OTU+contextual download performance

The OTU+contextual download is slow. It gets about 200KB/s in production.

In the development environment, top shows that the python process is probably the bottleneck, as it hovers close to 100% CPU utilisation during the download. This indicates that it's not waiting on IO, nor is it bound by the database server performance.

I tested the OTU+contextual download with compression switched off, but it made very little difference. It's likely that most of the problem is just the Python interpreter overhead. See timing below.

While python performance is the ultimate bottleneck, the postgres database queries can also be a bottleneck, especially when multiple taxonomies are involved.

To address this, I've introduced the otu.taxonomy_otu_export table, partitioned on taxonomy_source_id. This slows down the ingest and increases storage space, but simplifies the runtime query.

See https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta2

Simplifying the query seems to have had a reasonably beneficial effect. Without the partitioned table the query is

 SELECT otu.taxonomy.id, otu.taxonomy.amplicon_id, otu.taxonomy.taxonomy_source_id, otu.taxonomy.traits, otu.taxonomy.r1_id, otu.taxonomy.r2_id, otu.taxonomy.r3_id, otu.taxonomy.r4_id, otu.taxonomy.r5_id, otu.taxonomy.r6_id, otu.taxonomy.r7_id, otu.taxonomy.r8_id, otu.otu.id, otu.otu.code, otu.sample_otu.sample_id, otu.sample_otu.otu_id, otu.sample_otu.count, otu.sample_otu.count_20k
 FROM otu.sample_otu, otu.taxonomy JOIN otu.taxonomy_otu AS taxonomy_otu_1 ON otu.taxonomy.id = taxonomy_otu_1.taxonomy_id JOIN otu.otu ON otu.otu.id = taxonomy_otu_1.otu_id
 WHERE otu.otu.id = otu.sample_otu.otu_id AND otu.taxonomy.amplicon_id = 5 AND otu.taxonomy.taxonomy_source_id = 3

vs with the partitioned table the query is

SELECT otu.taxonomy_otu_export.amplicon_id, otu.taxonomy_otu_export.traits, otu.taxonomy_otu_export.otu_id, otu.taxonomy_otu_export.taxonomy_source_id, otu.taxonomy_otu_export.r1_id, otu.taxonomy_otu_export.r2_id, otu.taxonomy_otu_export.r3_id, otu.taxonomy_otu_export.r4_id, otu.taxonomy_otu_export.r5_id, otu.taxonomy_otu_export.r6_id, otu.taxonomy_otu_export.r7_id, otu.taxonomy_otu_export.r8_id, otu.sample_otu.sample_id, otu.sample_otu.otu_id, otu.sample_otu.count, otu.sample_otu.count_20k, otu.otu.id, otu.otu.code 
FROM otu.taxonomy_otu_export, otu.sample_otu, otu.otu 
WHERE otu.taxonomy_otu_export.otu_id = otu.sample_otu.otu_id AND otu.sample_otu.otu_id = otu.otu.id AND otu.taxonomy_otu_export.amplicon_id = 3 AND otu.taxonomy_otu_export.taxonomy_source_id = 8

Database performance on production

This has been enhanced by switching to a larger AWS instance in late 2022. Before this the database performance was terrible, and probably the bottleneck for OTU+contextual download. It now seems to be better and is probably no longer the bottleneck.

Pypy?

See https://docs.sqlalchemy.org/en/14/faq/performance.html

We can reduce the time by a factor of nearly three using recent versions of PyPy:

This will involve talking to bioplatforms.com, as they handle the production deployment.

Alternative idea: use md5 OTUs and provide a FASTA file lookup table in the download bundle

This actually slowed down the OTU+contextual download slightly.

This was tested using https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta1 which added a materialized view in an attempt to speed up the download. (That didn't speed up the multiple-taxonomy case very much, and used a lot of space, so I abandoned that in favour of using a partitioned table (taxonomy_otu_export))

With md5 OTUs in the materialized view:

217082700 function calls (217082648 primitive calls) in 288.911 seconds
...
# ncalls  tottime  percall  cumtime  percall filename:lineno(function)
52262   32.430    0.001   33.268    0.001 {method 'fetchmany' of 'psycopg2.extensions.cursor' objects}
2612808   23.358    0.000   23.358    0.000 {method 'compress' of 'zlib.Compress' objects}

vs long GATC-style OTUs in the materialized view

 217083101 function calls (217083049 primitive calls) in 255.502 seconds
...
# ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     52262   29.644    0.001   30.317    0.001 {method 'fetchmany' of 'psycopg2.extensions.cursor' objects}
   2612808   28.072    0.000   28.072    0.000 {method 'compress' of 'zlib.Compress' objects}

This difference might be due to random variation or some cache effects, but using the hashed OTUs is no quicker. This doesn't even include the generation of the separate FASTA lookup table file, which needs to be included in the zipfile download.

The short md5-style OTUs were tested by regenerating the materialized view using

    __table__ = create_materialized_view(
        name='taxonomy_sample_otu',
        selectable=select(
            [
     ...
                func.md5(OTU.code).label('code'),

See https://github.com/BioplatformsAustralia/bpaotu/tree/hashed-otu-fasta-sidecar for a prototype that implements this FASTA-sidecar idea. Note that this pre-dates the use of the otu.taxonomy_sample_otu materialized view.

Profiling

Profiling the download is a bit tricky as you need to profile the iterators called by zipstream.ZipFile during the streaming download, but it can be done with some decorators and a wrapper generator (see below), which probably doesn't affect the profiling too much.

Again, this was done using https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta1 which uses a materialized view instead of the partitioned table taxonomy_otu_export, but this shouldn't affect the results too much in this case.

With one taxonomy, long GATC-style OTUs,and using the materialized view for OTU+Contextual download in https://github.com/BioplatformsAustralia/bpaotu/tree/1.37.0-beta1 we see the following. Note the 2nd last line which basically shows that it's using close to 100% CPU.

runserver_1     |          217083101 function calls (217083049 primitive calls) in 252.864 seconds
runserver_1     |
runserver_1     |    Ordered by: cumulative time
runserver_1     |    List reduced from 266 to 100 due to restriction <100>
runserver_1     |
runserver_1     |    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
runserver_1     |   2612808    2.380    0.000  249.054    0.000 /env/lib/python3.8/site-packages/werkzeug/debug/__init__.py:326(debug_application)
runserver_1     |   2612808    1.044    0.000  244.362    0.000 /env/lib/python3.8/site-packages/zipstream/__init__.py:180(__iter__)
runserver_1     |   2612808    6.792    0.000  243.318    0.000 /env/lib/python3.8/site-packages/zipstream/__init__.py:231(__write)
runserver_1     |   2612808    1.156    0.000  199.835    0.000 /app/bpaotu/bpaotu/tabular.py:71(inner)
runserver_1     |   2612809   16.583    0.000  198.680    0.000 /app/bpaotu/bpaotu/tabular.py:143(sample_otu_csv_rows_mv)
runserver_1     |   2612808    4.667    0.000   94.806    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/loading.py:35(instances)
runserver_1     |     52257    1.932    0.000   47.036    0.001 /env/lib/python3.8/site-packages/sqlalchemy/orm/loading.py:81(<listcomp>)
runserver_1     |   2612807   13.786    0.000   45.104    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/loading.py:509(_instance)
runserver_1     |   2612807    2.382    0.000   44.921    0.000 /app/bpaotu/bpaotu/tabular.py:226(ids_to_names)
runserver_1     |   2612807   13.112    0.000   42.539    0.000 /app/bpaotu/bpaotu/tabular.py:227(<listcomp>)
runserver_1     |     52258    0.146    0.000   34.768    0.001 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:1281(fetchmany)
runserver_1     |     52258    1.787    0.000   33.731    0.001 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:1452(_fetchmany_impl)
runserver_1     |   2612809    1.318    0.000   31.570    0.000 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:1443(_fetchone_impl)
runserver_1     |     52262    0.324    0.000   30.589    0.001 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:1430(__buffer_rows)
runserver_1     |     52262   29.515    0.001   30.180    0.001 {method 'fetchmany' of 'psycopg2.extensions.cursor' objects}
runserver_1     |   2612808   27.418    0.000   27.418    0.000 {method 'compress' of 'zlib.Compress' objects}
runserver_1     |  20954776    9.054    0.000   23.906    0.000 {built-in method builtins.getattr}
runserver_1     |  31353684   22.909    0.000   22.909    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py:278(__get__)
runserver_1     |   2612808   22.606    0.000   22.606    0.000 {method 'writerow' of '_csv.writer' objects}
runserver_1     |   2612807   13.054    0.000   16.720    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/loading.py:710(_populate_full)
runserver_1     |   2612807    4.130    0.000    9.239    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/instrumentation.py:340(new_instance)
runserver_1     |   2612808    3.995    0.000    5.980    0.000 /env/lib/python3.8/site-packages/zipstream/__init__.py:91(write)
runserver_1     |   2612807    2.849    0.000    5.741    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/state.py:371(_cleanup)
runserver_1     |  20902456    5.534    0.000    5.534    0.000 /app/bpaotu/bpaotu/tabular.py:85(_ontology_lookup)
runserver_1     |   2612807    1.918    0.000    4.328    0.000 /app/bpaotu/bpaotu/util.py:21(array_or_empty)
runserver_1     |   2612808    2.823    0.000    3.810    0.000 /env/lib/python3.8/site-packages/werkzeug/serving.py:246(write)
runserver_1     |   2612807    1.606    0.000    3.667    0.000 /env/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/array.py:337(process)
runserver_1     |   2612807    3.600    0.000    3.600    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/state.py:89(__init__)
runserver_1     |   2612807    2.287    0.000    2.892    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/identity.py:216(_fast_discard)
runserver_1     |     52259    0.104    0.000    2.730    0.000 /env/lib/python3.8/site-packages/sqlalchemy/util/_collections.py:775(unique_list)
runserver_1     |     52257    1.521    0.000    2.625    0.000 /env/lib/python3.8/site-packages/sqlalchemy/util/_collections.py:781(<listcomp>)
runserver_1     |   2612807    1.732    0.000    2.528    0.000 /app/bpaotu/bpaotu/util.py:53(format_sample_id)
runserver_1     |   2612808    2.459    0.000    2.459    0.000 {built-in method zlib.crc32}
runserver_1     | 1362921/1362920    1.691    0.000    2.411    0.000 {method 'join' of 'str' objects}
runserver_1     |   2612808    2.000    0.000    2.312    0.000 /env/lib/python3.8/site-packages/django/http/response.py:224(make_bytes)
runserver_1     |  11814271    2.166    0.000    2.166    0.000 {built-in method builtins.isinstance}
runserver_1     |   1362917    1.656    0.000    2.060    0.000 /env/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/array.py:288(_proc_array)
runserver_1     |   2612807    1.636    0.000    1.636    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/identity.py:161(_add_unpresent)
runserver_1     |   2612807    1.346    0.000    1.346    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/loading.py:525(<listcomp>)
runserver_1     |   2612807    1.215    0.000    1.215    0.000 {method 'issuperset' of 'frozenset' objects}
runserver_1     |   7890721    1.174    0.000    1.174    0.000 {built-in method builtins.len}
runserver_1     |   2612807    1.162    0.000    1.162    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/identity.py:166(get)
runserver_1     |   2612808    1.153    0.000    1.153    0.000 {method 'getvalue' of '_io.StringIO' objects}
runserver_1     |   2612808    0.917    0.000    0.917    0.000 {method 'encode' of 'str' objects}
runserver_1     |     52258    0.120    0.000    0.868    0.000 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:1240(process_rows)
runserver_1     |   2612807    0.834    0.000    0.834    0.000 <string>:1(set)
runserver_1     |   2612807    0.796    0.000    0.796    0.000 {method 'startswith' of 'str' objects}
runserver_1     |   2612808    0.779    0.000    0.779    0.000 {method 'truncate' of '_io.StringIO' objects}
runserver_1     |     52258    0.748    0.000    0.748    0.000 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:1253(<listcomp>)
runserver_1     |   4174062    0.719    0.000    0.719    0.000 /app/bpaotu/bpaotu/util.py:25(<genexpr>)
runserver_1     |   2612807    0.714    0.000    0.714    0.000 {method 'replace' of 'str' objects}
runserver_1     |   2612812    0.674    0.000    0.674    0.000 {built-in method __new__ of type object at 0x7f4aa7b9f940}
runserver_1     |    836194    0.328    0.000    0.665    0.000 /usr/local/lib/python3.8/encodings/utf_8.py:15(decode)
runserver_1     |         1    0.000    0.000    0.653    0.653 /env/lib/python3.8/site-packages/sqlalchemy/orm/query.py:3400(__iter__)
runserver_1     |         1    0.000    0.000    0.653    0.653 /env/lib/python3.8/site-packages/sqlalchemy/orm/query.py:3425(_execute_and_instances)
runserver_1     |         1    0.000    0.000    0.653    0.653 /env/lib/python3.8/site-packages/sqlalchemy/engine/base.py:916(execute)
runserver_1     |         1    0.000    0.000    0.653    0.653 /env/lib/python3.8/site-packages/sqlalchemy/sql/elements.py:291(_execute_on_connection)
runserver_1     |         1    0.000    0.000    0.653    0.653 /env/lib/python3.8/site-packages/sqlalchemy/engine/base.py:1052(_execute_clauseelement)
runserver_1     |         1    0.000    0.000    0.651    0.651 /env/lib/python3.8/site-packages/sqlalchemy/engine/base.py:1159(_execute_context)
runserver_1     |         1    0.000    0.000    0.651    0.651 /env/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py:579(get_result_proxy)
runserver_1     |         1    0.000    0.000    0.651    0.651 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:767(__init__)
runserver_1     |         1    0.000    0.000    0.651    0.651 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:1408(_init_metadata)
runserver_1     |   5225615    0.617    0.000    0.617    0.000 {built-in method builtins.id}
runserver_1     |   2612807    0.605    0.000    0.605    0.000 {method 'pop' of 'dict' objects}
runserver_1     |   2612808    0.603    0.000    0.603    0.000 {method 'flush' of '_io._IOBase' objects}
runserver_1     |   2612808    0.535    0.000    0.535    0.000 /env/lib/python3.8/site-packages/zipstream/__init__.py:61(closed)
runserver_1     |   2612838    0.488    0.000    0.488    0.000 {method 'add' of 'set' objects}
runserver_1     |   2612808    0.486    0.000    0.486    0.000 {method 'seek' of '_io.StringIO' objects}
runserver_1     |   2612870    0.374    0.000    0.374    0.000 {method 'append' of 'list' objects}
runserver_1     |    836194    0.337    0.000    0.337    0.000 {built-in method _codecs.utf_8_decode}
runserver_1     |   2612807    0.313    0.000    0.313    0.000 {method 'popleft' of 'collections.deque' objects}
runserver_1     |      1656    0.008    0.000    0.062    0.000 /usr/local/lib/python3.8/socketserver.py:825(write)
runserver_1     |     52262    0.056    0.000    0.056    0.000 {built-in method builtins.min}
runserver_1     |      1656    0.054    0.000    0.054    0.000 {method 'sendall' of '_socket.socket' objects}
runserver_1     |     52275    0.016    0.000    0.016    0.000 {method 'get' of 'dict' objects}
runserver_1     |     52257    0.013    0.000    0.013    0.000 {method 'items' of 'dict' objects}
runserver_1     |         1    0.000    0.000    0.004    0.004 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:1439(_soft_close)
runserver_1     |         1    0.000    0.000    0.004    0.004 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:924(_soft_close)
runserver_1     |         1    0.000    0.000    0.004    0.004 /env/lib/python3.8/site-packages/sqlalchemy/engine/base.py:1333(_safe_close_cursor)
runserver_1     |         1    0.004    0.004    0.004    0.004 {method 'close' of 'psycopg2.extensions.cursor' objects}
runserver_1     |         1    0.000    0.000    0.001    0.001 <string>:1(<lambda>)
runserver_1     |         1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/sql/elements.py:405(compile)
runserver_1     |         1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/sql/elements.py:470(_compiler)
runserver_1     |         1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:527(__init__)
runserver_1     |         1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:274(__init__)
runserver_1     |         1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:349(process)
runserver_1     |      43/1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/sql/visitors.py:86(_compiler_dispatch)
runserver_1     |         1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:2045(visit_select)
runserver_1     |         1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/engine/default.py:589(do_execute)
runserver_1     |         1    0.001    0.001    0.001    0.001 {method 'execute' of 'psycopg2.extensions.cursor' objects}
runserver_1     |         1    0.000    0.000    0.001    0.001 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:2127(<listcomp>)
runserver_1     |        15    0.000    0.000    0.001    0.000 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:1799(_label_select_column)
runserver_1     |         1    0.000    0.000    0.000    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/query.py:3929(_compile_context)
runserver_1     |        15    0.000    0.000    0.000    0.000 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:810(visit_label)
runserver_1     |         1    0.000    0.000    0.000    0.000 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:793(_init_metadata)
runserver_1     |         1    0.000    0.000    0.000    0.000 /env/lib/python3.8/site-packages/sqlalchemy/engine/result.py:269(__init__)
runserver_1     |         1    0.000    0.000    0.000    0.000 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:2233(_compose_select_body)
runserver_1     |         1    0.000    0.000    0.000    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/loading.py:59(<listcomp>)
runserver_1     |         1    0.000    0.000    0.000    0.000 /env/lib/python3.8/site-packages/sqlalchemy/orm/query.py:4262(row_processor)
runserver_1     |        18    0.000    0.000    0.000    0.000 /env/lib/python3.8/site-packages/sqlalchemy/sql/compiler.py:864(visit_column)
runserver_1     |
runserver_1     |
runserver_1     | [('u', 236.55), ('s', 11.270000000000003), ('U', 0.0), ('S', 0.0), ('e', 254.589999999851)]
runserver_1     | === end sample_otu_csv_rows_mv

Profiling wrapper code

"""
Example usage:

from profiling import gen_dec

@gen_dec('sample_otu_csv_rows')
def sample_otu_csv_rows(taxonomy_labels, ids_to_names, q):
    ...
    for row in q.yield_per(50):
        ...
        yield something
"""

from datetime import datetime
from contextlib import contextmanager
import cProfile, pstats, io, time, os
from pstats import SortKey

@contextmanager
def time_exec(label):
    print('== start ', label)
    start = datetime.now()
    try:
        yield start
    finally:
        end = datetime.now()
        print('=== end ', label, end - start)

@contextmanager
def time_profile(label):
    print('== start ', label)
    start = os.times()
    # Use time.process_time to profile CPU time instead of wall time
    #pr = cProfile.Profile(time.process_time)
    pr = cProfile.Profile()
    try:
        pr.enable()
        yield pr
    finally:
        pr.disable()
        end = os.times()
        ps = pstats.Stats(pr).sort_stats(SortKey.CUMULATIVE)
        ps.print_stats(100)
        print([(lab, (b-a)) for lab, a, b in zip('usUSe', start, end)])
        print('=== end', label)

def gen_dec(label, timer_class=time_profile):
    """"
    timer_class: time_profile or time_exec
    """
    def d(gen):
        def inner(*args, **kwargs):
            with timer_class(label):
                for v in gen(*args, **kwargs):
                    yield v
        return inner
    return d

BioplatformsAustralia / bpaotu