BioplatformsAustralia / bpaotu

OTU database access for the Australian Microbiome
GNU Affero General Public License v3.0

Speed up ingest by using pipes instead of temporary files? #189

Open hou098 opened 2 years ago

hou098 commented 2 years ago

Ingest might be sped up by using pipes instead of temporary files so that we can do the db update in parallel with reading the *.gz files.

See https://docs.python.org/3/library/multiprocessing.html and https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Pipe

Use Connection.fileno() to get the underlying file descriptors, then fdopen() those to get file objects that csv.writer() and copy_expert() can write to and read from, respectively. https://docs.python.org/3/library/multiprocessing.html#multiprocessing.connection.Connection.fileno
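A rough sketch of what that could look like, using a plain os.pipe() for clarity (the same file descriptors could come from multiprocessing.Pipe via Connection.fileno()). The table name, file layout and parsing are made up, and it assumes the fork start method so the child inherits the pipe fds; this is not the actual ingest code.

import csv
import gzip
import os
from multiprocessing import Process

import psycopg2


def write_rows(write_fd, taxonomy_path):
    # Child process: parse a *.taxonomy.gz file and stream CSV rows into the pipe.
    with os.fdopen(write_fd, "w", newline="") as pipe_out:
        writer = csv.writer(pipe_out)
        with gzip.open(taxonomy_path, "rt") as fh:
            for line in fh:
                # Placeholder parsing; the real files have their own layout.
                writer.writerow(line.rstrip("\n").split("\t"))


def ingest_one(taxonomy_path, dsn, table="otu_taxonomy"):
    read_fd, write_fd = os.pipe()
    child = Process(target=write_rows, args=(write_fd, taxonomy_path))
    child.start()
    os.close(write_fd)  # parent keeps only the read end so EOF propagates

    with os.fdopen(read_fd, "r") as pipe_in:
        conn = psycopg2.connect(dsn)
        try:
            with conn, conn.cursor() as cur:
                # COPY reads from the pipe until the child closes its write end
                cur.copy_expert(
                    "COPY {} FROM STDIN WITH (FORMAT csv)".format(table), pipe_in)
        finally:
            conn.close()
    child.join()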

From https://docs.sqlalchemy.org/en/13/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork

from multiprocessing import Process

from sqlalchemy import create_engine, text

engine = create_engine("mysql://user:pass@host/dbname")

def run_in_process():
    # process starts.  ensure engine.dispose() is called just once
    # at the beginning
    engine.dispose()

    with engine.connect() as conn:
        conn.execute(text("..."))

p = Process(target=run_in_process)
p.start()

Needs investigation. A lot of the ingest time seems to be spent in the Postgres COPY FROM STDIN command, so the gain might not be dramatic. Note that the taxonomy ontologies and otu.OTU() need to be built before otu.Taxonomy() can be populated, so we will still need to make multiple passes through the *.taxonomy.gz files. We might get more of a win with the abundance files (*.txt.gz).

mtearle commented 2 years ago

Hi @hou098

I looked at two things first.

Prod and Staging VMs are only configured with two CPU cores. However, since you are only looking at using pipes, this shouldn't be an issue.

I compared the size of the data compressed and uncompressed:

4.2G  AM_data_db_submit_202203050842
87G   AM_data_db_submit_202203050842_uncompress

and the size of the tar.gz archives with the files compressed / uncompressed:

-rw-r--r-- 1 bioplatforms bioplatforms 4408525467 Mar 11 16:07 AM_data_db_submit_202203050842.tar.gz
-rw-r--r-- 1 bioplatforms bioplatforms 4457138470 Mar 18 11:40 AM_data_db_submit_202203050842_uncompress.tar.gz

So my question is do we need these files compressed (except for transfer)?

hou098 commented 2 years ago

Hi @mtearle ,

4.2G AM_data_db_submit_202203050842

So that's a directory full of *.gz files?

87G AM_data_db_submit_202203050842_uncompress

And that's the same thing but with the files gunzipped?

So my question is do we need these files compressed (except for transfer)?

My bet is that there's probably an overall win in reading and decompressing the *.gz files on the fly in Python. I'm pretty sure gzip decompression is reasonably cheap, and probably faster than the extra IO required to read the equivalent uncompressed data, especially since there's so much redundancy in those text files (lots of repeated strings and characters, hence 87G vs 4.2G).
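If we want to check that, a quick micro-benchmark along these lines would settle it (the file names are placeholders, and it assumes a matching uncompressed copy exists to compare against):

import gzip
import time


def time_read(open_fn, path):
    # Stream the file line by line; return elapsed seconds plus a character
    # count so the read loop isn't dead code.
    start = time.perf_counter()
    total = 0
    with open_fn(path) as fh:
        for line in fh:
            total += len(line)
    return time.perf_counter() - start, total


gz_secs, _ = time_read(lambda p: gzip.open(p, "rt"), "sample.taxonomy.gz")
raw_secs, _ = time_read(lambda p: open(p, "rt"), "sample.taxonomy")
print("gzip on the fly: %.1fs, plain text: %.1fs" % (gz_secs, raw_secs))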

I think the bigger bottleneck is getting the data into Postgres. Wherever possible I'm slurping it in from CSV files, which I think is the fastest known method (see https://www.postgresql.org/docs/current/populate.html). By doing this with pipes we might be able to reduce the load time down to just the time taken to do the Postgres COPYs, or thereabouts. I suspect doing some research on optimising Postgres inserts may give us the biggest win.
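For reference, the kind of per-session tuning that page suggests could be tried from the ingest code along these lines (the table name is made up, and whether each knob actually helps on our data would need measuring):

import psycopg2


def tuned_copy(dsn, pipe_in, table="otu_taxonomy"):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            # Do the whole load in one transaction, per the populate.html advice.
            # maintenance_work_mem mainly helps any index rebuilds afterwards.
            cur.execute("SET LOCAL maintenance_work_mem = '1GB'")
            cur.execute("SET LOCAL synchronous_commit = off")
            cur.copy_expert(
                "COPY {} FROM STDIN WITH (FORMAT csv)".format(table), pipe_in)
        with conn, conn.cursor() as cur:
            # Refresh planner statistics once the bulk load is done.
            cur.execute("ANALYZE {}".format(table))
    finally:
        conn.close()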

As for wrapping all those *.gz files up in a tar file, there's probably no point in gzipping the tar since its members are already compressed (the two tarball sizes above are nearly identical). On the other hand, it probably doesn't hurt much either.