hou098 opened this issue 2 years ago
Hi @hou098
I looked at two things first.
Prod and Staging VMs are only configured with two CPU cores. However, since you are only looking at using pipes, this shouldn't be an issue.
I compared the size of the data compressed and uncompressed:

    4.2G    AM_data_db_submit_202203050842
    87G     AM_data_db_submit_202203050842_uncompress
and the size of the tar.gz with the files compressed / uncompressed

    -rw-r--r-- 1 bioplatforms bioplatforms 4408525467 Mar 11 16:07 AM_data_db_submit_202203050842.tar.gz
    -rw-r--r-- 1 bioplatforms bioplatforms 4457138470 Mar 18 11:40 AM_data_db_submit_202203050842_uncompress.tar.gz
So my question is: do we need these files compressed (except for transfer)?
Hi @mtearle,
> 4.2G AM_data_db_submit_202203050842

So that's a directory full of *.gz files?

> 87G AM_data_db_submit_202203050842_uncompress

And that's the same thing but with the files gunzipped?

> So my question is: do we need these files compressed (except for transfer)?
My bet is that there's probably an overall win in reading and decompressing the *.gz files on the fly in Python. I'm pretty sure gz decompression is reasonably cheap, and probably faster than the extra IO required to read the equivalent uncompressed data, especially since there's so much redundancy in those text files (lots of repeated strings and characters, hence 87 GB vs 4.2 GB).
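For concreteness, this is the sort of on-the-fly reading I mean (the path and delimiter here are just made-up examples, not the real ingest code):

```python
import csv
import gzip

# Stream a gzipped tab-separated file without ever writing the
# uncompressed data to disk; rows are decompressed as we iterate.
with gzip.open("AM_data_db_submit_202203050842/example.taxonomy.gz", "rt", newline="") as fh:
    for row in csv.reader(fh, delimiter="\t"):
        pass  # process each row here
```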
I think the bigger bottleneck is getting the data into Postgres. Wherever possible I'm slurping it in from CSV files, which I think is the fastest known method (see https://www.postgresql.org/docs/current/populate.html). By doing this with pipes we might be able to reduce the load time to roughly the time taken to do the Postgres COPYs. I suspect doing some research on optimising Postgres inserts may give us the biggest win.
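i.e. something along these lines, assuming we can get at the underlying psycopg2 connection (e.g. via SQLAlchemy's `engine.raw_connection()`); the table name and COPY options below are just placeholders:

```python
import gzip

def copy_gz_csv(pg_conn, gz_path, table):
    """Stream a gzipped CSV straight into Postgres with COPY FROM STDIN."""
    with gzip.open(gz_path, "rt", newline="") as fh, pg_conn.cursor() as cur:
        # copy_expert reads from the file-like object until EOF,
        # so the data never has to be decompressed to disk first.
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", fh)
    pg_conn.commit()
```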
As for wrapping all those *.gz files up in a tar file, there's probably no point in gzipping the tar file. On the other hand, it probably doesn't hurt much either.
Ingest might be sped up by using pipes instead of temporary files, so that we can do the db update in parallel with reading the `*.gz` files. See https://docs.python.org/3/library/multiprocessing.html and https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Pipe

Use `Connection.fileno()` to get the underlying file handles, then `fdopen()` those to get file objects that `csv.writer()` and `copy_expert()` can write/read respectively. https://docs.python.org/3/library/multiprocessing.html#multiprocessing.connection.Connection.fileno

From https://docs.sqlalchemy.org/en/13/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork
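Something along these lines, perhaps (very rough sketch, Unix-only, assuming the default fork start method and a psycopg2 cursor on the reading side; `make_rows` and `some_table` are placeholders rather than the real ingest code):

```python
import csv
import multiprocessing as mp
import os

def producer(conn, gz_paths):
    # Child process: parse the *.gz files and write CSV rows into the pipe.
    with os.fdopen(conn.fileno(), "w", closefd=False) as f:
        writer = csv.writer(f)
        for row in make_rows(gz_paths):  # placeholder for the real parser
            writer.writerow(row)
    conn.close()  # closes the write end, signalling EOF to the reader

def ingest(pg_conn, gz_paths):
    recv_conn, send_conn = mp.Pipe(duplex=False)
    child = mp.Process(target=producer, args=(send_conn, gz_paths))
    child.start()
    send_conn.close()  # parent only reads; the child holds the write end
    # COPY runs in the parent while the child is still parsing/decompressing.
    with os.fdopen(recv_conn.fileno(), "r", closefd=False) as f, pg_conn.cursor() as cur:
        cur.copy_expert("COPY some_table FROM STDIN WITH (FORMAT csv)", f)
    pg_conn.commit()
    child.join()
```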
Needs investigation. A lot of the ingest time seems to be spent in the Postgres COPY FROM STDIN command, so the gain might not be dramatic. Note that the taxonomy ontologies and `otu.OTU()` need to be built before `otu.Taxonomy()` can be populated, so we will still need to make multiple passes through the `*.taxonomy.gz` files. Might get more of a win with the abundance files (`*.txt.gz`).