PiRSquared17 / stoqs

Automatically exported from code.google.com/p/stoqs
GNU General Public License v3.0

DAPloaders.py takes too long to load large data sets #11

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Execute a load script, e.g.: loaders/CANON/loadCANON_may2012.py 
2. You see output like:
...
INFO 2012-06-22 21:03:33,669 DAPloaders DAPloaders.py _genData():512 Reading binary equiv from http://elvis.shore.mbari.org:8080/thredds/dodsC/lrauv/daphne/2012/20120531_20120607/20120606T050637/slate.nc.ascii?mass_concentration_of_chlorophyll_in_sea_water[0:1:33319]
INFO 2012-06-22 21:03:56,665 DAPloaders DAPloaders.py innerProcess_data():759 500 records loaded.
INFO 2012-06-22 21:04:11,600 DAPloaders DAPloaders.py innerProcess_data():759 1000 records loaded.
INFO 2012-06-22 21:04:24,521 DAPloaders DAPloaders.py innerProcess_data():759 1500 records loaded.
....

What is the expected output? What do you see instead?

About 10-15 seconds elapse between each batch of 500 records, giving a load rate of only 33-50 records per second. At this rate it can take several days to load a large data set. We would like to see faster loading of data.

An attempt was made to wrap the process_data() method in DAPloaders.py in a transaction, but the performance gain was modest. Here are the notes from the Mercurial log on implementing that change and finally removing it (a sketch of the approach follows the notes):

changeset:   409:af0ec28ef60e
user:        Mike McCann MBARIMike@gmail.com
date:        Thu May 17 23:57:40 2012 -0700
files:       loaders/DAPloaders.py
description:
Add @transaction.commit_manually() decorator on the process_data() method. Tried adding 'using=self.dbName' as an arg, but self is not seen there. ???

changeset:   545:ee79011d20bd
user:        Mike McCann MBARIMike@gmail.com
date:        Tue Jun 19 00:08:52 2012 -0700
files:       loaders/DAPloaders.py
description:
Removed @commit_manually decorators and transaction.commit(using=self.dbAlias) calls, as the speedup was not enough to counteract the poor exception reporting without decorators on all the methods called in process_data()
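
For reference, here is a minimal sketch of what that decorator-based approach looks like with the pre-Django-1.6 manual transaction API. The class, helper methods, and database alias are illustrative stand-ins, not the exact STOQS code:

```python
# Hypothetical sketch only: Loader, _genData(), _insertRecord(), and DB_ALIAS
# are illustrative stand-ins for the real DAPloaders.py code.
from django.db import transaction

DB_ALIAS = 'stoqs_june2011'   # hypothetical database alias

class Loader(object):
    # The decorator argument is evaluated at class-definition time, so
    # self.dbAlias cannot be passed here -- the problem noted in changeset 409.
    # A module-level alias (or the default database) has to be named instead,
    # and the same alias passed to commit()/rollback().
    @transaction.commit_manually(using=DB_ALIAS)
    def process_data(self):
        try:
            for count, record in enumerate(self._genData(), start=1):
                self._insertRecord(record)   # individual ORM saves accumulate in the open transaction
                if count % 500 == 0:
                    transaction.commit(using=DB_ALIAS)
        except Exception:
            transaction.rollback(using=DB_ALIAS)
            raise
        else:
            transaction.commit(using=DB_ALIAS)
```

As changeset 545 notes, batching commits this way did not buy enough speed to justify the weaker exception reporting inside process_data(), so the decorators were removed.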

I suspect that some form of bulk loading is needed to achieve a real performance boost. I also just noticed bulk_create() in the Django documentation: https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create, which is new in Django 1.4.
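
As an illustration, the loader could accumulate unsaved model instances and hand them to bulk_create() in batches, so each batch becomes a single multi-row INSERT instead of hundreds of individual ones. This is only a sketch under assumed names: MeasuredParameter and its fields are simplified stand-ins for the STOQS schema, and bulk_load_values() is a hypothetical helper:

```python
# Hypothetical sketch: MeasuredParameter and its fields are simplified
# stand-ins for the real STOQS models; bulk_load_values() is not part of
# DAPloaders.py.
from stoqs.models import MeasuredParameter

def bulk_load_values(measurements, parameter, values, dbAlias, batch_size=1000):
    '''Insert (measurement, parameter, datavalue) rows in batches, each batch
    going in as one multi-row INSERT via bulk_create() (Django >= 1.4).'''
    batch = []
    for measurement, datavalue in zip(measurements, values):
        batch.append(MeasuredParameter(measurement=measurement,
                                       parameter=parameter,
                                       datavalue=datavalue))
        if len(batch) >= batch_size:
            MeasuredParameter.objects.using(dbAlias).bulk_create(batch)
            batch = []
    if batch:
        MeasuredParameter.objects.using(dbAlias).bulk_create(batch)
```

The trade-off is that bulk_create() bypasses save() and the pre/post-save signals, so any per-record bookkeeping in the current loader would have to move outside the batch insert, and the rows it references (Measurement, Parameter, etc.) still need to exist first.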

This needs some investigation.

Original issue reported on code.google.com by MBARIm...@gmail.com on 25 Jun 2012 at 5:25

GoogleCodeExporter commented 9 years ago
Doing bulk loads into the highly normalized stoqs database (inserts are made to several tables in each transaction) would entail a fair amount of work, and would duplicate much of what Django already does.

Some code improvements have been made over the last year to reduce load times.  
On my development VM I typically see 6-7 seconds for each 500 records.

Throwing a bit of money and some database tuning at this problem has helped too. This summer we are configuring MBARI's internal server "kraken" following guidance in the book "PostgreSQL 9.0 High Performance"; with spinning disks we get load times of 5-6 seconds per 500 records. We expect this to be much better with the FusionIO drive.

Original comment by MBARIm...@gmail.com on 24 Jul 2013 at 4:54

GoogleCodeExporter commented 9 years ago
Having loaded data from several campaigns since the initial posting of this issue, we've determined that the current load times are acceptable and are closing this issue.

Original comment by MBARIm...@gmail.com on 24 Oct 2013 at 4:47