dkirkby / bossdata

Tools for accessing SDSS BOSS data
MIT License

Slow creation of spAll full db #106

Closed dkirkby closed 6 years ago

dkirkby commented 8 years ago

Creation of the full sqlite db from the downloaded FITS file is running very slowly now. I thought this was fixed by #33, where I benchmarked the conversion at 25 minutes for the DR12 spAll. I am seeing this problem with the eBOSS v5_8_0 spAll, which is much smaller than the DR12 spAll. It looks like the time is being spent in this Python loop, which converts each row from FITS to SQL:

      for row in table:
          # Unroll columns with sub-arrays into a flat list to match the flat SQL schema,
          # and convert numpy types to the native python types required by sqlite3.
          values = []
          for j, column_data in enumerate(row):
              if column_data.dtype.kind == 'S':
                  values.append(column_data.rstrip())
              elif isinstance(column_data, np.ndarray):
                  values.extend(column_data.flatten().tolist())
              else:
                  values.append(column_data.item())

Has this code changed since #33, or why is it taking so much longer now? Are there new columns in the eBOSS spAll that could explain this?
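
For anyone who wants to reproduce the conversion step outside bossdata, here is a minimal self-contained sketch of the same unrolling logic. It is not the bossdata implementation: the toy table, the `flatten_row` helper, the `meta` table name, and the `FLUX_0`/`FLUX_1` column names are all made up for illustration, and it assumes Python 3, where byte strings are decoded before being handed to sqlite3.

```python
import sqlite3
import numpy as np

def flatten_row(row, names):
    """Unroll one structured-array row into a flat list of native Python values.

    Hypothetical helper that mirrors the loop quoted above.
    """
    values = []
    for name in names:
        column_data = row[name]
        if column_data.dtype.kind == 'S':
            # FITS pads fixed-width strings with trailing spaces.
            values.append(column_data.rstrip().decode('ascii'))
        elif isinstance(column_data, np.ndarray):
            # Sub-array columns become one SQL column per element.
            values.extend(column_data.flatten().tolist())
        else:
            # Convert a numpy scalar to the matching native Python type.
            values.append(column_data.item())
    return values

# Toy stand-in for the table read from the spAll FITS file (assumed layout).
table = np.array(
    [(b'QSO     ', 2.5, [1.0, 2.0]), (b'GALAXY  ', 0.5, [3.0, 4.0])],
    dtype=[('OBJTYPE', 'S8'), ('Z', 'f8'), ('FLUX', 'f4', (2,))])

rows = [flatten_row(row, table.dtype.names) for row in table]

# Insert the flattened rows into an in-memory database with a flat schema.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE meta (OBJTYPE TEXT, Z REAL, FLUX_0 REAL, FLUX_1 REAL)')
db.executemany('INSERT INTO meta VALUES (?, ?, ?, ?)', rows)
```

Every value passes through a per-element numpy access, so any slowdown in that access is multiplied by the number of rows times the number of columns.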

dkirkby commented 8 years ago

I strongly suspect the underlying problem is due to a recent change in numpy that has already been flagged as a serious performance hit: numpy/numpy#6467. Hopefully this is fixed by numpy/numpy#6208 and makes it into 1.10.2.

dkirkby commented 8 years ago

Update bossdata.meta.create_meta_full to check the numpy version and print a warning if it is 1.10.0 or 1.10.1. Also update the install doc page to warn against using these versions of numpy.
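
A rough sketch of what such a check could look like. The helper names `is_slow_numpy` and `warn_if_slow_numpy` are hypothetical and this is not the code that ended up in `bossdata.meta`; it only compares the version string against the two releases named above.

```python
import warnings
import numpy as np

# Releases reported in this thread as slow for the FITS-to-SQL conversion.
SLOW_NUMPY_VERSIONS = ('1.10.0', '1.10.1')

def is_slow_numpy(version=None):
    """Return True if `version` (default: the installed numpy) is a known-slow release."""
    if version is None:
        version = np.__version__
    return version in SLOW_NUMPY_VERSIONS

def warn_if_slow_numpy(version=None):
    """Emit a warning for a known-slow numpy release and report whether one was found."""
    if is_slow_numpy(version):
        warnings.warn(
            'numpy 1.10.0 and 1.10.1 make the full db build very slow; '
            'use numpy 1.9.3 or 1.10.2 instead.')
        return True
    return False
```

Calling `warn_if_slow_numpy()` at the start of the db build would flag the problem before the slow loop starts.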

dkirkby commented 8 years ago

Try reverting to numpy 1.9.3 using:

conda install numpy=1.9.3
conda install astropy=1.0.4
python setup.py develop

The astropy downgrade is necessary since the current astropy 1.0.5 has numpy 1.10 as a dependency.

With numpy 1.9.3, the following test takes a few minutes to build the db. Before the downgrade it was too slow to measure (many hours?):

bossquery --what PLATE,MJD,FIBER,PLUG_RA,PLUG_DEC,Z --where 'OBJTYPE="QSO"' --verbose

dkirkby commented 8 years ago

@NobleKennamer reports that numpy 1.10.2 is out and fixes our problem:

https://github.com/numpy/numpy/compare/v1.10.2...master

Please add any test results with this new version here...

dkirkby commented 8 years ago

Numpy 1.10.2 is now available via conda so the following will install it:

conda update numpy

After doing this, I renamed my spAll-v5_7_0.db and ran a bossquery to force the db to be regenerated. Both the lite (~3.5 mins) and full (~25 mins) db build times are back to normal!