dkirkby / bossdata

Tools for accessing SDSS BOSS data
MIT License

Support efficient running at sites where data is local #109

Open dkirkby opened 8 years ago

dkirkby commented 8 years ago

The current data access model assumes that a user has rw access to $BOSS_LOCAL_ROOT and that files not already cached must be copied to $BOSS_LOCAL_ROOT. This does not work well on sites that already have most of the data directly visible via the file system, where you want to use this data directly without doing any downloads. This issue is to add support for efficiently taking advantage of local data.

The simplest approach would be to set $BOSS_LOCAL_ROOT, $BOSS_SAS_PATH and $BOSS_REDUX_VERSION so that all files appear to already be cached and no downloads are ever attempted. However, this fails if any files are missing (perhaps unlikely) or when a metadata file needs to be converted to its equivalent sqlite3 file, since that conversion has to write under $BOSS_LOCAL_ROOT.

dkirkby commented 8 years ago

Let's assume that the local data store is complete, so that we raise a RuntimeError if any expected file cannot be found and never try to download missing files via the network.

The user access pattern that we want to preserve is:

remote_path = finder.get_spec_path(plate=4567, mjd=55589, fiber=88, lite=True)
local_path = mirror.get(remote_path)

The finder is configured by $BOSS_SAS_PATH and $BOSS_REDUX_VERSION. With the defaults:

export BOSS_SAS_PATH=/sas/dr12/boss
export BOSS_REDUX_VERSION=v5_7_0

remote_path is:

/sas/dr12/boss/spectro/redux/v5_7_0/spectra/lite/4567/spec-4567-55589-0088.fits

The mirror is configured by $BOSS_DATA_URL and $BOSS_LOCAL_ROOT and first checks if

$BOSS_LOCAL_ROOT/$BOSS_SAS_PATH/$BOSS_REDUX_VERSION/...

exists and, if not, tries to download it from

$BOSS_DATA_URL/$BOSS_SAS_PATH/$BOSS_REDUX_VERSION/...

The new mirror logic we want is to:

  1. Check if the file is available in the read-write $BOSS_LOCAL_ROOT.
  2. Check if the file is available in the read-only local file system encoded in $BOSS_DATA_URL.
  3. Raise a RuntimeError if not found in either place.
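The three steps above could be sketched roughly as follows. This is only an illustration, not the bossdata implementation: the function name `resolve_local_path` and its parameters are hypothetical, and `readonly_root` assumes that $BOSS_DATA_URL has already been reduced to a plain directory on the local file system.

```python
import os


def resolve_local_path(remote_path, local_root, readonly_root):
    """Return an existing local path for remote_path or raise RuntimeError.

    Hypothetical helper, not part of bossdata. local_root plays the role of
    $BOSS_LOCAL_ROOT and readonly_root the local directory assumed to be
    encoded in $BOSS_DATA_URL.
    """
    # Remote paths are absolute (e.g. /sas/dr12/boss/...), so strip the
    # leading slash before joining them onto a local root.
    relative = remote_path.lstrip('/')
    # Step 1 checks the read-write cache, step 2 the read-only data store.
    for root in (local_root, readonly_root):
        candidate = os.path.join(root, relative)
        if os.path.isfile(candidate):
            return candidate
    # Step 3: the data store is assumed complete, so never try to download.
    raise RuntimeError('File not found locally: {0}'.format(remote_path))
```

Checking the read-write root first means that anything written there (such as a converted sqlite3 file) takes precedence over the read-only copy.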

This should use $BOSS_LOCAL_ROOT to write and cache sqlite3 files without any modifications to the meta module. However, meta currently uses the following pattern to convert local paths returned by the mirror to sqlite3 paths:

db_path = local_path.replace('.fits', '.db')

This conversion should instead be delegated to the mirror, which would then translate the read-only path under $BOSS_DATA_URL into a read-write path under $BOSS_LOCAL_ROOT.
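As a rough sketch of that translation, assuming both roots are plain local directories: the helper name `writable_db_path` and its arguments are made up for illustration and do not exist in bossdata.

```python
import os


def writable_db_path(readonly_path, readonly_root, local_root):
    """Map a read-only .fits path to a .db path under the read-write root.

    Hypothetical helper: readonly_root stands in for the local directory
    assumed to be encoded in $BOSS_DATA_URL, local_root for $BOSS_LOCAL_ROOT.
    """
    relative = os.path.relpath(readonly_path, readonly_root)
    # Refuse paths that are not actually located under the read-only root.
    if relative == os.pardir or relative.startswith(os.pardir + os.sep):
        raise ValueError('{0} is not under {1}'.format(readonly_path, readonly_root))
    base, ext = os.path.splitext(relative)
    if ext != '.fits':
        raise ValueError('Expected a .fits file: {0}'.format(readonly_path))
    # Keep the same relative layout so cached .db files mirror the data tree.
    return os.path.join(local_root, base + '.db')
```

Unlike `local_path.replace('.fits', '.db')`, this only touches the file extension, so a directory name that happens to contain `.fits` is left alone.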

dkirkby commented 8 years ago

First step is to change the 4 direct path manipulations in meta, e.g.

db_path = local_path.replace('.fits', '.db')

becomes:

db_path = mirror.local_path_replace(local_path, '.fits', '.db')

dkirkby commented 8 years ago

This still does not quite work since meta calls mirror.local_path in several places. Instead, generalize mirror.local_path so that it can optionally replace the suffix of the returned local path:

db_path = mirror.local_path(remote_path, '.fits', '.db')

This cleans up the meta logic a bit and eliminates the static _db_path_helper().
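A minimal sketch of what such a generalized `local_path` might look like, written here as a standalone function for illustration. The signature, the explicit `local_root` argument and the error handling are assumptions, not the actual bossdata method.

```python
import os


def local_path(remote_path, local_root, old_suffix=None, new_suffix=None):
    """Return the read-write local path for remote_path.

    Sketch only. If old_suffix and new_suffix are both given, the trailing
    old_suffix of the result is swapped for new_suffix (e.g. '.fits' -> '.db').
    """
    if (old_suffix is None) != (new_suffix is None):
        raise ValueError('Specify both old_suffix and new_suffix, or neither.')
    # Remote paths are absolute, so strip the leading slash before joining.
    path = os.path.join(local_root, remote_path.lstrip('/'))
    if old_suffix is not None:
        if not path.endswith(old_suffix):
            raise ValueError('{0} does not end with {1}'.format(path, old_suffix))
        # Replace only the trailing suffix, not every occurrence in the path.
        path = path[:-len(old_suffix)] + new_suffix
    return path
```

Because the result is always built from the read-write root, meta would never see a path inside the read-only data store when it writes a sqlite3 file.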