caseresearch / code-review

⛔️ [DEPRECATED] A repo for code review sessions at CAS. See https://github.com/swincas/code-review

Replacing an Excel Spreadsheet with an SQL database #7

Closed GeoffBryan closed 7 years ago

GeoffBryan commented 7 years ago

The Problem - too many simulations to keep track of!

I run, or rather have run, a lot of radiative transfer simulations (at present ~300), using a code called Hochunk on the Swinburne supercomputer. I'm doing this to model the SEDs of accretion disks around Young Stellar Objects:

[image: inclined-012]

Here are a few observed Spectral Energy Distributions (SEDs):

[image]

Currently I keep track of them using an Excel spreadsheet (which is growing unwieldy) so I was thinking it might be useful to replace this with a small SQL database and a number of python scripts.

Advice wanted!

I'm looking for some advice on how to do this (without going overboard), and I thought it might be a useful thing to talk about at code review, as it's a problem many of us (in a general sense) might come across (and using SQL and python together is likely to be a really useful general skill to pick up).

A Proposal - SQL database plus Python scripts

Here's my proposal:

  1. If need be, install an RDBMS on my iMac or on g2
  2. Design a db schema based on the text parameter file (mctherm.par) used by the simulation code, using a schema design tool such as MySQL Workbench
  3. Create a database on the database server
  4. Create python scripts to:
     a) Add a record of a new simulation to the local/remote DB: add_db_entry.py "this_sub-directory"
        This should search for the parameter file, parse it, check whether it is already in the DB and, if not, add a record, also noting whether the output FITS files are in the directory (i.e. the simulation completed successfully). (A rough sketch of this script follows this list.)
     b) Update the record of an existing simulation in the local/remote DB: update_db_entry.py "this_subdirectory" (i.e. update the record and status once the simulation is complete).
     c) Remove the record of an existing simulation from the DB: remove_db_entry.py "this_subdirectory"
     d) Query whether an existing parameter file matches any in the DB: query_unique_db.py "mctherm.par" (returns NULL if there is no match, or the names and relative directory paths of the simulations that do match).
     e) Query the DB using the user's designated SELECT statements and return a report: query_db query_name.qry output.csv
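A minimal sketch (not a final design) of what script (a) could look like, using Python's built-in sqlite3. The `simulations` table, the `simulations.db` file name and the `parse_par_file()` helper are hypothetical placeholders; a real schema would more likely have one column per parameter, designed in a tool like MySQL Workbench as per step 2:

```python
#!/usr/bin/env python
"""add_db_entry.py -- sketch only: register one simulation sub-directory in the DB.
The `simulations` table, simulations.db and parse_par_file() are hypothetical."""
import glob
import json
import os
import sqlite3
import sys


def parse_par_file(path):
    """Parse 'value = NAME = description' lines into a {NAME: value} dict."""
    params = {}
    with open(path) as fh:
        for line in fh:
            parts = [p.strip() for p in line.split("=", 2)]
            if len(parts) == 3:
                value, name, _description = parts
                params[name] = value
    return params


def main(sim_dir):
    params = parse_par_file(os.path.join(sim_dir, "mctherm.par"))
    finished = bool(glob.glob(os.path.join(sim_dir, "*.fits")))

    conn = sqlite3.connect("simulations.db")
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS simulations (
                       directory TEXT PRIMARY KEY,
                       params    TEXT,
                       completed INTEGER)""")
    # Only add a record if this sub-directory is not already in the DB.
    cur.execute("SELECT 1 FROM simulations WHERE directory = ?", (sim_dir,))
    if cur.fetchone() is None:
        cur.execute("INSERT INTO simulations VALUES (?, ?, ?)",
                    (sim_dir, json.dumps(params), int(finished)))
        conn.commit()
    conn.close()


if __name__ == "__main__":
    main(sys.argv[1])
```

It would be called as `python add_db_entry.py "this_sub-directory"`, matching the proposal above.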

Issues

And here are a few of my issues:

  1. Should I run the database server on my own iMac or on Gstar?
  2. Which RDBMS should I run?
  3. I've installed MySQL on my iMac but what is available on the Swinburne supercomputer?
  4. How do I install a python-db connector library on g2?
  5. Should I include the FITS output files as BLOBs in the database or simply point to their location in the file system?
  6. Are there issues with ports/security if I run this on my iMac?
  7. How can I conveniently query the database and return a formatted report (at least a CSV file) I can easily view/print?
  8. What's the overhead in doing all this? Is it worth it/the right solution?

The reasons why.....

One of the reasons for wanting to do this is to stop me running duplicate simulations – they are expensive in time and resources and I don’t want to re-run things I’ve already done – so it makes sense to find a systematic way of determining whether a set of parameters has already been simulated.

Hence a python query "engine" that allowed me to put in ranges of values for various parameters and see whether there is a match (or not) in the DB would be useful.
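For illustration, a rough sketch of that kind of range query, which also writes the CSV report mentioned in issue 7. The `simulations` table and the `massd`/`rmaxd` columns are hypothetical and assume one column per parameter:

```python
"""query_db.py -- sketch: list stored simulations whose parameters fall inside
given ranges and dump the matches to a CSV report. Table and column names
(simulations, massd, rmaxd) are hypothetical."""
import csv
import sqlite3


def query_ranges(db_path, ranges, out_csv):
    """ranges is e.g. {"massd": (0.005, 0.02), "rmaxd": (100.0, 400.0)}."""
    where = " AND ".join("{0} BETWEEN ? AND ?".format(col) for col in ranges)
    values = [v for bounds in ranges.values() for v in bounds]
    sql = "SELECT directory, massd, rmaxd FROM simulations WHERE " + where

    conn = sqlite3.connect(db_path)
    rows = conn.execute(sql, values).fetchall()
    conn.close()

    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["directory", "massd", "rmaxd"])
        writer.writerows(rows)
    return rows  # an empty list means no existing simulation matches


if __name__ == "__main__":
    matches = query_ranges("simulations.db",
                           {"massd": (0.005, 0.02), "rmaxd": (100.0, 400.0)},
                           "report.csv")
    print("{0} matching simulation(s)".format(len(matches)))
```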

Extra Information

Extra info (a picture of the sub-directory structure, an example of the parameter file) is attached in the next comment from me below.

Cheers, Geoff

GeoffBryan commented 7 years ago

DIRECTORY STRUCTURE

I'm simulating the SED of the object LRLL 31, so the directory containing the simulations on lustre is: LRLL31-puff. There is a separate sub-directory for each simulation as shown in this screen dump:

[image]

PARAMETER FILE

Inside each of these subdirectories is a parameter file called mctherm.par, containing 149 parameters (mctherm.par.txt), which I need to parse and add to the database.

' =================== Parameter file ==================='

' Preliminaries '
200000000 = NP = Number of photons from central star
1000000 = NPMIN = Minimum number of photons for Lucy temperature calculation
0 = NIMIN = Minimum number of iterations (Lucy temperature and/or HSEQ)
5 = NIMAX = Maximum number of iterations (Lucy temperature and/or HSEQ)
NO = CPEEL = YES for peeling, NO for standard MC
YES = CLUCY = YES to use Lucy method for T-corr, NO for Bjorkman & Wood
-2211 = I1 = Random number seed
100000 = IWRITE = output timing information every IWRITE photons
YES = CWARN = YES to print warnings & errors, NO to not.
'../parfiles' = PARDIR = directory containing atmosphere, dust, and filter files
NO = DENSINPUT = YES for reading in density grid, NO for calculating analytic

' Central Source Properties '
YES = CLIMB = YES for limb darkening of central source, NO for not
YES = CPLANCKST = YES to use planck function for star, NO to use stellar atmosphere file
kt05750g+3.5z+0.0.ascii = ATNAME = Stellar photosphere file
2.3 = RSTAR = Stellar radius in Solar radii
5.70e3 = TSTAR = Blackbody temperature of central star (K)
1.6 = MASSC = Mass of central star (for TSC properties)

' Outside Illumination '
NO = COUTILLUM = YES for outside illumination (ISRF), NO for none
1. = ISRF_SCL = Scale factor for the ISRF
2. = ISRF_AV = Extinction (Av) of the ISRF

' Overall Disk Properties '
0.01 = MASSD = Disk mass in solar masses
AU = CRMIN = 'Units of Rmin: Rstar, Rsub, or AU'
RSTAR = CZMIN = 'RSTAR' to set ZMIN at RSTAR, 'RSUB' to scale ZMIN at Rsub, 'R100' to set ZMIN at 100AU
0.2 = FMASSD1 = Fraction of mass in disk 1 (Large grains settled disk)
CYL = CDISKCURV = define disk surfaces/density using spherical radius (SPH) or cylindrical (CYL)
NO = CHSEQ = solve for vertical hydrostatic equilibrium (YES or NO).
1 = ITER_HSEQ = start HSEQ solution after this number of iterations (range 1-NIMAX) (default is 1)
2 = DISK_HSEQ = apply HSEQ to big grain settled disk (1), small grain disk (2), or both (3) (default is 2)

'* GAPs, SPIRALs, and/or WARPs in disk (3-D geometries: set NPG=120 or some other 3-D value) '
NO = CGAPD = gap in disk (YES or NO)
1. = RGAPD1 = Inner disk gap radius (AU)
2. = RGAPD2 = Outer disk gap radius (AU)
NO = CSPIRAL = spiral warps in disk (YES or NO)
3. = PITCH = pitch angle (30 is loose, 5 is tight)
10 = SN = summation parameter that determines width of the arms (2-10)
0.5 = SW = fraction of mass entrained in the arms
10.0 = RSPIRAL1 = radius in AU where spiral arms begin
100.0 = RSPIRAL2 = radius in AU where spiral arms begin
NO = CDISKWARP = inner disk warp (YES or NO)
0.5 = WARPHEIGHT = Scale for disk warp (adds to scaleheight)
11 = WEXP = Exponent for azimuthal disk warp (cos**wexp)
0.1 = WARPLENGTH = radial scale-length for warp (AU, bigger is slower falloff)
YES = CRIMCURVE = curved inner rim wall (YES or NO)
0.002 = RIMCURVELENGTH = radial scale-length for wall curve (AU, bigger is slower falloff)
4. = RIMCURVEHEIGHT = scale for rim wall curve (units of scale height)
5. = RIMCURVEEXP = exponent for curvature of rim (modify in conjunction with rimcurvelength)
YES = CRIMPUFF = puffed-up inner rim (YES or NO)
-0.3 = RIMHEIGHT = scale for inner rim puff (adds to scaleheight)
0.01 = RIMLENGTH = radial scale-length for puff (AU, bigger is slower falloff)
YES = CGAPCURVE = curved inner gap wall (YES or NO)
2.0 = GAPCURVELENGTH = radial scale-length for wall curve (AU, bigger is slower falloff)
1.0 = GAPCURVEHEIGHT = scale for gap wall curve (units of scale height)
6. = GAPCURVEEXP = exponent for curvature of rim (modify in conjunction with gapcurvelength)
YES = CGAPPUFF = puffed-up cavity wall at outer gap radius (YES or NO)
1.0 = GAPHEIGHT = scale for gap puff (adds to scaleheight)
5.0 = GAPLENGTH = radial scale-length for gap puff (AU, bigger is slower falloff)
NO = CMISALIGN = misaligned inner disk (YES or NO)
7. = RADMISALIGN = outer radius of misaligned inner portion of disk (AU)
8. = INCMISALIGN = inclination of the misaligned disk (degrees)

' Large grains settled disk '
www003.par = DUSTNAME(1) = Dust file, thermal grains [disk 1]
draine_opac_new.dat = DUSTNAME(5) = Dust file, <200A grains [disk 1]
0.0 = FSG(1) = Fraction of the mass in <200A grains [disk 1]
0.005 = ZSCALE(1) = Scale height in RSTAR (if CZMIN='RSTAR'), HSEQ (if CZMIN='RSUB'), or AU (if CZMIN='R100') [disk 1]
0.13 = RMIND(1) = Minimum disk radius (unit set in CRMIN above)
300.0 = RMAXD(1) = Disk outer radius in AU
2.25 = A(1) = Disk radial density exponent (~ r^(-a)) [disk 1]
1.25 = B(1) = Disk scale height exponent (~ r^(b)) [disk 1]
0.005 = ZSCALE_GAP(1) = Scale height of gap (use same units as ZSCALE)
2.25 = A_GAP(1) = gap radial density exponent
1.25 = B_GAP(1) = gap scale height exponent
0.0001 = RHOSCALE_GAP(1) = ratio of gap surface density just inside RGAPD2 to value just outside

' Small grains disk '
kmh.par = DUSTNAME(2) = Dust file, thermal grains [disk 2]
draine_opac_new.dat = DUSTNAME(6) = Dust file, <200A grains [disk 2]
0.0 = FSG(2) = Fraction of the mass in <200A grains [disk 2]
0.01 = ZSCALE(2) = Scale height in RSTAR (if CZMIN='RSTAR'), HSEQ (if CZMIN='RSUB'), or AU (if CZMIN='R100') [disk 2]
0.13 = RMIND(2) = Minimum disk radius (unit set in CRMIN above)
300.0 = RMAXD(2) = Disk outer radius in AU
2.25 = A(2) = Disk density exponent (~ r^(-a)) [disk 2]
1.25 = B(2) = Disk scale height exponent (~ r^(b)) [disk 2]
0.01 = ZSCALE_GAP(2) = Scale height of gap (use same units as ZSCALE)
2.25 = A_GAP(2) = gap density exponent
1.25 = B_GAP(2) = gap scale height exponent
0.0001 = RHOSCALE_GAP(2) = ratio of gap surface density just inside RGAPD2 to value just outside

' Radial Exponential cutoff to disk '
NO = CRADEXP = add radial exponential cutoff to disk density (YES or NO)?
50 = RADEXP = scale for exponential cutoff in AU

' Disk/star accretion properties '
YES = CDISKACC = YES to include disk accretion luminosity, NO to not
NO = CALPHA = YES to read alpha disk parameter, NO to read disk Mdot
0.40e-8 = ALPHA/MDOT = Disk alpha parameter (if CALPHA=YES) or accretion rate (if CALPHA=NO)
5.0 = RTRUNC = Magnetosphere co-rotation radius in RSTAR
NO = CSPOT = YES if spots on star, NO if not
1. = SPOTLAT = latitude of spot (degrees)
1 = NSPOT = number of spots (1 or 2)
NO = CTEMP = YES to read spot temperature, NO to read Fractional spot size (computes one from the other)
0.01 = TSPOT/FSPOT = Temperature of spot (CTEMP=YES) or Fractional area of hot spot (CTEMP=NO)

' Envelope Properties '
r400_ice095.par = DUSTNAME(3) = Dust file, thermal grains [envelope]
draine_opac_new.dat = DUSTNAME(7) = Dust file, <200A grains [envelope]
0.0 = FSG(3) = Fraction of the mass in <200A grains [envelope]
1. = RMAX = Maximum envelope radius in AU
0.09 = RMINE = Minimum envelope radius (unit set in CRMIN above)
ULRICH = ENVTYPE = ULRICH (rotational infall) or POWLAW
0.e-8 = RATE = Mass infall rate for ULRICH envelope (solar masses/year)
300.0 = RC = Centrifugal radius for ULRICH envelope (AU)
1.e-14 = RHODENS1 = (fiducial) density at 1 AU, if ENVTYPE=POWLAW
1.5 = ENVEXP = exponent of power law density, if ENVTYPE=POWLAW
1.67e-20 = RHOAMB = Ambient density, floor for envelope density

' option for Fractal density variations '
NO = CFRACTAL = YES to include fractal variations in density, NO to not
0.9 = DENSRATIO = ratio of clumped density to smooth
0,0,0,1 = FRACTMASK = apply to disk1,disk2,envelope,outflow (0=no, 1=yes)
-3144 = IFSEED = fractal density random seed (negative integer)

' GAP in envelope '
NO = CGAPE = gap in envelope (YES or NO)
0.3 = RGAPE1 = Inner envelope gap radius (AU)
46.0 = RGAPE2 = Outer envelope gap radius (AU)
SCALE = CGAPEDENS = scale density in gap (SCALE) or constant density (CONST)
0.001 = FRACTE = gap density ratio (if CGAPEDENS=SCALE)
1.e-15 = RHOGAPE = gap density if CGAPEDENS=CONST

' Bipolar Cavity Properties '
kmh.par = DUSTNAME(4) = Dust file, thermal grains [cavity]
draine_opac_new.dat = DUSTNAME(8) = Dust file, <200A grains [cavity]
0.0 = FSG(4) = Fraction of the mass in <200A grains [cavity]
YES = CHOLE = YES for cavity carved out, o/w NO cavity
POLYN = CSHAPE = POLYN if polynomial-shaped cavity, STREAM if streamline (requires ENVTYPE=ULRICH)
300.0 = RCHOLE = Streamline RC size in AU (if CSHAPE=STREAM, ENVTYPE=ULRICH)
90.000 = THET1 = Opening angle of inner cavity surface
1.5 = EX1 = Cavity shape exponent inner surface, ~ w^ex1
0.0 = Z01 = z-intercept (AU) inner cavity surface at w=0 (can offset either way)
90.000 = THET2 = Opening angle of outer cavity surface (if CSHAPE=POLYN, otherwise uses streamline)
1.5 = EX2 = outer Cavity surface exponent (if CSHAPE=POLYN), ~ w^ex2
0.0 = Z02 = z-intercept (AU) outer cavity surface at w=0 (can offset either way)
0.0 = EXF = Exponent for cavity density power-law: rho=rhoconst1*r^(-exf)
1.67e-20 = RHOCONST1 = Coefficient for inner cavity density distribution
1.67e-20 = RHOCONST2 = Coefficient for outer cavity density distribution

' SG properties '
0.09 = RMIN_SG = minimum radius of <200A grains (units of CRMIN)

' Output Data '
YES = IMFILT = Output images convolved with filters
NO = IMCUBE = Output multi-wavelength data cubes (very cool, try this)
149 = NXIMG = number of pixels in one dimension of square image
1 = NPEEL = Number of peel-off angles for SEDs and Images
1 = NAP = Number of apertures for SEDs
1. = RMAXI = Image half-size in AU
2. = APMIN = Radius of smallest aperture in AU
3. = APMAX = Radius of largest aperture in AU
30.0 = THETE = For CPEEL=YES, theta angle(s) (deg) of high S/N image(s)/SED(s)
0.0 = PHIE = For CPEEL=YES, phi angle(s) (deg) of high S/N image(s)/SED(s)
180 = NMU = Number of theta image/SED bins
1 = NPH = Number of phi image/SED bins
300 = NFREQ = Number of frequencies for images/SEDs

' Advanced '
YES = DIFFUSION = Whether to use the diffusion

' Grid parameters '
400 = NRG = Number of radial cells
197 = NTG = Number of theta cells (use an odd number) (set to 3 for 1-D models).
2 = NPG = Number of phi cells (set to 2 for 1-D or 2-D models).
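Given the `value = NAME = description` layout above, the duplicate check in script (d), query_unique_db.py, could be sketched roughly as follows. It reuses the hypothetical `parse_par_file()` helper and `simulations` table from the sketch earlier in the thread, with the parameters stored as a JSON blob:

```python
"""query_unique_db.py -- sketch: report which stored simulations have exactly
the same parameter set as a candidate mctherm.par file."""
import json
import sqlite3
import sys

from add_db_entry import parse_par_file  # hypothetical module from the earlier sketch


def matching_directories(db_path, par_path):
    wanted = parse_par_file(par_path)
    conn = sqlite3.connect(db_path)
    matches = [directory
               for directory, stored in conn.execute(
                   "SELECT directory, params FROM simulations")
               if json.loads(stored) == wanted]
    conn.close()
    return matches or None  # None stands in for the NULL return in the proposal


if __name__ == "__main__":
    print(matching_directories("simulations.db", sys.argv[1]))
```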

COMPLETION STATUS

When the simulation completes, one or more FITS files are produced, as well as a log file. I can use the existence of these files (and their contents) to update the records in the database to say that the simulation is: A - not yet run, B - completed successfully, or C - completed but failed.
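A small sketch of how that status check might be automated. The status letters follow the description above, but the log file pattern and the rule for "completed but failed" are assumptions; a real version would also inspect the log/FITS contents:

```python
import glob
import os


def simulation_status(sim_dir):
    """Return 'A' (not yet run), 'B' (completed successfully) or
    'C' (completed but failed) for one simulation sub-directory."""
    has_log = bool(glob.glob(os.path.join(sim_dir, "*.log")))    # assumed log file pattern
    has_fits = bool(glob.glob(os.path.join(sim_dir, "*.fits")))
    if not has_log and not has_fits:
        return "A"
    if has_fits:
        return "B"
    return "C"  # log present but no FITS output: ran, but did not finish cleanly
```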

manodeep commented 7 years ago

I have never created an SQL database. Maybe @Psykar, @amraziz8, @gpoole, @macrocosme know?

macrocosme commented 7 years ago

I'll try to go through your points here. @amraziz8 may want to clarify some details from a gSTAR standpoint.

  1. If it's small, your machine can be fine. However, you seem to mention that it's growing quite drastically, so perhaps it may be better to start a DB via ITS.
  2. If you know MySQL already, I would say, go for it!
  3. MySQL is available, and I think postgres is available now too.
  4. I would say to go with sqlalchemy.
  5. I would point to their location.
  6. Not impossible! I will let Amr comment on this!
  7. SQLAlchemy. Or, MySQL Workbench.
  8. You'll have to get used to the details and set up/conceptualize your database. Once all is set up, though, you'll have quick access to all your results and be able to run some nice analytics on them.

Hope this helps!
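(For reference, a minimal sketch of the SQLAlchemy route suggested in point 4; the connection URL, database name, and table are hypothetical.)

```python
from sqlalchemy import create_engine, text

# Hypothetical MySQL database; swap the host for g2/gSTAR if the DB lives there.
engine = create_engine("mysql+pymysql://user:password@localhost/simdb")

with engine.connect() as conn:
    for row in conn.execute(text("SELECT directory FROM simulations")):
        print(row[0])
```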

manodeep commented 7 years ago

From chatting with @GeoffBryan, it seems that the issue is resolved. Closing.