caseresearch / code-review

⛔️ [DEPRECATED] A repo for code review sessions at CAS. ⛔️ [DEPRECATED] See
https://github.com/swincas/code-review
MIT License
0 stars 41 forks source link

Create Replacing an Excel Spreadsheet with an SQL database #6

Closed GeoffBryan closed 7 years ago

GeoffBryan commented 7 years ago

Hi Manodeep, Dany,

                         **Replacing an Excel Spreadsheet with an SQL database**

 I run or have run lots of radiative transfer code simulations, using a code called Hochunk on the super-computer. Currently I keep track of them using an Excel spreadsheet which is growing unwieldy so I was thinking it might be useful to replace this with a small SQL database. 

So I was looking for some advice on how to do this (without going overboard) and I thought that it might be a useful thing to talk about at code review as it’s a problem many of us might come across (and using SQL and python together his likely to be a really useful general skill to pick-up).

So what do I want to be able to do is:

  1. Create a database on a database server – should I run this on my own iMac or on Gstar? - I know how to do this on my Mac (using mysql) but what’s available on the cluster?
  2. Design a db schema based on the text parameter files used by the simulation code using a schema design tool such as the SQL workbench for MySQL
  3. Write a python script that I point at subdirectory on g2 where I have a particular simulation. Eg adddbentry this directory to add an entry to the local/remote database. This should search for the parameter file, parse it, check to see if it is a new entry in the db and then if it isn’t add a record to the DB and also record whether the output Fits files are in the directory (I.e. The simulation completed successfully) - but to do this I need a Python-DB connector library where do I find this?

Should I include the fits output files as a BLOBs in the database or simply point to it’s location in the file system? Conveniently query the database and return a formatted report (at least a CSV file) I can easily view/print. One of the reasons for wanting to do this is to stop me running duplicate simulations – they are expensive in time and resources and I don’t want to re-run things I’ve already done – so it makes sense to find a systematic way of determining whether a set of parameters has already been simulated. So if I had query that allows me to put in ranges of values and for various parameters and see if there is a match (or not) in the DB would be useful.

The other thing worth discussing is the overhead involved in learning how to do this and at what point does it become useful to do.

Just a thought,

Cheers,

       Geoff
manodeep commented 7 years ago

Oops. Sorry - I mean to open a new issue. My apologies