cgat-developers / cgat-flow

cgat-flow repository
MIT License
13 stars 9 forks source link

Database name #34

Closed IanSudbery closed 6 years ago

IanSudbery commented 6 years ago

Pipelines mapping, bamstats and genesets still have setting

database:
    name: csvdb

when the code is now looking for database_url and this should be a URL rather than a path.

sebastian-luna-valero commented 6 years ago

Hi Ian,

It looks like Database.py works with both for forward compatibility. It uses sqlite3 for path and sqlalchemy for url.

Most of the code in cgat-flow connect with sqlite3 directly:

$ grep -n 'database_name' -r CGATPipelines | egrep -v 'Binary'
CGATPipelines/PipelineIntervalAnnotation.py:124:    dbhandle = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/PipelineIntervals.py:2611:    dbhandle = sqlite3.connect(P.get_params()["database_name"])
CGATPipelines/pipeline_exome.py:1483:    database = PARAMS["database_name"]
CGATPipelines/pipeline_rnaseqdiffexpression.py:417:    dbh = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/pipeline_bamstats.py:202:    dbh = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/PipelineGO.py:69:    database_name
CGATPipelines/PipelineGO.py:98:    dbh = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/PipelineGO.py:446:    dbhandle = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/PipelineEnrichment.py:306:    dbhandle = sqlite3.connect(P.get_params()["database_name"])
CGATPipelines/PipelineEnrichment.py:544:    dbhandle = sqlite3.connect(P.get_params()["database_name"])
CGATPipelines/pipeline_splicing.py:174:    dbh = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/pipeline_peakcalling.py:1014:                                     PARAMS["database_name"])
CGATPipelines/pipeline_peakcalling.py:1085:                                     PARAMS["database_name"])
CGATPipelines/pipeline_mapping.py:274:    dbh = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/pipeline_genesets.py:372:    dbh = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/pipeline_motifs.py:254:    dbh = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/pipeline_exome_cancer.py:175:    dbh = sqlite3.connect(PARAMS["database_name"])
CGATPipelines/PipelineGeneset.py:688:    dbhandle = sqlite3.connect(PARAMS["database_name"])

Just found a couple that uses Database.py in cgat-core:

$ grep Database -r CGATPipelines | egrep -v '__pycache__' | grep import
CGATPipelines/PipelineGtfsubset.py:import CGATCore.Database as Database
CGATPipelines/PipelineRnaseq.py:import CGATCore.Database as Database
CGATPipelines/pipeline_peakcalling.py:import CGATCore.Database as DB

$ grep Database.connect -r CGATPipelines
CGATPipelines/PipelineGtfsubset.py:    dbhandle = Database.connect(url="mysql://{user}@{host}/{database}".format(**locals()))
CGATPipelines/configuration/pipeline.ini:# Database connection parameters  
CGATPipelines/configuration/pipeline.yml:# Database connection parameters  

$ grep 'DB\.' -r CGATPipelines                    
CGATPipelines/PipelineReadqc.py:    parser = CSV2DB.buildParser()
CGATPipelines/PipelineReadqc.py:            CSV2DB.run(inf, options)
CGATPipelines/PipelineReadqc.py:        CSV2DB.run(inf, options)
CGATPipelines/pipeline_peakcalling.py:    insert_size = DB.fetch_DataFrame("SELECT * FROM insert_sizes",
CGATPipelines/pipeline_peakcalling.py:    insert_size = DB.fetch_DataFrame("SELECT * FROM insert_sizes",

I think we should encourage new pipelines built on top of cgat-core to use the Database.py module with url, but I am inclined to save time and not refactor the existing code.

Please let me know your thoughts.

Best regards, Sebastian

IanSudbery commented 6 years ago

The most common use of the database is CGATCore.Pipeline.Database.load, which uses database_url.

sebastian-luna-valero commented 6 years ago

I was expecting the changes in https://github.com/cgat-developers/cgat-flow/pull/37 to fail Jenkins tests but it does not. I am confused. Are these changes the solution?

IanSudbery commented 6 years ago

I think so? The problem is that at the moment there are hard coded defaults for these parameters of database_name=./csvdb and database_url=sqlite:///./csvdb, so the problem only manifests if you need to use some value other than these.

sebastian-luna-valero commented 6 years ago

I agree. If you can provide a reproducible example where it fails, I am happy to help and fix it.

sebastian-luna-valero commented 6 years ago

Please re-open if you find an example to reproduce the issue.