Database for storing run status

guillermo-carrasco commented 9 years ago

It would be nice to have a small and simple database just to save the status of the runs, e.g. SEQUENCING, ARCHIVING, ARCHIVED, etc.

The idea is to implement it in such a way that the database backend is abstract/pluggable. Basically, define an API (probably in the Run class) with two main functions: get_run_status() and set_run_status(status).
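A minimal sketch of that idea, assuming a hypothetical in-memory backend standing in for a real database. The class names, the backend methods, and the run ID below are illustrative and do not exist in TACA:

```python
class InMemoryBackend:
    """Hypothetical stand-in for a real database backend."""

    def __init__(self):
        self._statuses = {}  # run ID -> latest status

    def get_status(self, run_id):
        return self._statuses.get(run_id)

    def set_status(self, run_id, status):
        self._statuses[run_id] = status


class Run:
    """Simplified Run that only knows its ID and a status backend."""

    def __init__(self, run_id, backend):
        self.run_id = run_id
        self._backend = backend

    def get_run_status(self):
        # Delegate to whichever backend was plugged in.
        return self._backend.get_status(self.run_id)

    def set_run_status(self, status):
        self._backend.set_status(self.run_id, status)


backend = InMemoryBackend()
run = Run("150506_EXAMPLE_RUN", backend)  # made-up run ID
run.set_run_status("SEQUENCING")
run.set_run_status("ARCHIVING")
current_status = run.get_run_status()
```

Because Run only talks to the two backend methods, swapping the in-memory class for a database-backed one would not change the Run API.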

@vezzi you can use this issue to discuss implementation and/or define status.

guillermo-carrasco commented 9 years ago

Check this piece in the ngi_pipeline to dynamically load modules.
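The linked ngi_pipeline code is not reproduced here. As a generic sketch of dynamic loading with the standard library, assuming a hypothetical convention where each backend lives in a module named after it:

```python
import importlib


def load_backend_module(backend_name, package=None):
    """Import the module for a backend by name, or return None if it is missing.

    With package="taca.db" (a hypothetical layout), backend_name "couchdb"
    would resolve to the module "taca.db.couchdb".
    """
    module_name = f"{package}.{backend_name}" if package else backend_name
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError as well, which is a subclass.
        return None


# Demonstrated with standard-library modules so the snippet runs anywhere.
json_module = load_backend_module("json")
missing_module = load_backend_module("no_such_backend_module")
```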

guillermo-carrasco commented 9 years ago

More notes: Try to make this not mandatory, i.e. TACA will try to upload run status to a backend database only if such a backend is defined in the configuration file, e.g.

db_backend: couchdb
db_credentials:
    user:
    password:    
    url:
    port:

If it is not defined, TACA shouldn't crash, but only log a WARNING:

No backend database defined, not updating run status
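A rough sketch of that behaviour, assuming the configuration file has already been parsed into a dictionary with the keys shown above. The function names and the logger name are made up for illustration:

```python
import logging

logger = logging.getLogger("taca.db")  # hypothetical logger name


def get_backend_name(config):
    """Return the configured backend name, or None after logging a warning."""
    backend = config.get("db_backend")
    if not backend:
        logger.warning("No backend database defined, not updating run status")
        return None
    return backend


def update_run_status(config, run_id, status):
    """Report whether an update would be attempted; never raises on missing config."""
    backend = get_backend_name(config)
    if backend is None:
        return False
    # A real implementation would hand run_id and status to the backend here.
    logger.info("Would set %s to %s using %s", run_id, status, backend)
    return True
```

With an empty configuration, update_run_status returns False and the only side effect is the warning.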

robinandeer commented 9 years ago

Some comments:

I think it sounds like a very nice idea to add such functionality to TACA - seems like the right level
I guess you can handle the pluggability also through entry points instead of a custom solution
We are thinking about something similar and one problem has been the different levels we are working on; Run/Flowcell, Lane, Sample, Family, Project... This could get complicated fast and perhaps change over time if for example you back up VCFs instead of fastq-files
If you want to get real fancy, you could default to a SQLite database/YAML file if a proper backend isn't set up

EDIT: Is it possible to create this sub-database so general that both NGI and Clinical could make use of it? I'll try to think about what we would need to store to make it work for us.

guillermo-carrasco commented 9 years ago

Thanks for the comments @robinandeer !

I guess you can handle the pluggability also through entry points instead of a custom solution

I guess it's a trade-off between CLI cleanness and abstraction, isn't it? For example, if we decide to do this through entry points we still have to implement the different backends, it's just that we also have to implement the CLI part, so it would be something like taca storage cleanup --backend couchdb or similar. Deducing the backend from the YAML file instead frees us from writing that --backend couchdb.

You may argue that we are doing precisely that for archiving, i.e. taca storage archive --backend swestore. However, I think that's a different case: you are not choosing a technology backend but an end "physical" place to put your data, so it's good to be explicit in this case.

We are thinking about something similar and one problem has been the different levels we are working on; Run/Flowcell, Lane, Sample, Family, Project... This could get complicated fast and perhaps change over time if for example you back up VCFs instead of fastq-files

I don't see how this could affect the idea behind this issue. What we want (at least for now) is a very simple status, e.g. DEMULTIPLEXING, ARCHIVING, etc. Does it matter the level you're working at?

If you want to get real fancy, you could default to a SQLite database/YAML file if a proper backend isn't set up

We thought about that as well. The problem is that we don't want a local database, because what we want is something that helps the NASs and the processing machines communicate. For example, remove a run in preproc1 only if it has been archived in swestore (which is done on the NASs).
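The removal rule in that example could be sketched as follows, assuming a status lookup function backed by the shared database. Here a plain dictionary plays that role, and the run names are invented:

```python
def can_remove_run(run_id, get_run_status):
    """Allow removal only when the shared database says the run is archived."""
    return get_run_status(run_id) == "ARCHIVED"


# Stand-in for the shared database that both the NASs and preproc1 can reach.
shared_statuses = {"run_A": "ARCHIVED", "run_B": "ARCHIVING"}

removable = [
    run_id
    for run_id in sorted(shared_statuses)
    if can_remove_run(run_id, shared_statuses.get)
]
```

A run with no recorded status is treated as not removable, which is the conservative choice for a cleanup step.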

Is it possible to create this sub-database so general that both NGI and Clinical could make use of it? I'll try to think about what we would need to store to make it work for us.

That was the idea, so good you're in for that ^^! We could define a simple API on the db package, that would then instantiate the correct submodule depending on the backend.

First set of methods proposed for the API:

get_run_status(run): Get the status of a particular run
set_run_status(run, status): Set the status of a particular run
get_processing_runs(): Get a list of all currently processing (demultiplexing) runs
get_archiving(): Get a list of all currently archiving runs
get_archived(): Get a list of all archived runs

Thanks again! Let's keep discussing this ^^

robinandeer commented 9 years ago

if we decide to do this through entry points ... we also have to implement the CLI part

This is not necessary or else I'm misunderstanding you perhaps :sweat_smile: - You could still deduce the backend from a YAML file! But this isn't such a big deal I guess. If you're interested you can just ask and I'll explain it further
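To illustrate the point, here is a sketch that combines packaging entry points with a backend name read from the configuration, so no --backend option is needed. The group name "taca.db_backends" is hypothetical, and the configuration is assumed to be an already parsed dictionary:

```python
from importlib import metadata

BACKEND_GROUP = "taca.db_backends"  # hypothetical entry point group name


def find_backends(group=BACKEND_GROUP):
    """Map entry point names to the entry points registered under the group."""
    entry_points = metadata.entry_points()
    if hasattr(entry_points, "select"):  # Python 3.10 and later
        selected = entry_points.select(group=group)
    else:  # older versions return a plain mapping of group -> entry points
        selected = entry_points.get(group, [])
    return {ep.name: ep for ep in selected}


def load_backend(config, group=BACKEND_GROUP):
    """Load the backend named in the config, or return None if unavailable."""
    name = config.get("db_backend")
    available = find_backends(group)
    if name not in available:
        return None
    # load() imports the object the plugin package registered under this name.
    return available[name].load()
```

A plugin package would register its backend class under that group in its own packaging metadata, and TACA itself would not need to know which plugins exist.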

Does it matter the level you're working at?

I see. So we are thinking that we want to back up on a per-sample level from now on. We will generate FASTQ files and just get rid of BCLs in the near future. I guess my question was whether the system will be flexible enough to handle this sort of thing?

That was the idea, so good you're in for that ^^!

Awesome! I will clue in busy, busy @ingkebil to give his opinion on the practical aspects :smile:

guillermo-carrasco commented 9 years ago

@robinandeer and I had a discussion about this yesterday. We tried to reach a "consensus" on a design that would work for both NGI and Clinical Genomics (CG). Here is roughly what we talked about:

API design: We need a set of API calls that can work for both of us. CG has a different logic for archiving data long term: they'll wait to get all the runs for a sample and then archive at sample level. We, instead, archive at run/flowcell level. This means that we can archive as soon as the run is finished, whilst CG has to wait until all the info for a sample is available. @robinandeer is that accurate?

This shouldn't matter for the calls that we already proposed, which are (modified to fit the new design):

get_latest_event(entity): Return the latest event that happened to an entity (see database design)
set_event(entity, event): Append an event to an entity
get_processing(entity): Get all entities whose last event status is processing
get_archiving(entity): Get all entities whose last event status is archiving
get_archived(entity): Get all entities that contain an event whose status is archived (not necessarily the last event)

Where entity would be run/flowcell for us, sample for CG.
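The semantics of these calls can be sketched with a toy in-memory store. This is an illustration only: the class name is invented, the listing calls take no argument here as a simplification, and the "removed" event is a made-up example of something that can follow "archived":

```python
from collections import defaultdict


class InMemoryEventStore:
    """Toy event store; a real backend would persist events in a database."""

    def __init__(self):
        self._events = defaultdict(list)  # entity -> ordered list of event names

    def set_event(self, entity, event):
        self._events[entity].append(event)

    def get_latest_event(self, entity):
        events = self._events.get(entity)
        return events[-1] if events else None

    def _with_latest(self, event):
        # Entities whose most recent event equals the given one.
        return [e for e, evs in self._events.items() if evs and evs[-1] == event]

    def get_processing(self):
        return self._with_latest("processing")

    def get_archiving(self):
        return self._with_latest("archiving")

    def get_archived(self):
        # Any "archived" event counts, not only the latest one.
        return [e for e, evs in self._events.items() if "archived" in evs]


store = InMemoryEventStore()
for event in ("processing", "archiving", "archived", "removed"):
    store.set_event("flowcell_1", event)
store.set_event("flowcell_2", "processing")
```

Here flowcell_1 still shows up in get_archived() even though its latest event is "removed", which is the difference between the last three calls.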

@robinandeer , @vezzi , @senthil10 any API call that you can immediately think of? We can always add more later.

Database design: Instead of saving a single state, we (Robin, with my total support) suggest saving an array of events per entity. This is good for traceability and does not take much effort. It would look something like this:

[Screenshot, 2015-05-06: example database entry]

@vezzi do we (you...) plan to replace the flowcells database in the near future? Because this could definitely be the new place to put all the relevant info, uploaded by TACA from the preprocessing servers.

Code design: We need to make the database API agnostic of the backend. This needs further thinking and design, but the idea is, in pseudo-ish code:

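import logging

logger = logging.getLogger(__name__)


class TACADB(object):
    """Base class for TACA database.

    Takes care of reading credentials from configuration and instantiating
    the correct backend.
    """
    def __init__(self):
        # Read config, detect backend.
        # NOTE: instantiate_backend, backend, config and WhateverError are
        # still placeholders to be defined by the real implementation.
        try:
            self.backend = instantiate_backend(backend, config)
        except WhateverError:
            self.backend = None
            logger.error('Could not load backend database, not updating run status')

    def get_latest_event(self, entity):
        # To be implemented by each backend
        raise NotImplementedError


class CouchDBBackend(TACADB):
    """CouchDB backend for TACA."""
    def __init__(self, **config):
        # 1. Create connection with database
        # 2. Check "schema" or database
        # 3. Implement API calls
        self.config = config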

The idea behind this is that it should be fairly easy to add backends to TACA; so we in NGI can develop the one for CouchDB and CG can develop one for... is it MySQL?
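One possible way to make adding backends easy is an abstract interface plus a small registry. This is a sketch of a variation, not the agreed design: it separates the factory from the base class, and the names StatusBackend, DictBackend and BACKENDS are hypothetical. A real CouchDB or MySQL backend would be registered the same way as the dictionary-based one used here:

```python
from abc import ABC, abstractmethod


class StatusBackend(ABC):
    """Interface that every backend would have to implement."""

    @abstractmethod
    def set_event(self, entity, event):
        """Append an event to an entity."""

    @abstractmethod
    def get_latest_event(self, entity):
        """Return the latest event of an entity, or None."""


class DictBackend(StatusBackend):
    """Dictionary-based backend, useful only for tests and illustration."""

    def __init__(self, **config):
        self.config = config  # credentials would be used by a real backend
        self._events = {}

    def set_event(self, entity, event):
        self._events.setdefault(entity, []).append(event)

    def get_latest_event(self, entity):
        events = self._events.get(entity, [])
        return events[-1] if events else None


# Registry mapping the configured backend name to its class.
BACKENDS = {"dict": DictBackend}


def instantiate_backend(name, config):
    """Create the backend registered under name, passing it the config."""
    try:
        backend_class = BACKENDS[name]
    except KeyError:
        raise ValueError(f"Unknown database backend: {name!r}") from None
    return backend_class(**config)


db = instantiate_backend("dict", {"url": "localhost", "port": 5984})
db.set_event("flowcell_1", "archiving")
latest = db.get_latest_event("flowcell_1")
```

Adding a backend then means writing one subclass and adding one entry to the registry.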

I would like others' opinions before moving on! Otherwise we'll have fun on our own ^^

pekrau commented 9 years ago

In my opinion, the "date" entry for an event should always have high resolution, at least to the second, and it doesn't cost much to also store down to millisecond. For debugging purposes and potential future analytics, high-res temporal data is required, and just having date is not good enough. To reflect this, call the field "timestamp".

Also, to avoid complications with daylight savings and timezones, the timestamp should always be in UTC, and be stored explicitly as such, to avoid future confusion. E.g. "2015-04-15T14:11:54.725Z"
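A timestamp in that form can be produced with the standard library alone. A small sketch, where the function name is made up:

```python
from datetime import datetime, timezone


def utc_timestamp(moment=None):
    """Format a moment (default: now) as ISO 8601 UTC with millisecond precision.

    Aware datetimes are converted to UTC; naive datetimes are interpreted
    as local time by astimezone().
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    moment = moment.astimezone(timezone.utc)
    # isoformat() writes the UTC offset as "+00:00"; store it as "Z" instead.
    return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")


example = utc_timestamp(datetime(2015, 4, 15, 14, 11, 54, 725000, tzinfo=timezone.utc))
# example == "2015-04-15T14:11:54.725Z"
```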

robinandeer commented 9 years ago

Database design: The way things are stored in the backend isn't so important for us to unify, I guess.

API design: I've started mocking up a class interface plugin: https://github.com/Clinical-Genomics/taca-clinstatdb/blob/master/taca_clinstatdb/api.py

I've made some new suggestions for what the methods should be named that anyone can comment on!

I agree with @pekrau about the dates but I guess we hadn't gotten to the details yet - super!

EDIT: changed link to point to actual plugin module

guillermo-carrasco commented 9 years ago

@pekrau absolutely, that screenshot is only a manually written database entry, totally agree on the date format, ISO format #FTW

@robinandeer excellent!

senthil10 commented 9 years ago

+1 for @pekrau's suggestion, and I also have a few questions :)

Since this DB is in Couch, I assume all the API calls such as get_latest_event() and set_event() would be coded in a CouchDB-specific way? So why have them separately, since we already have the StatusDB repo for that? Can't we just add these as new connection methods for the new DB (runs)?

guillermo-carrasco commented 9 years ago

@senthil10 StatusDB is only the name we give to our instance of CouchDB, so when I say couchdb I mean statusdb. But yes, I want a completely separate database; I don't want to add more stuff to the flowcells database, so I created taca_flowcells just for testing. We can call it whatever, I don't mind as long as it's independent :)

We can discuss the implementation, but I don't want to start adding dependencies if they're not 100% needed.

vezzi commented 9 years ago

@guillermo-carrasco and @robinandeer I really like it. I am really fond of solutions that can be used by both Clinical and NGI at the same time; they are the key to optimising the limited human resources we have.

The API calls seem OK to me; once we start implementing them, it will be natural to find new ones.

About replacing the FC db... the plan with @Galithil, for now, is to check how we can add HiSeqX FCs to the flowcellDB, or whether it is better to create a new DB. Anyway, I do not see a real need to move the old FC DB here; it would contain the same data as now plus status info... On the other hand, the risk is that we end up using the status FC name as a key to access the FC db, creating an external key, which is exactly how a non-relational DB should not be used...

Anyhow, the discussion on the FC database needs to be put on hold for a while; we first need to understand what will happen with HiSeqX FCs.

SciLifeLab / TACA

Database for storing run status #102