galaxyproject / ephemeris

Library for managing Galaxy plugins - tools, index data, and workflows.
https://ephemeris.readthedocs.org/

run-data-managers execution order dependencies based on dbkeys #113

Open ieguinoa opened 5 years ago

ieguinoa commented 5 years ago

Hi,

I have a situation that I'm not sure how to solve using ephemeris. I'm trying to run some data managers that install data associated with a genome build (a unique dbkey) and build indexes based on it. I mainly have two types of "source" data: the genome file and the transcriptome file.
Both load an entry into the all_fasta table, which is done by running the data_manager_fetch_genome_all_fasta_dbkey data manager. Since both files belong to the same build, they should be associated with the same dbkey. My first idea was therefore to first list the data manager run that installs the genome file and creates the new dbkey, and then run the same data manager to install the transcriptome file associated with the newly created dbkey. The problem is that if I list these two data manager jobs in one YAML file, both are run simultaneously: the first creates the dbkey, but by the time the second runs, the entry in the dbkeys table is not there yet, so it fails with "dbkey not found".

An example:

genomes:
    - genome_id: Ricinus_communis_JCVI_1.0
      name: Ricinus communis JCVI 1.0
      build_id: Ricinus_communis_JCVI_1.0
      genome: ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_03/Genomes/rco.con.gz
      all_tx_transcriptome: ftp://ftp.psb.ugent.be/pub/plaza/plaza_public_dicots_03/Fasta/transcripts.all_transcripts.rco.fasta.gz

data_managers:
 ## Load the genome fasta
    - id: toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/data_manager_fetch_genome_all_fasta_dbkey/0.0.2
      params:
        - 'dbkey_source|dbkey_source_selector': 'new'
        - 'dbkey_source|dbkey': '{{ item.genome_id }}'
        - 'dbkey_source|dbkey_name': '{{ item.name }}'
        - 'reference_source|reference_source_selector': 'url'
        - 'reference_source|user_url': '{{ item.genome }}'
        - 'sequence_name': '{{ item.name }}'
        - 'sequence_id': '{{ item.build_id }}'
        - 'sorting|sorting_selector': 'as_is'
      items: "{{ genomes }}"
      data_table_reload:
        - all_fasta
        - __dbkeys__

 ## Load the transcriptome fasta
    - id: toolshed.g2.bx.psu.edu/repos/devteam/data_manager_fetch_genome_dbkeys_all_fasta/data_manager_fetch_genome_all_fasta_dbkey/0.0.2
      params:
        - 'dbkey_source|dbkey_source_selector': 'new'
        - 'dbkey_source|dbkey': '{{ item.genome_id }}'
        - 'reference_source|reference_source_selector': 'url'
        - 'reference_source|user_url': '{{ item.all_tx_transcriptome }}'
        - 'sequence_name': '{{ item.name }} - Transcripts'
        - 'sequence_id': '{{ item.build_id }}_all_tx_transcriptome'
        - 'sorting|sorting_selector': 'as_is'
      items: "{{ genomes }}"
      data_table_reload:
        - all_fasta

I was wondering if there is a way to set this kind of precedence when running data managers in ephemeris, somehow similar to what already happens with the fasta files and the indexes built from them. Or, what would be good practice for this situation? So far I split the YAML file into two parts: one for the genome fasta (which also creates the dbkey) and the indexes based on it, and one for the transcriptome fasta and the indexes based on it.
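
For concreteness, the split workaround amounts to running ephemeris twice in sequence, something like this (the file names and server details are hypothetical; the flags are the usual run-data-managers connection options):

# Pass 1: create the dbkeys, load the genome fasta and build its indexes.
run-data-managers --config genomes_and_dbkeys.yaml -g https://galaxy.example.org -a <api-key>
# Pass 2: load the transcriptome fasta and its indexes against the now-existing dbkeys.
run-data-managers --config transcriptomes.yaml -g https://galaxy.example.org -a <api-key>

Because the second command only starts after the first has finished, the dbkey is guaranteed to exist by then.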

Cheers, Ignacio

rhpvorderman commented 5 years ago

I have never had this use case, but I believe it can be fixed quite easily. If I understand correctly, the dbkey-creating items need to be run first, always. Is that correct, @ieguinoa?

ieguinoa commented 5 years ago

Yes, the solution would be to identify which data manager calls try to create a new dbkey and run those first. In my case I always use data_manager_fetch_genome_all_fasta_dbkey with the parameter 'dbkey_source|dbkey_source_selector': 'new' to create the dbkeys. Although I think this is more or less the standard way, other data managers for creating dbkeys may exist. Waiting for the jobs that match this pattern of data manager with dbkey_source=new could be a good approach, but maybe not a definitive solution.
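
As a rough sketch, such calls could be detected from the parsed YAML with a check like this (just an illustration, the helper name is made up; it assumes the params layout from my example above):

def creates_new_dbkey(data_manager_entry):
    # Each item in 'params' is a single-key dict, e.g.
    # {'dbkey_source|dbkey_source_selector': 'new'}
    for param in data_manager_entry.get('params', []):
        if param.get('dbkey_source|dbkey_source_selector') == 'new':
            return True
    return False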

Thanks for the help! Ignacio

bgruening commented 5 years ago

@ieguinoa this should actually work. Have a look here: https://github.com/bgruening/idc/blob/master/idc-workflows/ngs.yaml#L13

It runs multiple data managers in a row, reloading the tables in between.

rhpvorderman commented 5 years ago

@bgruening The table-reloading function is actually deprecated. We now require the watch_tool_data_dir setting in galaxy.yml (.ini) to be set to true.

The implementation has changed: run-data-managers now first runs all the managers that populate the all_fasta table (as determined from the data_table_reload field), and then runs all the other data managers. The code is here. It is quite modular already, so this should be fixable. I cannot promise to immediately start working on this, though.
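
Roughly, the current behaviour boils down to a partition like this (a simplified sketch, not the actual ephemeris code):

def split_source_and_index_jobs(data_managers):
    # Entries whose data_table_reload field mentions all_fasta are
    # treated as source jobs and run first; everything else runs after.
    sources = [dm for dm in data_managers
               if 'all_fasta' in dm.get('data_table_reload', [])]
    indexers = [dm for dm in data_managers
                if 'all_fasta' not in dm.get('data_table_reload', [])]
    return sources, indexers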

ieguinoa commented 5 years ago

The data_table_reload field actually indicates that the table should be reloaded after that job is run, but the following jobs are started without waiting for the first one to finish, so the dbkey dependency is still there.

@rhpvorderman I can help with this. I mainly wanted to be sure that the approach I mentioned is correct: identifying which jobs within the source set (i.e. those that modify the all_fasta data table) introduce a new dbkey, and then running those first: new_dbkey -> other source jobs -> indexing jobs.
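
Building on the sketches above (same caveats: hypothetical helper names, not the real ephemeris code), the proposed ordering would look like:

def three_phase_order(data_managers):
    # Phase 1: source jobs that create a new dbkey.
    # Phase 2: remaining source jobs (reusing an existing dbkey).
    # Phase 3: indexing jobs.
    sources, indexers = split_source_and_index_jobs(data_managers)
    dbkey_creators = [dm for dm in sources if creates_new_dbkey(dm)]
    other_sources = [dm for dm in sources if not creates_new_dbkey(dm)]
    return [dbkey_creators, other_sources, indexers]

Each phase would be run to completion before the next one starts.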

rhpvorderman commented 5 years ago

@ieguinoa Can you try version 0.8.0? It should start all the jobs in the order of the YAML. The new feature that starts as many jobs as possible in parallel was added in 0.9.0.

ieguinoa commented 5 years ago

I tried it already but, if I remember correctly, it was still failing because the first job may not finish in time for the next one to use the dbkey it created. Pretty much what happened with the source/indexing jobs before 0.9.

rhpvorderman commented 5 years ago

Did you follow the recommendations from the manual?

This functionality depends on the “watch_tool_data_dir” setting in galaxy.ini to be True. Also, if a new data manager is installed, Galaxy needs to be restarted in order for its tool_data_dir to be watched.
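
For reference, the setting looks something like this (exact location depends on your Galaxy version; older releases use an ini-style galaxy.ini, newer ones a YAML galaxy.yml):

# galaxy.ini (older Galaxy releases)
[app:main]
watch_tool_data_dir = True

# galaxy.yml (newer Galaxy releases)
galaxy:
  watch_tool_data_dir: true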

ieguinoa commented 5 years ago

I tried again with v0.8 and it works ok:

Each data manager entry is run in a separate job, consecutively (for all the different genomes together), and, if watch_tool_data_dir is set, the dbkey is available for the next data manager. So I was wrong before: the problem only occurs with v0.9, which splits the executions into just all_fasta (sources) and the rest (indexing).

For the moment I am still splitting the config files in two: one that sets the dbkeys and the genome, and another for the rest. This is mainly because I don't want to impose any requirement on the ephemeris version; v0.9 is much faster as it submits most of the jobs together, but, for example, 0.8 is the latest version available in conda.

So, if there is a chance that the next version addresses this specific use case, that would be great.

Thanks for the help! Ignacio

rhpvorderman commented 5 years ago

@ieguinoa Thanks for letting us know that it works now.

The consecutive ordering is indeed more foolproof, but also a lot slower. I was annoyed by this, so I parallelized run-data-managers in this way. But since that probably does not fit all use cases, as your report demonstrates, more fine-grained control would be better.

It might be a while before this is implemented. To speed up testing we also have some work pending on the Galaxy representation in our test setup, which might delay the time until we can fix this issue. Luckily you have a workaround for now. We will give you a heads-up when we fix the issue.