NationalGenomicsInfrastructure / taca-ngi-pipeline

A TACA plugin adding functionality relevant for the ngi_pipeline
MIT License
3 stars 13 forks source link

TACA deliver updates #35

Closed senthil10 closed 6 years ago

senthil10 commented 6 years ago

This PR mainly addresses the following issues

I will explain them in details in new comments (one for each issue) for more readability 😅 Of course can be discussed further along the review

EDIT: They are already tested to some extent and it works as intended 👍

senthil10 commented 6 years ago

ISSUE 1 - GRUS related:

Currently hard staging step only hard stages sample directories, reports have to be copied manually after hard staging. After this PR, it will compile a list of files based on staged directory and ask the user if OK to hard stage them.

Example:

(NGI) Stage/171006.d876f53 [senthilp@irma4]$ taca -c /lupus/ngi/staging/171006.d876f53/conf/TACA/sthlm_taca_fastq_delivery.yml deliver --cluster grus project P7857
This project has been marked as SENSITIVE (option --sensitive). Do you want to proceed with delivery? y
2017-10-26 16:04:05,567 - taca_ngi_pipeline.deliver.deliver_grus - INFO - Proceeding with delivery of P7857 to GRUS with mover. Project marked as SENSITIVE=True

Following data will be delivered, go through list carefully and accept it if right. Cancel the delivery and cleanup before proceeding if there are unintended files

Project stagepath: /lupus/ngi/staging/wildwest//ngi2016003/nobackup/NGI/DELIVERY/P7857
Samples: P7857_201
Miscellaneous: 00-Reports, miscellaneous.lst, ACKNOWLEDGEMENTS.txt, miscellaneous.md5, 01-QC-Results

Proceed with delivery ?

To be able to deliver multiple projects to same GRUS project, really helpful for denovo cases where there are multiple sub-projects but it makes sense to deliver them to same delivery project as they are basically one project. It can also be useful when there are some projects need to be delivered together (for instance couple of projects for same PI)

Example:

(NGI) Stage/171006.d876f53 [senthilp@irma4]$ taca -c /lupus/ngi/staging/171006.d876f53/conf/TACA/sthlm_taca_fastq_delivery.yml deliver --cluster grus project P7857 --include-project P7858 --include-project P7859
2017-10-26 16:09:41,586 - taca_ngi_pipeline.cli - WARNING - Project(s) are (P7858 P7859) given by '--include-project' but no '--delivery-name' given. Will use 'P7857-P7858-P7859' as delivery name
This project has been marked as SENSITIVE (option --sensitive). Do you want to proceed with delivery? y
2017-10-26 16:09:58,605 - taca_ngi_pipeline.deliver.deliver_grus - INFO - Proceeding with delivery of P7857 to GRUS with mover. Project marked as SENSITIVE=True

Following data will be delivered, go through list carefully and accept it if right. Cancel the delivery and cleanup before proceeding if there are unintended files

Project stagepath: /lupus/ngi/staging/wildwest//ngi2016003/nobackup/NGI/DELIVERY/P7859
Samples: P7859_201, P7859_202
Miscellaneous: 00-Reports, miscellaneous.lst, ACKNOWLEDGEMENTS.txt, miscellaneous.md5, 01-QC-Results

Project stagepath: /lupus/ngi/staging/wildwest//ngi2016003/nobackup/NGI/DELIVERY/P7858
Samples: P7858_201
Miscellaneous: 00-Reports, miscellaneous.lst, ACKNOWLEDGEMENTS.txt, miscellaneous.md5, 01-QC-Results

Project stagepath: /lupus/ngi/staging/wildwest//ngi2016003/nobackup/NGI/DELIVERY/P7857
Samples: P7857_201
Miscellaneous: 00-Reports, miscellaneous.lst, ACKNOWLEDGEMENTS.txt, miscellaneous.md5, 01-QC-Results

Proceed with delivery ?
senthil10 commented 6 years ago

ISSUE 2 - Non redundant delivery:

The way now it works is, if any data to be delivered it should be in the section files_to_deliver in the config file.

Example config:

files_to_deliver:
        -
            - <DATAPATH>/<SAMPLEID>/*/*
            - <STAGINGPATH>/<SAMPLEID>/02-FASTQ
            - required: True
        -
            - /temp/irma_setup/ngi_stockholm/nobackup/NGI/softlinks/ACKNOWLEDGEMENTS.txt
            - <STAGINGPATH>

So the ACKNOWLEDGEMENTS.txt will staged/delivered every time for each sample (5 time if 5 samples). For a text file it is fine, it doesn't make much difference computationally but if we want to deliver any analysis results (other than WGS which would reside inside the sample's directory) it be very time consuming as the same results will be delivered again and again for each sample. This will look for new section misc_files_to_deliver and deliver those after processing the samples. And the responsible class method is called only for project level delivery.

Example config:

misc_files_to_deliver:
        -
            - /temp/irma_setup/ngi_stockholm/nobackup/NGI/softlinks/ACKNOWLEDGEMENTS.txt
            - <STAGINGPATH>
        -
            - <ANALYSISPATH>/reports/*
            - <STAGINGPATH>/00-Reports
senthil10 commented 6 years ago

ISSUE 3 - Updating meta info:

After the PR, the delivery script will look for a section save_meta_info in config. If available it will save the available metainfo (created while staging) in statusdb, if not it would complain so this should not affect NGI uppsala

Example config:

save_meta_info:
        status_db_credentials: "/Users/senthilkumarpanneerselvam/.ngi_config/statusdb.yaml"
        files_interested:
            - ".fastq"
            - ".bam"
senthil10 commented 6 years ago

@b97pla and @Hammarn Would be great if this PR is reviewed before next week. As I am leaving for vacation the week after, I can include this in next version of IRMA deployment. I know it looks like a really big PR, but I believe it should be quite straight forward. Tack :)

senthil10 commented 6 years ago

FYI: In irma4 the latest staging version has my version of taca-ngi-pipeline. So it could be used if the code need to be test 👍

source /lupus/ngi/staging/latest/conf/sourceme_{sthlm/upps}.sh
source activate NGI
senthil10 commented 6 years ago

Thank you @b97pla for the review. I thought would be better to have a summarising comment for the major concerns.

1) Keep updating statusdb out of this repo

Yes, it might be neat. But why we kept it here cause we are only interested in what we delivered. For the time being I think it can be here and in future instead of updating it to DB, I will make it to dump it in a JSON file and then use external program to pickup that JSON file.

2) Use rsync functionality in Deliverer to hard stage

It was initially developed by @vezzi and I think the reason why he did not just rsync the soft staged directory is cause he does some checks for each sample before hard staging.

3) Redundant code - deliver_misc_data

As I explained in the comment earlier I was not able to do it better without actually changing the native code as some specific things are hardcoded. I can try to make it better if ok to change the native code.

Let me know your thoughts @b97pla :)

senthil10 commented 6 years ago

@b97pla made this new commit for redundancy issue. Thank you for your suggestion regarding a sub class (ProjectMiscDeliverer). Even now it might look a little redundant but it is not avoidable due to original structure, of course definitely better solution than before :)

senthil10 commented 6 years ago

@b97pla and @Hammarn I will close this PR now and make a small PR just regarding copying miscellaneous files (as that would be pretty small change and we like ti to have ASAP).

After I come back from vacation I will do remaining stuff one by one slowly and in a better way 👍 Thank you very much @b97pla for good review :)