NationalGenomicsInfrastructure / piper

A genomics pipeline built on top of the GATK Queue framework

sthlm2UUSNP problem on M.Kaller #22

Closed vezzi closed 9 years ago

vezzi commented 10 years ago

I am trying to run some large tests, but I noticed that there is a bug in sthlm2UUSNP.

When running this command:

sthlm2UUSNP -i /proj/a2010002/nobackup/NGI/analysis_ready/DATA/M.Kaller_14_06/ -o /proj/a2010002/nobackup/NGI/analysis_ready/DATA_UUSNP/M.Kaller_14_06

I get the following, correct, folder structure:

DATA_UUSNP/M.Kaller_14_06/
`-- 140702_AC41A2ANXX
    |-- report.tsv
    |-- Sample_P1171_102
              .....
    |-- Sample_P1171_104
              .....
    |-- Sample_P1171_106
              .....
    `-- Sample_P1171_108
              .....

but the report.tsv file looks like this:

#SampleName     Lane    ReadLibrary     FlowcellId
P1171_108       1       A       AC41A2ANXX
P1171_108       2       A       AC41A2ANXX
P1171_108       3       A       AC41A2ANXX
P1171_108       4       A       AC41A2ANXX
P1171_108       5       A       AC41A2ANXX
P1171_108       6       A       AC41A2ANXX
P1171_108       7       A       AC41A2ANXX
P1171_108       8       A       AC41A2ANXX

Only one sample is present: 24 lines (8 for each of the other three samples) are missing. This clearly causes piper to fail.

In case you need it, here is the complete folder structure (IGN format):

tree  DATA/M.Kaller_14_06/
DATA/M.Kaller_14_06/
|-- P1171_102
|   `-- A
|       `-- 140702_AC41A2ANXX
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L001_R1_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L001_R2_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L002_R1_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L002_R2_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L003_R1_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L003_R2_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L004_R1_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L004_R2_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L005_R1_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L005_R2_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L006_R1_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L006_R2_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L007_R1_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L007_R2_001.fastq.gz
|           |-- P1171_102_ATTCAGAA-CCTATCCT_L008_R1_001.fastq.gz
|           `-- P1171_102_ATTCAGAA-CCTATCCT_L008_R2_001.fastq.gz
|-- P1171_104
|   `-- A
|       `-- 140702_AC41A2ANXX
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L001_R1_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L001_R2_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L002_R1_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L002_R2_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L003_R1_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L003_R2_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L004_R1_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L004_R2_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L005_R1_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L005_R2_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L006_R1_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L006_R2_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L007_R1_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L007_R2_001.fastq.gz
|           |-- P1171_104_ATTCAGAA-GGCTCTGA_L008_R1_001.fastq.gz
|           `-- P1171_104_ATTCAGAA-GGCTCTGA_L008_R2_001.fastq.gz
|-- P1171_106
|   `-- A
|       `-- 140702_AC41A2ANXX
|           |-- P1171_106_GAATTCGT-CCTATCCT_L001_R1_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L001_R2_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L002_R1_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L002_R2_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L003_R1_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L003_R2_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L004_R1_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L004_R2_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L005_R1_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L005_R2_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L006_R1_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L006_R2_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L007_R1_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L007_R2_001.fastq.gz
|           |-- P1171_106_GAATTCGT-CCTATCCT_L008_R1_001.fastq.gz
|           `-- P1171_106_GAATTCGT-CCTATCCT_L008_R2_001.fastq.gz
`-- P1171_108
    `-- A
        `-- 140702_AC41A2ANXX
            |-- P1171_108_GAATTCGT-GGCTCTGA_L001_R1_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L001_R2_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L002_R1_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L002_R2_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L003_R1_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L003_R2_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L004_R1_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L004_R2_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L005_R1_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L005_R2_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L006_R1_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L006_R2_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L007_R1_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L007_R2_001.fastq.gz
            |-- P1171_108_GAATTCGT-GGCTCTGA_L008_R1_001.fastq.gz
            `-- P1171_108_GAATTCGT-GGCTCTGA_L008_R2_001.fastq.gz
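
For reference, the rows one would expect in report.tsv can be derived from the file names in the listing above. The following is a minimal Python sketch, not piper's actual implementation; the function name `expected_rows` and the regular expression are assumptions made for illustration, inferred only from the naming pattern visible in the tree.

```python
import re

# Assumed file-name pattern, inferred from the listing above:
# <sample>_<index1>-<index2>_L<lane>_R<read>_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample>P\d+_\d+)_[ACGT]+-[ACGT]+_L(?P<lane>\d{3})_R[12]_001\.fastq\.gz$"
)

def expected_rows(paths, flowcell_id):
    """Return sorted, de-duplicated (sample, lane, library, flowcell) rows.

    Each path is assumed to look like <sample>/<library>/<run folder>/<fastq file>.
    """
    rows = set()
    for path in paths:
        parts = path.split("/")
        library, filename = parts[-3], parts[-1]
        match = FASTQ_RE.match(filename)
        if match is None:
            continue  # skip anything that does not follow the assumed pattern
        # R1 and R2 of the same lane collapse into a single row via the set.
        rows.add((match["sample"], int(match["lane"]), library, flowcell_id))
    return sorted(rows)

# Rebuild the 64 file paths shown in the tree (4 samples, 8 lanes, 2 reads).
samples = {
    "P1171_102": "ATTCAGAA-CCTATCCT",
    "P1171_104": "ATTCAGAA-GGCTCTGA",
    "P1171_106": "GAATTCGT-CCTATCCT",
    "P1171_108": "GAATTCGT-GGCTCTGA",
}
paths = [
    f"{s}/A/140702_AC41A2ANXX/{s}_{idx}_L{lane:03d}_R{read}_001.fastq.gz"
    for s, idx in samples.items()
    for lane in range(1, 9)
    for read in (1, 2)
]
rows = expected_rows(paths, "AC41A2ANXX")
print(len(rows))  # 32 rows: 4 samples x 8 lanes
```

With the header line added, this gives the 33 lines a complete report.tsv should contain for this flowcell.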

johandahlberg commented 10 years ago

I'll look into this tomorrow.

johandahlberg commented 10 years ago

@vezzi Checkout 64a892b203104e947e504841873031753afa07ef, I think that this is fixed now. I tested on the files that you sent me, and everything looks ok now. Let me know if there are any further problems with this, or if we can close this issue.

vezzi commented 10 years ago

Great, Johan... I am pretty close to starting to update charon with alignment results, and after that I only need to write the part that triggers the best practice analysis....

Maybe Friday I will be able to run the pipeline on the 7 samples we have so far!!!!!

johandahlberg commented 9 years ago

:thumbsup: Can you confirm that this is working now? Just want to make sure that I haven't missed anything. Looking forward to the moment when all parts are in place and we can push a run through it.

vezzi commented 9 years ago

I will give it a try this afternoon; right now I do not want to mess too much with my current test folder.

johandahlberg commented 9 years ago

Ok. :smile_cat:

vezzi commented 9 years ago

Not yet solved....

sthlm2UUSNP -i /proj/a2010002/nobackup/NGI/analysis_ready/DATA/M.Kaller_14_06/ -o /proj/a2010002/nobackup/NGI/analysis_ready/DATA_UUSNP/M.Kaller_14_06
wc -l /proj/a2010002/nobackup/NGI/analysis_ready/DATA_UUSNP/M.Kaller_14_06/140702_AC41A2ANXX/report.tsv
33 
sthlm2UUSNP -i /proj/a2010002/nobackup/NGI/analysis_ready/DATA/M.Kaller_14_06/ -o /proj/a2010002/nobackup/NGI/analysis_ready/DATA_UUSNP/M.Kaller_14_06
wc -l 
65

Can you remove the append and recreate the file each time?

The problem is that right now I am recreating the tsv file once for each sample. This needs to change, but for now it simplifies my life.
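
The line counts above are consistent with appending: 33 lines is one header plus 32 rows (4 samples x 8 lanes), and 65 = 33 + 32 suggests that the second run added 32 rows without a second header (an inference from the counts, not something confirmed in the thread). The difference between appending and recreating the file can be sketched in Python as follows; `write_report` and `count_lines` are hypothetical helpers for illustration, not piper's code.

```python
import os
import tempfile

HEADER = "#SampleName\tLane\tReadLibrary\tFlowcellId\n"

def write_report(path, rows, append):
    """Write report rows; in append mode the header is only written for a new file."""
    is_new = not os.path.exists(path)
    mode = "a" if append else "w"  # "w" truncates any existing file first
    with open(path, mode) as handle:
        if is_new or not append:
            handle.write(HEADER)
        for row in rows:
            handle.write("\t".join(str(field) for field in row) + "\n")

def count_lines(path):
    """Equivalent of `wc -l` for a file that ends with a newline."""
    with open(path) as handle:
        return sum(1 for _ in handle)

rows = [(f"P1171_{s}", lane, "A", "AC41A2ANXX")
        for s in (102, 104, 106, 108) for lane in range(1, 9)]

with tempfile.TemporaryDirectory() as tmp:
    report = os.path.join(tmp, "report.tsv")
    write_report(report, rows, append=True)
    first = count_lines(report)
    write_report(report, rows, append=True)   # second run duplicates every row
    second = count_lines(report)
    write_report(report, rows, append=False)  # recreating the file restores it
    third = count_lines(report)

print(first, second, third)  # 33 65 33
```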

johandahlberg commented 9 years ago

Do you run it without removing the old folder structure first? That will probably cause this problem. I'd recommend removing the old folder structure and then recreating it. Since it's all hard links there is really no cost to doing so.

However, if it's important to you that the same command can be run more than once, appending only the new information to the file, I can fix that by adding some extra logic to the app.

vezzi commented 9 years ago

The second time I rerun it I do it without removing the previously created folder structure.

I need it?... right now I do not really like what I do: given a flowcell I scan one by one the samples of that flowcell and I do:

Clearly, rerunning sth2UUSNPSEQ for each sample is not optimal, but for the moment it saves me from adding too much logic to check whether I have already created the UUSNPSEQ folder structure for that flowcell.

Anyway, when a new flowcell is delivered I need to rerun Sthl2UUSNPSEQ, and this will recreate the current problem. I cannot delete the folder structure, because at that moment the data could be in use by a running instance of piper.

The best solution would be to call sthl2UUSNPSEQ at flowcell level, like this

sthl2UUSNPSEQ NGI_project_format UUSNPSEQ_dir FLOWCELL

This will create a new directory in UUSNPSEQ_dir for the new FLOWCELL run. This way I can check whether UUSNPSEQ_dir/FLOWCELL exists and decide whether or not to run the command. At that point you are not required to add extra logic to avoid appending already existing fields.

I do not know which solution is best (or simplest) for you. I would prefer the second one (build FLOWCELL-specific UUSNPSEQ folders), but if modifying the current version is easier for you, that is fine with me.
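
The caller-side check described here (run the conversion only if UUSNPSEQ_dir/FLOWCELL does not exist yet) could look roughly like the Python sketch below. The function name `flowcells_to_convert` and the second flowcell name are made up for illustration; nothing here is taken from piper or from the pipeline code.

```python
import tempfile
from pathlib import Path

def flowcells_to_convert(uusnpseq_dir, flowcells):
    """Return the flowcells that have no folder yet under the output root."""
    root = Path(uusnpseq_dir)
    return [fc for fc in flowcells if not (root / fc).is_dir()]

with tempfile.TemporaryDirectory() as tmp:
    # Pretend one flowcell has already been converted.
    (Path(tmp) / "140702_AC41A2ANXX").mkdir()
    # The second name is an illustrative, newly delivered flowcell.
    todo = flowcells_to_convert(tmp, ["140702_AC41A2ANXX", "140710_AC41A2ANXX"])

print(todo)  # ['140710_AC41A2ANXX']
```

Because existing flowcell folders are never touched, a running piper instance that reads from them is not disturbed.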

johandahlberg commented 9 years ago

Just checking that this is what you want, and if so it should be easy to implement:

./sthlm2UUSNP  --input_root <sthlm project root folder>  --out_root <root of uppsala style project> --flowcell <restrict the creation to the following flowcell>

Just give me a thumbs up that this is what you want, and I'll fix it asap.

vezzi commented 9 years ago

Thumbs up

On 11 Aug 2014, at 16:09, Johan Dahlberg notifications@github.com wrote:

Just checking that this is what you want, and if so it should be easy to implement:

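./sthlm2UUSNP --input_root <sthlm project root folder> --out_root <root of uppsala style project> --flowcell <restrict the creation to the following flowcell>

Just give me a thumbs up that this is what you want, and I'll fix it asap.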

johandahlberg commented 9 years ago

This should be fixed as of v1.2.0-beta16.

Here is an example of how to run it:

./target/pack/bin/sthlm2UUSNP --input_root src/test/resources/testdata/Sthlm2UUTests/sthlm_runfolder_root --out_root test --flowcell 140710_AC41A2ANXX

@vezzi test it and see if it does what you want it to do.

vezzi commented 9 years ago

I tried it, and if possible I would like to overwrite the existing .*tsv file rather than appending to it.

If I run

sthlm2UUSNP --input_root /proj/a2010002/nobackup/NGI/analysis_ready/DATA/A.Wedell_13_03/ -o /proj/a2010002/nobackup/NGI/analysis_ready/DATA_UUSNP/A.Wedell_13_03/ --flowcell  130611_AH0CCVADXX

on a project I already converted, I get:

cat A.Wedell_13_03/130611_AH0CCVADXX/*tsv
#SampleName     Lane    ReadLibrary     FlowcellId
P567_101        1       A       AH0CCVADXX
P567_101        2       A       AH0CCVADXX
P567_101        1       A       AH0CCVADXX
P567_101        2       A       AH0CCVADXX

i.e., the last two lines are repeated. We can check this on the pipeline side without much effort (and we will do so anyway, to avoid recreating files that have already been created), but to me this appending issue looks like unexpected behaviour of the program.
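
A pipeline-side check of the kind mentioned here could be as simple as dropping repeated lines before the report is used. This is only a Python sketch under that assumption; `drop_duplicate_rows` is a hypothetical helper, not part of piper or piper_ngi.

```python
def drop_duplicate_rows(lines):
    """Keep the first occurrence of each line, preserving the original order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

# The report content shown above, with tab-separated columns.
report = (
    "#SampleName\tLane\tReadLibrary\tFlowcellId\n"
    "P567_101\t1\tA\tAH0CCVADXX\n"
    "P567_101\t2\tA\tAH0CCVADXX\n"
    "P567_101\t1\tA\tAH0CCVADXX\n"
    "P567_101\t2\tA\tAH0CCVADXX\n"
).splitlines()

cleaned = drop_duplicate_rows(report)
print(len(report), len(cleaned))  # 5 3
```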

johandahlberg commented 9 years ago

You are absolutely right that this is unexpected behavior. I'll make sure to fix it now. Will let you know when it's done.

johandahlberg commented 9 years ago

@vezzi Try it now!

vezzi commented 9 years ago

@johandahlberg works. @mariogiov sthlm2UUSNP works as expected; you can remove the patch I added in piper_ngi.

johandahlberg commented 9 years ago

:smiley_cat: