NationalGenomicsInfrastructure / piper

A genomics pipeline built on top of the GATK Queue framework

Directory tree switch #6

Closed mariogiov closed 10 years ago

mariogiov commented 10 years ago

We discussed previously that at some point we would like Stockholm and Uppsala to use the same project directory structure. In our standard Production environment, we will not make this switch until the pipeline infrastructure we're developing here replaces the old one; however, for the IGN samples, I have already changed (or perhaps reverted is a better word) our file naming system to match Uppsala's/Illumina's. This actually makes things a little more complex at this stage because the sthlm2UUSNP script expects the old format (I symlink new files to match this format).

@johandahlberg, I don't know how much work this will be on your end but should we consider making the switch in Piper sometime soon? I think some of the problems we're having now relate to difficulties in switching from one style of directory tree to the other. If there are more pressing development tasks at the moment though this can of course wait as it is "non-blocking."

Just to summarize, the directory structure / file naming system I understand we were planning to adopt is:

T.Durden_14_01
└── P1142_101
    └── 140528_BC423WACXX
        ├── P1142_101_NoIndex_L001_R1_001.fastq.gz
        ├── P1142_101_NoIndex_L001_R2_001.fastq.gz
        ├── P1142_101_NoIndex_L002_R1_001.fastq.gz
        ├── P1142_101_NoIndex_L002_R2_001.fastq.gz
        ├── P1142_101_NoIndex_L003_R1_001.fastq.gz
        ├── P1142_101_NoIndex_L003_R2_001.fastq.gz
        ├── P1142_101_NoIndex_L004_R1_001.fastq.gz
        └── P1142_101_NoIndex_L004_R2_001.fastq.gz

Which is to say:

<project_name>
└── <sample_name>
    └── <date_fcid>
        └── <sample>_<index>_<lane>_<read_num>_<whatever>.fastq.gz
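
A minimal sketch of how the naming scheme above could be parsed and turned into a path, assuming lanes look like `L001` and reads like `R1`/`R2` as in the example, and that the index field contains no underscore. The names `FASTQ_RE`, `parse_fastq_name` and `fastq_path` are made up for illustration and are not part of Piper:

```python
import re
from pathlib import PurePosixPath

# <sample>_<index>_<lane>_<read_num>_<whatever>.fastq.gz
# Assumptions (taken from the example listing only): lane is "L" plus three
# digits, read is R1 or R2, and index/whatever contain no underscores.
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_(?P<index>[^_]+)_(?P<lane>L\d{3})"
    r"_(?P<read>R[12])_(?P<rest>[^_]+)\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Split a fastq file name into its fields, or raise ValueError."""
    match = FASTQ_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected fastq name: {filename}")
    return match.groupdict()


def fastq_path(project, date_fcid, filename):
    """Build <project_name>/<sample_name>/<date_fcid>/<file> for one file."""
    fields = parse_fastq_name(filename)
    return PurePosixPath(project) / fields["sample"] / date_fcid / filename


print(fastq_path("T.Durden_14_01", "140528_BC423WACXX",
                 "P1142_101_NoIndex_L001_R1_001.fastq.gz"))
```

The sample name is matched greedily because it can itself contain underscores (`P1142_101`); the fixed-shape lane and read fields anchor the rest of the match.
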
vezzi commented 10 years ago

Hi Mario, in any case we will start the pipeline from the original demultiplexed folder, so from there you can organise the folder structure as you wish. I do not think you need to convert the files twice; you can add an option to write the data in one format or the other.
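
To illustrate the "option" idea, here is a rough sketch that symlinks files from a demultiplexed folder into a chosen layout. Everything in it is an assumption: the function names, the `layout` argument, and in particular the `"flat"` layout, which is only a hypothetical stand-in because the old Stockholm format is not spelled out in this thread.

```python
from pathlib import Path


def sample_of(filename):
    """Return the sample name from <sample>_<index>_<lane>_<read>_<rest>."""
    # Split from the right because the sample name may contain underscores.
    sample, _index, _lane, _read, _rest = filename.rsplit("_", 4)
    return sample


def link_fastqs(fastq_files, dest_root, project, date_fcid, layout="nested"):
    """Symlink fastq files under dest_root using the requested layout.

    "nested": <project>/<sample>/<date_fcid>/<file>
    "flat":   <project>/<file>  (hypothetical alternative layout)
    """
    links = []
    for src in map(Path, fastq_files):
        if layout == "nested":
            target_dir = Path(dest_root) / project / sample_of(src.name) / date_fcid
        elif layout == "flat":
            target_dir = Path(dest_root) / project
        else:
            raise ValueError(f"unknown layout: {layout}")
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / src.name
        # Skip links that already exist so the function can be re-run.
        if not link.is_symlink():
            link.symlink_to(src.resolve())
        links.append(link)
    return links
```

Because only symlinks are created, the demultiplexed data stays untouched and several layouts can coexist side by side.
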

F:

mariogiov commented 10 years ago

@vezzi part of this issue is also motivated by self-interest as I'm trying to avoid maintaining a separate branch for each filesystem setup.

mariogiov commented 10 years ago

This is a good point though -- it occurs to me now that I don't know how most analysis engine tools we might want to use will expect the data to be organized. Hm.

vezzi commented 10 years ago

We need to have a single way to represent the projects, and I think that the Stockholm representation with the Uppsala (i.e. Illumina) naming convention is the best.

On the other hand, if the engines start from the demultiplexed folder we can organise our data as we wish, but I would prefer to have a common structure for better future maintenance.

Under

/proj/a2010002/INBOX

I am saving the Illumina flowcell with demultiplexed data for our 9 30X whole human genomes. I am still waiting for A.Wedell_13_03, but it should be there by the end of today!

I will then set the group owner to a2010002 so that all of us can access it. I will keep under

/proj/a2010002/nobackup/INBOX/

the data I have transferred so far, including the Stockholm-formatted 30X projects. That folder will be ready late this afternoon.

johandahlberg commented 10 years ago

@mariogiov Making the switch in Piper shouldn't be too much of a problem (but it is of course something that needs to be done eventually). The greater issue is making the changes to Sisyphus, and I think that will have to wait until after the vacations. So I'd guess that this is something that we could have in production in late August or early September.

But I guess that you still have a point: if you want to keep things engine-agnostic, you would still need a way to convert to different formats. Being a Java/Scala guy, I'd go with a class to represent a project and then have each "engine adaptor" produce its own representation. I guess that the same kind of scheme is possible in Python (if it quacks like a duck it is a duck, and all that jazz)...
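
Roughly what I mean, sketched in Python rather than Scala. All class and method names here are invented for the example and do not exist in Piper; `FlatAdaptor` stands in for some hypothetical engine that wants a different layout.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    name: str
    # Maps a run identifier (<date_fcid>) to the fastq file names of that run.
    runs: dict = field(default_factory=dict)


@dataclass
class Project:
    name: str
    samples: list = field(default_factory=list)


class IlluminaStyleAdaptor:
    """Represents files as <project>/<sample>/<date_fcid>/<file>."""

    def paths(self, project):
        return [
            f"{project.name}/{sample.name}/{run}/{fastq}"
            for sample in project.samples
            for run, fastqs in sample.runs.items()
            for fastq in fastqs
        ]


class FlatAdaptor:
    """Hypothetical engine that wants every file directly under the project."""

    def paths(self, project):
        return [
            f"{project.name}/{fastq}"
            for sample in project.samples
            for fastqs in sample.runs.values()
            for fastq in fastqs
        ]


def layout_for(project, adaptor):
    # Duck typing: anything with a paths(project) method works as an adaptor.
    return adaptor.paths(project)
```

The project model is built once, and each engine only needs to supply an object with a `paths` method, so no shared base class is required.
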

johandahlberg commented 10 years ago

This should be fixed by aea7e2ec079fc7127860b6114ee29577c3a981bc. @mariogiov, can you try it out? Then I'll close this.