chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

merging of demuxed fastq files and project-based analyses #48

Closed: percyfal closed this issue 12 years ago

percyfal commented 12 years ago

Hi Brad,

more of a question than an issue. I noticed you've added code (bcbio.pipeline.sample.merge_sample) to merge samples across lanes. I've been using save_diskspace=true in order to remove sam files, but I noticed that this also removes the demultiplexed files, right? I just want to make sure because it affects our data delivery routines, as outlined below.

In our setup, we have situations when we run several projects on one lane, which we distinguish with an extra "description" tag in run_info, so in principle each barcode could have a description with a different project name. We then partition fastq files in a lane based on the description tag when delivering data to customers.

On a similar note, when I do analyses for customers, I've been doing it on a project-by-project basis (it makes more sense to me), and have therefore written helper scripts for this purpose (project_*; EDIT: see https://github.com/percyfal/bcbb/tree/develop/nextgen/scripts). project_analysis_pipeline.sh is almost a copy of automated_initial_analysis.py, but starts off with demultiplexed files. Have you had this functionality in mind (or is it even already there)?

Cheers,

Per

chapmanb commented 12 years ago

Per; You can avoid removing the demultiplexed fastq files if the configuration variable 'upload_fastq' (under algorithm) is set to true. This lets the process know that you need those downstream.

Multiplexing should be handled in the current automated_initial_analysis.py script. You want to specify the multiplex details in your input run_info.yaml:

https://github.com/chapmanb/bcbb/blob/master/nextgen/config/run_info.yaml#L33

For the per-lane merging, if you set a different name for each of the multiplexed items they will be kept separate for downstream processing. The merging is only a convenience in case you have a sample on multiple lanes. So keep each item uniquely identified by its name and they will stay separate throughout the process.

It sounds like this is what you are trying to accomplish so it might just take a bit of tweaking in the run_info.yaml. Let me know if I've misunderstood or you need any other details. Thanks, Brad

percyfal commented 12 years ago

Ok, thanks for the info, I'll try out the 'upload_fastq' setting.

When it comes to the multiplexing we do use the multiplexing details. I realize I left out two important parts in the description of my analysis setup:

1) I copy the demultiplexed files to a separate project folder residing elsewhere in the directory tree. For me this just makes it conceptually simpler, especially when a project is run on several flowcells where it makes more sense to put the data in a project folder.

2) Starting off from 1), I don't want to rely on the raw data info in the flowcell (/Data/Intensities/BaseCalls/fastq/*fastq.txt). This is because we archive the raw data on tape pretty soon after sequencing, whereas analysis of a project may continue for quite a while. I've tried to run automated_initial_analysis, but it complains about not finding the fastq.txt files; hence my comment on starting off from "demultiplexed files". I could of course create a "mock" raw data directory with empty files, but I opted for writing a project-based analysis script.

In any case, my project_analysis* scripts solve these issues. Any thoughts about this setup?

Cheers,

Per

chapmanb commented 12 years ago

Per; That makes good sense. If your fastq files are organized differently, you can specify the names of the files directly in the run_info.yaml with 'files'. Here is an example (with a BAM file, but they can also be fastq inputs):

https://github.com/chapmanb/bcbb/blob/master/nextgen/tests/data/automated/run_info-bam.yaml

Then you just need to put all the fastqs for a project into a directory (that is passed to automated_initial_analysis.py) and specify the association of the files with the samples in the run_info.yaml.

I use this setup with all of our projects that come from other sources and it's pretty flexible. Let me know if this doesn't work for you and I'm happy to generalize more, Brad
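As a rough illustration of the "put all the fastqs for a project into a directory and associate them with samples" step, here is a minimal Python sketch that groups file names by sample. The naming convention `<sample>_<read>_fastq.txt` is an assumption taken from the example file names used later in this thread, and `group_fastq_by_sample` is a hypothetical helper, not part of bcbb.

```python
import re
from collections import defaultdict

# Assumed naming convention (not enforced by bcbb): <sample>_<read>_fastq.txt,
# where <read> is 1 or 2 for the two ends of a pair.
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])_fastq\.txt$")


def group_fastq_by_sample(filenames):
    """Map each sample name to its fastq files, ordered by read number.

    Hypothetical helper: files that do not match the assumed convention
    are silently skipped.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            continue
        groups[match.group("sample")].append((int(match.group("read")), name))
    # Sort reads within a sample and samples by name for stable output.
    return {
        sample: [name for _, name in sorted(reads)]
        for sample, reads in sorted(groups.items())
    }


# Example with a literal list; in practice this could be os.listdir(project_dir).
example_files = [
    "sample2_2_fastq.txt",
    "sample1_1_fastq.txt",
    "sample2_1_fastq.txt",
    "sample1_2_fastq.txt",
]
for sample, files in group_fastq_by_sample(example_files).items():
    print(sample, files)
```

The resulting mapping is what would then be written by hand, or by a script, into the 'files' entries of run_info.yaml.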

percyfal commented 12 years ago

I should have known you had come up with something smart for this situation. I tried it and it works beautifully. Just a couple of questions:

1) does the fastq-containing directory have to conform to a flowcell name (i.e. date_flowcell)?

2) can filenames include directories (relative or absolute) in the 'files'?

Per

chapmanb commented 12 years ago

Per; Awesome, glad that works for you.

  1. No need to conform to any type of naming convention. They just have to match what you put in the YAML file.
  2. It does support relative directories, and I just checked in support for absolute as well: https://github.com/chapmanb/bcbb/commit/b79d157a99ab68d5953762afb308c2786cadd276

Let me know if you run into anything else at all. Thanks again for the feedback, Brad

percyfal commented 12 years ago

Hi again,

I have a new problem with the demultiplexing. I'd better explain with an example run_info.yaml:

details:
  - description: project1
    files: [sample1_1_fastq.txt, sample1_2_fastq.txt, sample2_1_fastq.txt, sample2_2_fastq.txt]
    lane: '1'
    analysis: my_analysis
    genome_build: hg19

Initially this file had a 'multiplex' section specifying barcode ids etc. However, this led automated_initial_analysis.py to demultiplex the already demultiplexed files... Removing the multiplex section, on the other hand, leads to a merge of sample1 and sample2, which is not what I want. What I want is for several samples per lane to be kept separate in the downstream process, and for demultiplexing not to be performed. Currently I use the bcbb internal barcode ids to separate samples - could this be freetext when using the "files" key?

I've browsed the code and it would seem that a lane-level flag (e.g. "demultiplex: false") is needed to stop "split_by_barcode". I'm just wondering at what level this is appropriate - I guess that "process_lane" needs to be run to collect lane info, so it should go there?

Cheers,

Per

chapmanb commented 12 years ago

Per; If you have everything demultiplexed and want to run them separate, you just need unique descriptions and lanes for each:

details:
  - description: sample1
    files: [sample1_1_fastq.txt, sample1_2_fastq.txt]
    lane: '1'
    analysis: my_analysis
    genome_build: hg19
  - description: sample2
    files: [sample2_1_fastq.txt, sample2_2_fastq.txt]
    lane: '2'
    analysis: my_analysis
    genome_build: hg19

Without the multiplexing information it won't demultiplex and will run the files straight as is. Does this do what you want?
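For projects with many samples, writing one entry per sample by hand gets tedious. The following Python sketch generates entries in the layout shown above, giving each sample its own description and lane number. `build_details` and `to_yaml` are hypothetical helpers, not bcbb functions, and the tiny YAML writer assumes plain names with no characters that would need quoting.

```python
def build_details(sample_files, analysis, genome_build):
    """Return one run_info 'details' entry per sample.

    sample_files maps a sample name to its list of fastq files.
    Each sample gets a unique description and a unique lane string,
    so that samples stay separate during analysis.
    """
    details = []
    for lane, (sample, files) in enumerate(sorted(sample_files.items()), start=1):
        details.append({
            "description": sample,
            "files": list(files),
            "lane": str(lane),
            "analysis": analysis,
            "genome_build": genome_build,
        })
    return details


def to_yaml(details):
    """Render the entries as YAML text (minimal writer, simple values only)."""
    lines = ["details:"]
    for entry in details:
        lines.append("  - description: {}".format(entry["description"]))
        lines.append("    files: [{}]".format(", ".join(entry["files"])))
        lines.append("    lane: '{}'".format(entry["lane"]))
        lines.append("    analysis: {}".format(entry["analysis"]))
        lines.append("    genome_build: {}".format(entry["genome_build"]))
    return "\n".join(lines)


samples = {
    "sample1": ["sample1_1_fastq.txt", "sample1_2_fastq.txt"],
    "sample2": ["sample2_1_fastq.txt", "sample2_2_fastq.txt"],
}
print(to_yaml(build_details(samples, "my_analysis", "hg19")))
```

With the two samples above, this prints the same YAML as the hand-written example.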

percyfal commented 12 years ago

I actually came to that conclusion now, as I was trying to add a demultiplex flag to "demultiplex.py" and realized that the resulting filenames will conflict with the downstream analysis. Well, as long as there is no limit to 8 lanes in the pipeline :) Merging different samples into one bam file would work if read groups were used (they're not, right?).

Anyway, many thanks for your input.

/P

chapmanb commented 12 years ago

Per; There is no limit on lane numbers: they are only used to keep the filenames unique in the pipeline during analysis since you can't rely on unique sample names or input filenames.

The pipeline does add read group information to the BAM files, since this is required for GATK. I'm a bit confused by your comments on merging: do you want them merged or kept separate? Unique names will keep them separate while using the same names will result in merging after alignment: so you can get either behavior.
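To check in advance which entries of a run_info.yaml would end up merged under the rule described here (same name means merged, unique names stay separate), a small Python sketch can group entries by description. This only illustrates the rule as stated in this thread; it is not the pipeline's own merge code, and `plan_merges` is a hypothetical name.

```python
from collections import defaultdict


def plan_merges(details):
    """Group lane identifiers by description.

    Entries sharing a description would be merged after alignment;
    a description that maps to a single lane stays on its own.
    """
    lanes_by_name = defaultdict(list)
    for entry in details:
        lanes_by_name[entry["description"]].append(entry["lane"])
    return dict(lanes_by_name)


example = [
    {"description": "sampleA", "lane": "1"},
    {"description": "sampleA", "lane": "2"},  # same name: merged with lane 1
    {"description": "sampleB", "lane": "3"},  # unique name: kept separate
]
for name, lanes in plan_merges(example).items():
    status = "merged" if len(lanes) > 1 else "separate"
    print(name, lanes, status)
```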

percyfal commented 12 years ago

I was referring to merging different samples in one bam file, not same sample over several lanes (if that is what you meant). Sorry for the confusion (it's late here...). I've launched my jobs and it works fine now, I'll be running the samples separately for now. I have a much better understanding of how to set it up now.

Cheers

chapmanb commented 12 years ago

Per; Ah sorry, that makes perfect sense now. The pipeline doesn't do any creation of multi-sample BAM files. I'd suggest using Picard to do that post-processing. Glad that is all working and thanks for all the feedback.
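For the Picard post-processing step, one option is the MergeSamFiles tool. The Python sketch below only assembles the command line and does not run it. The jar location and file names are placeholders, and the `java -jar picard.jar MergeSamFiles I=... O=...` form is the one used by recent Picard releases; older releases shipped each tool as its own jar (for example MergeSamFiles.jar), so adjust to your installation. `picard_merge_command` is a hypothetical helper.

```python
def picard_merge_command(input_bams, output_bam, picard_jar="picard.jar"):
    """Build an argument list for Picard MergeSamFiles.

    picard_jar is a placeholder path. Each input BAM should already carry
    its own read group so that samples remain distinguishable after merging.
    """
    if len(input_bams) < 2:
        raise ValueError("need at least two BAM files to merge")
    command = ["java", "-jar", picard_jar, "MergeSamFiles"]
    command.extend("I={}".format(bam) for bam in input_bams)
    command.append("O={}".format(output_bam))
    return command


# Placeholder file names; pass the list to subprocess.check_call to run it.
print(" ".join(picard_merge_command(["sample1.bam", "sample2.bam"], "project1.bam")))
```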