chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

Sample id in lane #49

Closed percyfal closed 12 years ago

percyfal commented 12 years ago

For analyses where samples are put in separate lanes, add optional 'sample_id' to lane_name

percyfal commented 12 years ago

Some background: I've successfully set up lane-based sample analyses. However, the naming semantics become confusing. When working on real flowcells, lane is a numerical quantity, and there is little sense in including a sample id because of multiplexing. In the other setting, where we are working on a 'virtual' flowcell, the only way to include the sample id in the file names in downstream processing is to set e.g. 'lane: 1_SAMPLEID', right? Semantically, it would be clearer if lane were always a numerical quantity, IMO. Of course there's a trade-off: the config files will need yet another field.

chapmanb commented 12 years ago

Per; Is your goal to get merging correct for samples in multiple lanes or to put sample names in the output files for distribution? For the first, that is done using the "name" parameter, which seems equivalent to your sample_id:

https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/pipeline/merge.py#L35

The code falls back on "description" if name is not provided.
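To illustrate the fallback described above, here is a minimal sketch of grouping lanes by sample. It is not the code in merge.py: the function name `group_by_sample` and the list-of-dicts layout are assumptions made for illustration, and only the "name" and "description" keys come from this thread.

```python
from collections import defaultdict


def group_by_sample(lane_items):
    """Group lane numbers by sample, keyed on 'name' with 'description' as fallback.

    Hypothetical helper for illustration; not the bcbb implementation.
    """
    groups = defaultdict(list)
    for item in lane_items:
        # Use "name" when present and non-empty, otherwise fall back to "description".
        key = item.get("name") or item["description"]
        groups[key].append(item["lane"])
    return dict(groups)


# Made-up lane entries: two lanes share a name, one has only a description.
lanes = [
    {"lane": 1, "name": "sampleA", "description": "Tumor biopsy"},
    {"lane": 2, "name": "sampleA", "description": "Tumor biopsy rerun"},
    {"lane": 3, "description": "Normal control"},
]
merged = group_by_sample(lanes)
```

Lanes 1 and 2 end up in one group because they share a "name", while lane 3 is grouped under its description.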

If the goal is to prepare filenames for researchers with their supplied descriptions, we handle that post-analysis when moving the alignment, VCF and QC files out of the work directory. If you set "description_filename" to true for a sample then the output files will be named via description:

https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/upload_to_galaxy.py#L90

This is not the default since you can't always rely on the descriptions to be unique, but if you are setting them yourself you can use this for the equivalent of sample_id.
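A rough sketch of how such a flag could select the output name follows. It is not taken from upload_to_galaxy.py: the function `output_basename`, the `internal_id` argument and the example values are hypothetical, and only the "description_filename" and "description" keys come from this thread.

```python
def output_basename(sample, internal_id):
    """Pick the base name for files copied out of the work directory.

    Hypothetical helper: uses the description only when the sample opts in
    via "description_filename", otherwise keeps the internal identifier.
    """
    if sample.get("description_filename"):
        return sample["description"]
    return internal_id


# Made-up sample entries; the identifier string is an invented example.
opted_in = {"description": "Tumor_biopsy", "description_filename": True}
default = {"description": "Tumor_biopsy"}

named = output_basename(opted_in, "1_120217_FC1")
unnamed = output_basename(default, "1_120217_FC1")
```

Keeping the flag off by default means a non-unique description can never silently overwrite another sample's files.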

Do these work for what you want to do? Brad

percyfal commented 12 years ago

Brad,

it's the second case I'm referring to. I'll look it up. I take it then that you always refer to lane by number?

Thanks again,

Per


chapmanb commented 12 years ago

Per; The processing steps all use the lane + flowcell + barcode_id as the unique identifiers. The worry with using user-provided sample names is that they might not be unique and could have spaces or other special characters that could mess with all the software being called. The "description_filename" option is designed to let us use the guaranteed-to-work identifiers during processing, but then produce names that are meaningful to biologists when prepping the files. Let me know if this doesn't work for you. Brad
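The trade-off above can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: the helper names, the underscore separator, the field order and the set of "safe" characters are all invented here; only the three fields lane, flowcell and barcode_id come from this thread.

```python
import re


def processing_id(lane, flowcell, barcode_id=None):
    """Build an identifier from lane, flowcell and optional barcode_id.

    Hypothetical format: the separator and ordering are assumptions.
    """
    parts = [str(lane), flowcell]
    if barcode_id is not None:
        parts.append(str(barcode_id))
    return "_".join(parts)


def is_filename_safe(name):
    """Return True if name holds only letters, digits, '_', '.' or '-'."""
    return re.fullmatch(r"[A-Za-z0-9_.-]+", name) is not None


# Invented example values.
internal = processing_id(1, "120217_FC1", 3)
internal_ok = is_filename_safe(internal)
user_name_ok = is_filename_safe("Tumor biopsy #2")
```

An identifier built from controlled fields passes the check, while a free-text sample name with spaces or '#' does not.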

percyfal commented 12 years ago

Brad,

your approach sounds reasonable, and yes, the issue of non-unique, weird sample naming occurs here too (resolving these issues is a pain...). Disregard this pull request, I'll adopt the description_filename approach - the key, then, is to use unique 'description' fields.

Cheers,

Per
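Since the description_filename approach relies on unique 'description' fields, a small pre-flight check could look like the sketch below. The function name `duplicate_descriptions` and the list-of-dicts layout are assumptions for illustration; nothing like it is confirmed to exist in bcbb.

```python
from collections import Counter


def duplicate_descriptions(samples):
    """Return the sorted descriptions that occur more than once.

    Hypothetical check to run before enabling "description_filename".
    """
    counts = Counter(sample["description"] for sample in samples)
    return sorted(desc for desc, n in counts.items() if n > 1)


# Made-up sample entries with one repeated description.
samples = [
    {"lane": 1, "description": "P001_101"},
    {"lane": 2, "description": "P001_102"},
    {"lane": 3, "description": "P001_101"},
]
dupes = duplicate_descriptions(samples)
```

An empty result means every description is unique and the output names will not collide.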
