Closed percyfal closed 12 years ago
Some background: I've successfully set up lane-based sample analyses. However, the naming semantics become confusing. When working on real flowcells, lane is a numerical quantity, and there is little sense in including a sample id because of multiplexing. In the other setting, where we are working on a 'virtual' flowcell, the only way to include the sample id in the file names in downstream processing is to set e.g. 'lane: 1_SAMPLEID', right? Semantically, it would be clearer if lane were a numerical quantity, IMO. Of course there's a trade-off: the config files will need yet another field.
Per; Is your goal to get merging correct for samples in multiple lanes or to put sample names in the output files for distribution? For the first, that is done using the "name" parameter, which seems equivalent to your sample_id:
https://github.com/chapmanb/bcbb/blob/master/nextgen/bcbio/pipeline/merge.py#L35
The code falls back on "description" if name is not provided.
If the goal is to prepare filenames for researchers with their supplied descriptions, we handle that post-analysis when moving the alignment, VCF and QC files out of the work directory. If you set "description_filename" to true for a sample then the output files will be named via description:
https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/upload_to_galaxy.py#L90
This is not the default since you can't always rely on the descriptions to be unique, but if you are setting them yourself you can use this for the equivalent of sample_id.
Do these work for what you want to do? Brad
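To illustrate the second option, here is a minimal sketch of how an output name could be chosen from a sample's configuration. It is not the code in upload_to_galaxy.py: the helper name `output_basename`, the underscore-joined fallback format, and the example flowcell name are all assumptions made for illustration. Only the keys `lane`, `barcode_id`, `description` and `description_filename` come from the discussion above.

```python
def output_basename(sample, flowcell):
    """Pick a base name for a distributed file (hypothetical helper).

    Uses the description only when description_filename is set to true;
    otherwise falls back to processing-style identifiers.
    """
    if sample.get("description_filename") and sample.get("description"):
        return sample["description"]
    # Assumed fallback format: lane, flowcell and (if present) barcode id.
    parts = [str(sample["lane"]), flowcell]
    if sample.get("barcode_id") is not None:
        parts.append(str(sample["barcode_id"]))
    return "_".join(parts)


# Example with a made-up flowcell name.
named = {"lane": 1, "description": "tumor_A", "description_filename": True}
default = {"lane": 1, "barcode_id": 3, "description": "tumor_A"}
print(output_basename(named, "FC1"))
print(output_basename(default, "FC1"))
```

With this shape, lane can stay a plain number in the config while the description carries the sample name into the distributed files.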
Brad,
it's the second case I'm referring to. I'll look it up. I take it then that you always refer to lane by number?
Thanks again,
Per
Per; The processing steps all use the lane + flowcell + barcode_id as the unique identifiers. The worry with using user-provided sample names is that they might not be unique and could have spaces or other special characters that could mess with all the software being called. The "description_filename" option is designed to let us use the guaranteed-to-work identifiers during processing, but then produce names that are meaningful to biologists when prepping the files. Let me know if this doesn't work for you, Brad
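The uniqueness and special-character concern can be checked before relying on descriptions as file names. The following is a rough sketch under stated assumptions, not part of bcbb: the function `check_descriptions` and the whitelist of allowed characters are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Assumed whitelist: letters, digits, underscore, dot and hyphen only.
SAFE = re.compile(r"^[A-Za-z0-9_.-]+$")


def check_descriptions(samples):
    """Return a list of problems found in the samples' description fields."""
    counts = Counter(s.get("description", "") for s in samples)
    problems = []
    for desc in sorted(counts):
        if counts[desc] > 1:
            problems.append(f"duplicate description: {desc!r}")
        if not SAFE.match(desc):
            problems.append(f"unsafe characters in description: {desc!r}")
    return problems


samples = [
    {"lane": 1, "description": "liver 01"},
    {"lane": 2, "description": "kidney"},
    {"lane": 3, "description": "kidney"},
]
for problem in check_descriptions(samples):
    print(problem)
```

An empty result means the descriptions are unique and contain only the whitelisted characters; a missing description is reported as unsafe because the empty string does not match the pattern.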
Brad,
your approach sounds reasonable, and yes, the issue of non-unique, weird, sample naming occurs here too (resolving these issues is a pain...). Disregard this pull request, I'll adopt the description_filename approach - the key, then, is to use unique 'description' fields.
Cheers,
Per
For analyses where samples are put in separate lanes, add optional 'sample_id' to lane_name