broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License
33 stars 20 forks source link

Sample names with underscore or period get truncated. #636

Closed samuelklee closed 7 years ago

samuelklee commented 8 years ago

@asmirnov239 If you decide to tackle this, it occurs to me that it might be better to just store the sample name in a header comment.

samuelklee commented 8 years ago

Just to record and clarify a few things we discussed:

-I think that all files we generate for individual case samples---"ReadCountCollection" files for coverage profiles, "AllelicCountCollection" files for het pulldowns, and segment files---should contain the sample name as metadata in a header comment with a common tag (e.g., #sampleName = ...). Currently, these sample names are stored in column headers, in the fields of a SAMPLE column, or not at all, depending on the type of file. This would drastically simplify the use of the SampleNameFinder class, which would basically only contain a single method to parse this header comment and return the name.

-CLIs that generate a file from an input BAM (CalculateTargetCoverage, GetHetCoverage, etc.) should take the sample name from that BAM by default. Since these are the first steps in our workflows, we could also optionally allow the user to specify a sample name different from that in the BAM.

-Subsequent CLIs should then take the sample name from the header comment.

-CLIs that take multiple non-BAM input files should check for consistency of the sample names as part of the argument validation step.

-CLIs that output the sample name in plots should derive these from the header comment.

-For files that contain data from multiple samples (e.g., the output of CombineReadCounts), we can probably leave the sample names in the column headers, but it would be nice to output the type of data stored in a header comment as well (e.g., PCOV or RAW). At some point I think we should restrict to RAW output only, see #615.

-Entity names specified by the input file for the WDLs can be separate from the BAM sample names by default. However, if we do allow the user to optionally specify sample names as described in the first bullet point, we can set up the WDL to pass the entity names.

I'll leave the decisions up to you, @asmirnov239. Just try to implement something that is as simple as possible and consistent!

asmirnov239 commented 7 years ago

Moved to a new issue #751