Closed samuelklee closed 7 years ago
Just to record and clarify a few things we discussed:
-I think that all files we generate for individual case samples---"ReadCountCollection" files for coverage profiles, "AllelicCountCollection" files for het pulldowns, and segment files---should contain the sample name as metadata in a header comment with a common tag (e.g., #sampleName = ...
). Currently, these sample names are stored in column headers, in the fields of a SAMPLE
column, or not at all, depending on the type of file. This would drastically simplify the use of the SampleNameFinder
class, which would basically only contain a single method to parse this header comment and return the name.
-CLIs that generate a file from an input BAM (CalculateTargetCoverage
, GetHetCoverage
, etc.) should take the sample name from that BAM by default. Since these are the first steps in our workflows, we could also optionally allow the user to specify a sample name different from that in the BAM.
-Subsequent CLIs should then take the sample name from the header comment.
-CLIs that take multiple non-BAM input files should check for consistency of the sample names as part of the argument validation step.
-CLIs that output the sample name in plots should derive these from the header comment.
-For files that contain data from multiple samples (e.g., the output of CombineReadCounts
), we can probably leave the sample names in the column headers, but it would be nice to output the type of data stored in a header comment as well (e.g., PCOV
or RAW
). At some point I think we should restrict to RAW
output only, see #615.
-Entity names specified by the input file for the WDLs can be separate from the BAM sample names by default. However, if we do allow the user to optionally specify sample names as described in the first bullet point, we can set up the WDL to pass the entity names.
I'll leave the decisions up to you, @asmirnov239. Just try to implement something that is as simple as possible and consistent!
Moved to a new issue #751
@asmirnov239 If you decide to tackle this, it occurs to me that it might be better to just store the sample name in a header comment.