Background:
The MiCall pipeline currently processes reads on a per-real-sample basis and outputs an assembled consensus sequence for each sample. Each run relies on SampleSheet.csv files for input and output details. A feature to merge samples, ideally across different runs, would simplify the downstream analysis.
Feature Description:
Introduce a merger tool that takes a .csv mapping file and generates a merged SampleSheet.csv, RunInfo.xml, and a duplicate of the input .csv for traceability. The mapping file correlates sample_name and run_folder with output_name, specifying the merging plan.
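As a rough illustration of the mapping file, the sketch below reads rows with the three columns named above and groups them by output_name. The function name `read_merge_plan`, the column order, and the example paths and sample names are assumptions made for illustration only, not part of an agreed format.

```python
import csv
import io
from collections import defaultdict


def read_merge_plan(handle):
    """Group (run_folder, sample_name) pairs by output_name.

    Hypothetical helper: assumes the mapping file has a header row with
    the columns sample_name, run_folder, and output_name.
    """
    plan = defaultdict(list)
    for row in csv.DictReader(handle):
        plan[row["output_name"]].append((row["run_folder"], row["sample_name"]))
    return dict(plan)


# Made-up mapping file: two samples from different runs merge into "merged1".
example = io.StringIO(
    "sample_name,run_folder,output_name\n"
    "S1,runs/A,merged1\n"
    "S7,runs/B,merged1\n"
    "S2,runs/A,merged2\n"
)
plan = read_merge_plan(example)
```

Keeping rows in file order means the "first associated run_folder" for each output_name is simply the first entry in its list.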
Feature Objectives:
Facilitate efficient sample mergers across different run folders.
Ensure consistency and traceability for merged samples.
Handle default values and conflicts in input .csv files.
Functional Requirements:
Input to the tool:
Path to the mapping .csv file.
Path to the output folder.
Outputs of the tool:
SampleSheet.csv with merged output_name records.
RunInfo.xml copied from the first associated run_folder.
A copy of the input .csv file, to trace the origins of merged data.
Conflict resolution strategy, with a strict mode option (--strict flag).
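The inputs and the --strict option could map onto a command line roughly as follows. This is only a sketch: the program name `merge_samples` and the positional argument names are hypothetical, and only the --strict flag comes from this description.

```python
import argparse


def build_parser():
    """Build a command-line parser for the proposed merger tool (sketch)."""
    parser = argparse.ArgumentParser(
        prog="merge_samples",  # hypothetical name
        description="Merge samples across run folders from a mapping file.",
    )
    parser.add_argument("mapping_csv", help="path to the mapping .csv file")
    parser.add_argument("output_folder", help="path to the output folder")
    parser.add_argument(
        "--strict",
        action="store_true",
        help="fail on conflicting field values instead of using the first one",
    )
    return parser


# Parse an example argument list instead of sys.argv.
args = build_parser().parse_args(["plan.csv", "out", "--strict"])
```

With `action="store_true"`, omitting the flag leaves `args.strict` as False, which matches non-strict being the default behaviour.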
Conflict Resolution Rules:
project_name header field to follow the $current_date.merged pattern.
date header field to reflect the actual merge date.
All other fields should use the first observed value unless --strict is enabled.
Fields index and index2 should default to XXXXX.
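The rules above can be sketched as a small set of functions. Several details are assumptions: header fields are represented as plain dictionaries, $current_date is rendered in ISO format (YYYY-MM-DD), "default to XXXXX" is read as filling in missing or empty index values, and the names `MergeConflict`, `resolve_field`, `merged_header`, and `apply_index_defaults` are hypothetical.

```python
import datetime


class MergeConflict(Exception):
    """Raised in strict mode when inputs disagree on a field value."""


def resolve_field(name, values, strict=False):
    """Return the first observed value; report or reject disagreements."""
    distinct = list(dict.fromkeys(values))  # unique values, original order
    if len(distinct) > 1:
        if strict:
            raise MergeConflict(f"{name}: conflicting values {distinct!r}")
        print(f"conflict in {name}: {distinct!r}; using {distinct[0]!r}")
    return distinct[0]


def merged_header(headers, today, strict=False):
    """Combine header dictionaries from several sample sheets."""
    merged = {}
    field_names = list(dict.fromkeys(key for h in headers for key in h))
    for name in field_names:
        if name in ("project_name", "date"):
            continue  # these two are always regenerated below
        values = [h[name] for h in headers if name in h]
        merged[name] = resolve_field(name, values, strict)
    merged["project_name"] = f"{today.isoformat()}.merged"  # assumed date format
    merged["date"] = today.isoformat()
    return merged


def apply_index_defaults(row):
    """Fill missing or empty index fields with the XXXXX placeholder."""
    filled = dict(row)
    for name in ("index", "index2"):
        if not filled.get(name):
            filled[name] = "XXXXX"
    return filled
```

For example, `merged_header([{"workflow": "A"}, {"workflow": "B"}], datetime.date.today())` would print the conflict to stdout and keep "A", while the same call with `strict=True` would raise `MergeConflict`.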
Implementation Tasks:
[X] Develop a merging script for the underlying sample files.
[ ] Develop logic to parse the input .csv and handle row defaults.
[ ] Implement conflict detection logic with stdout reporting.
[ ] Create file generation procedures for SampleSheet.csv and RunInfo.xml.
[ ] Build merging algorithm to create a consolidated .csv from the mapping file.
[ ] Add a non-strict mode for conflict resolution and make it the default, with --strict enabling strict mode.
[ ] Write unit tests to validate merging logic and conflict handling.
[ ] Add documentation for the merger tool usage and features.