Describe the bug
When handling a report, megaqc loops over each data value and checks whether a matching SampleDataType already exists. However, it checks only data_id and ignores data_section. As a result, if multiple report types (data sections) reuse the same data_id, megaqc reuses the existing SampleDataType even when its data_section is wrong for the incoming report.
This becomes problematic if you want to query for historic results based on data_section.
This is due to this code, which:
1) Checks sample_data_type to see if the field's name has been seen before.
2) If it has NOT been seen before, creates a new entry with data_key = "{}__{}".format(section, d_key).
But in step (1) it will reuse any row matching d_key, even if section does not match.
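The two steps above can be illustrated with a minimal, self-contained sketch. This is not megaqc's actual code: the SampleDataType dataclass and the get_or_create_by_id_only helper are hypothetical stand-ins, with field names taken from the table columns in this report.

```python
from dataclasses import dataclass


@dataclass
class SampleDataType:
    # Hypothetical stand-in for the real model; only the fields
    # relevant to this report are included.
    data_id: str
    data_section: str
    data_key: str


def get_or_create_by_id_only(existing, section, d_key):
    """Mimic the reported behaviour: match on data_id alone."""
    for row in existing:
        if row.data_id == d_key:  # data_section is never compared
            return row
    # Step 2: nothing matched, so create a new entry for this section.
    row = SampleDataType(d_key, section, "{}__{}".format(section, d_key))
    existing.append(row)
    return row


types = []
a = get_or_create_by_id_only(types, "Pipeline_A_Result-plot", "patient_id")
b = get_or_create_by_id_only(types, "Pipeline_B_Result-plot", "patient_id")

print(b.data_section)  # Pipeline_A_Result-plot
print(len(types))      # 1
```

The second call asks for the Pipeline B section but gets back the row created for Pipeline A, which is the mismatch described in this report.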
To Reproduce
Here is a barebones multiqc_config.yaml and a set of report files (A_report.csv, generated by Pipeline A, and B_report.csv, generated by Pipeline B) that reveal the issue.
Steps:
1) Run pipeline A and submit its data to megaqc.
2) Run pipeline B and submit its data to megaqc.
megaqc erroneously records every patient_id value as coming from Pipeline_A_Result, even though in one case it comes from Pipeline_B_Result.
Specifically, the sample_data_type and sample_data tables will look like this:

sample_data_type

| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |

sample_data

| sample_data_id | report_id | sample_data_type_id | sample_id | value |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | patient_1 |
| 1 | 0 | 1 | 0 | 10 |
| 2 | 1 | 0 (*) | 1 | patient_2 |
| 3 | 1 | 2 | 1 | 0.0001 |

* NOTE: sample_data_type_id=0 refers to data_section=Pipeline_A_Result-plot, even though this value actually came from Pipeline_B.
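To show how this affects queries by data_section, here is a small sketch that loads the rows above into an in-memory SQLite database and filters on data_section. The table layout is copied from this report, not from megaqc's real schema: the schema column is omitted, all values are stored as text, and values_for_section is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample_data_type (
        sample_data_type_id INTEGER PRIMARY KEY,
        data_id TEXT, data_section TEXT, data_key TEXT
    );
    CREATE TABLE sample_data (
        sample_data_id INTEGER PRIMARY KEY,
        report_id INTEGER, sample_data_type_id INTEGER,
        sample_id INTEGER, value TEXT
    );
    """
)

# Rows exactly as listed in the tables of this report.
conn.executemany(
    "INSERT INTO sample_data_type VALUES (?, ?, ?, ?)",
    [
        (0, "patient_id", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__patient_id"),
        (1, "variant_count", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__variant_count"),
        (2, "pvalue", "Pipeline_B_Result-plot", "Pipeline_B_Result-plot__pvalue"),
    ],
)
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, 0, "patient_1"),
        (1, 0, 1, 0, "10"),
        (2, 1, 0, 1, "patient_2"),  # came from Pipeline B, but points at type 0
        (3, 1, 2, 1, "0.0001"),
    ],
)


def values_for_section(section):
    """Return (data_id, value) pairs whose data type belongs to `section`."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t
          ON d.sample_data_type_id = t.sample_data_type_id
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()


section_a = values_for_section("Pipeline_A_Result-plot")
section_b = values_for_section("Pipeline_B_Result-plot")
print(section_a)
print(section_b)
```

With these rows, the Pipeline B query returns only the pvalue entry, while patient_2 shows up under Pipeline A.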
Expected behavior
data_id='patient_id' should appear in two separate sample_data_type rows: once with data_section='Pipeline_A_Result-plot' and once with data_section='Pipeline_B_Result-plot'.
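One possible way to get this behavior is to key the lookup on both data_section and data_id. The sketch below assumes a plain dictionary in place of the database table, and get_or_create is a hypothetical helper; it is not a patch against megaqc and not necessarily how the maintainers would fix it.

```python
def get_or_create(types, section, d_key):
    """Look up a data type by (data_section, data_id), creating it if absent."""
    key = (section, d_key)
    if key not in types:
        # Same data_key format as quoted in this report.
        types[key] = "{}__{}".format(section, d_key)
    return types[key]


types = {}
for section in ("Pipeline_A_Result-plot", "Pipeline_B_Result-plot"):
    get_or_create(types, section, "patient_id")

# Sections in which patient_id is now registered.
sections = sorted(s for (s, d) in types if d == "patient_id")
print(sections)  # ['Pipeline_A_Result-plot', 'Pipeline_B_Result-plot']
```

Because the section is part of the key, the same data_id submitted under a different section gets its own entry.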
System