Bug report - sample_data sometimes references incorrect sample_data_type entries

Describe the bug

When handling a report, MegaQC loops over each data value and checks whether a matching SampleDataType already exists. However, the check matches on data_id alone and ignores data_section. As a result, if multiple report types (data sections) use the same data_id, the existing SampleDataType is reused even when its data_section does not match the incoming report.

This becomes problematic if you want to query for historic results based on data_section.

This is due to this code, which 1) checks sample_data_type to see whether the field's name has been seen before, and 2) if it has NOT been seen before, creates a new entry with data_key = "{}__{}".format(section, d_key).

However, step (1) reuses any entry matching d_key, even if its section does not match.
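The difference between the two lookups can be sketched in plain Python. This is a minimal illustration and not MegaQC's actual code: the SampleDataType dataclass, both get_or_create_* helpers, ingest and REPORTS are hypothetical names, and an in-memory list stands in for the sample_data_type table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleDataType:
    # Hypothetical stand-in for a row of the sample_data_type table.
    data_id: str
    data_section: str
    data_key: str


def _create(types, section, d_key):
    # Same data_key format as quoted in the report.
    new = SampleDataType(d_key, section, "{}__{}".format(section, d_key))
    types.append(new)
    return new


def get_or_create_by_id(types, section, d_key):
    """Mirrors the reported behaviour: match on data_id only."""
    for t in types:
        if t.data_id == d_key:
            return t
    return _create(types, section, d_key)


def get_or_create_by_section_and_id(types, section, d_key):
    """Proposed behaviour: match on data_section AND data_id."""
    for t in types:
        if t.data_id == d_key and t.data_section == section:
            return t
    return _create(types, section, d_key)


def ingest(get_or_create, reports):
    """Return the created types and (incoming section, key, stored section) triples."""
    types, assignments = [], []
    for section, keys in reports:
        for d_key in keys:
            t = get_or_create(types, section, d_key)
            assignments.append((section, d_key, t.data_section))
    return types, assignments


# The two reports from this issue, submitted in order A then B.
REPORTS = [
    ("Pipeline_A_Result-plot", ["patient_id", "variant_count"]),
    ("Pipeline_B_Result-plot", ["patient_id", "pvalue"]),
]

buggy_types, buggy_assignments = ingest(get_or_create_by_id, REPORTS)
fixed_types, fixed_assignments = ingest(get_or_create_by_section_and_id, REPORTS)
```

With the id-only lookup, the second report's patient_id is attached to the Pipeline_A_Result-plot entry; with the combined lookup, each section gets its own entry.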

To Reproduce

Here is a barebones multiqc_config and a set of report files that reproduce the issue.

multiqc_config.yaml

custom_data:
  Pipeline_A_Result:
    file_format: "csv"
  Pipeline_B_Result:
    file_format: "csv"
sp:
  Pipeline_A_Result:
    fn: "*A_report.csv"
  Pipeline_B_Result:
    fn: "*B_report.csv"

A_report.csv (generated by Pipeline A)

sample_id,patient_id,variant_count
sample_1,patient_1,10

B_report.csv (generated by Pipeline B)

sample_id,patient_id,pvalue
sample_2,patient_2,0.0001

Steps: 1) Run pipeline A and submit its data to MegaQC. 2) Run pipeline B and submit its data to MegaQC.

MegaQC erroneously records every patient_id value as coming from Pipeline_A_Result, even though the value in the second report comes from Pipeline_B_Result.

Specifically, the sample_data and sample_data_type tables will look like this:

sample_data_type

sample_data_type_id	data_id	data_section	data_key	schema
0	patient_id	Pipeline_A_Result-plot	Pipeline_A_Result-plot__patient_id	null
1	variant_count	Pipeline_A_Result-plot	Pipeline_A_Result-plot__variant_count	null
2	pvalue	Pipeline_B_Result-plot	Pipeline_B_Result-plot__pvalue	null

sample_data

sample_data_id	report_id	sample_data_type_id	sample_id	value
0	0	0	0	patient_1
1	0	1	0	10
2	1	0 (*)	1	patient_2
3	1	2	1	0.0001

* NOTE: sample_data_type_id=0 refers to data_section=Pipeline_A_Result-plot, even though this value actually came from Pipeline_B.
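To illustrate why this matters when querying by data_section, the sketch below loads the rows shown above into an in-memory SQLite database and selects all values for one section. It is an illustration only: the table layout is simplified (the schema column is omitted, values are stored as text), and values_for_section is a hypothetical helper, not a MegaQC function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample_data_type (
        sample_data_type_id INTEGER PRIMARY KEY,
        data_id TEXT,
        data_section TEXT,
        data_key TEXT
    );
    CREATE TABLE sample_data (
        sample_data_id INTEGER PRIMARY KEY,
        report_id INTEGER,
        sample_data_type_id INTEGER,
        sample_id INTEGER,
        value TEXT
    );
    """
)

# Rows exactly as listed in the tables above (current, incorrect state).
conn.executemany(
    "INSERT INTO sample_data_type VALUES (?, ?, ?, ?)",
    [
        (0, "patient_id", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__patient_id"),
        (1, "variant_count", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__variant_count"),
        (2, "pvalue", "Pipeline_B_Result-plot", "Pipeline_B_Result-plot__pvalue"),
    ],
)
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, 0, "patient_1"),
        (1, 0, 1, 0, "10"),
        (2, 1, 0, 1, "patient_2"),
        (3, 1, 2, 1, "0.0001"),
    ],
)


def values_for_section(conn, section):
    """Return (data_id, value) pairs whose data type belongs to the given section."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t USING (sample_data_type_id)
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()


section_a = values_for_section(conn, "Pipeline_A_Result-plot")
section_b = values_for_section(conn, "Pipeline_B_Result-plot")
```

Here section_b contains only the pvalue row, while patient_2 shows up in section_a, so a historic query for Pipeline_B_Result-plot silently loses the patient_id value.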

Expected behavior

data_id='patient_id' should appear in two separate sample_data_type rows: once with data_section='Pipeline_A_Result-plot' and once with data_section='Pipeline_B_Result-plot'.

sample_data_type_id	data_id	data_section	data_key	schema
0	patient_id	Pipeline_A_Result-plot	Pipeline_A_Result-plot__patient_id	null
1	variant_count	Pipeline_A_Result-plot	Pipeline_A_Result-plot__variant_count	null
2	patient_id	Pipeline_B_Result-plot	Pipeline_B_Result-plot__patient_id	null
3	pvalue	Pipeline_B_Result-plot	Pipeline_B_Result-plot__pvalue	null

System

MegaQC: 0.3.0

Bug report - sample_data sometimes references incorrect sample_data_type entries #530