MultiQC / MegaQC

Web application to collect and visualise data across multiple MultiQC runs.
http://megaqc.info/
GNU General Public License v3.0

Bug report - sample_data sometimes references incorrect sample_data_type entries #530

Open djwooten opened 4 months ago

djwooten commented 4 months ago

Describe the bug

When handling a report, MegaQC loops over each data value and checks whether a matching SampleDataType already exists. However, the check matches on data_id alone and ignores data_section. As a result, if multiple report types (data sections) reuse the same data_id, the existing SampleDataType is reused even when its data_section is wrong for the incoming report.

This becomes a problem when querying historic results by data_section, because values end up attributed to the wrong section.

This is due to this code, which 1) checks sample_data_type to see whether the field's name has been seen before, and 2) if it has NOT been seen before, creates a new entry with data_key = "{}__{}".format(section, d_key).

But in step (1) it will reuse any key matching d_key, even if section does not match.
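
The two steps above can be sketched in a few lines. This is a minimal in-memory stand-in, not MegaQC's actual code: the function name get_or_create_type_buggy and the list-of-dicts "table" are hypothetical, and only the data_key format string comes from the description above.

```python
# Hypothetical in-memory stand-in for the sample_data_type table.
# The row index in the list plays the role of sample_data_type_id.

def get_or_create_type_buggy(types, section, d_key):
    """Mimic the described lookup: match on data_id only."""
    for type_id, row in enumerate(types):
        if row["data_id"] == d_key:  # step (1): section is never compared
            return type_id
    # step (2): only reached when d_key has not been seen in ANY section
    types.append({
        "data_id": d_key,
        "data_section": section,
        "data_key": "{}__{}".format(section, d_key),
    })
    return len(types) - 1


types = []
a_patient = get_or_create_type_buggy(types, "Pipeline_A_Result-plot", "patient_id")
a_count = get_or_create_type_buggy(types, "Pipeline_A_Result-plot", "variant_count")
b_patient = get_or_create_type_buggy(types, "Pipeline_B_Result-plot", "patient_id")
b_pvalue = get_or_create_type_buggy(types, "Pipeline_B_Result-plot", "pvalue")
```

Under these assumptions, b_patient resolves to the same row as a_patient, so the Pipeline B value is attached to a type whose data_section is Pipeline_A_Result-plot.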

To Reproduce

Here is a barebones multiqc_config and set of report files that can reveal the issue.

multiqc_config.yaml

custom_data:
  Pipeline_A_Result:
    file_format: "csv"
  Pipeline_B_Result:
    file_format: "csv"
sp:
  Pipeline_A_Result:
    fn: "*A_report.csv"
  Pipeline_B_Result:
    fn: "*B_report.csv"

A_report.csv (generated by Pipeline A)

sample_id,patient_id,variant_count
sample_1,patient_1,10

B_report.csv (generated by Pipeline B)

sample_id,patient_id,pvalue
sample_2,patient_2,0.0001

Steps: 1) Run pipeline A and submit its data to megaqc, 2) Run pipeline B and submit its data to megaqc

MegaQC erroneously records patient_id as coming only from Pipeline_A_Result, even though in the second report it comes from Pipeline_B_Result.

Specifically, the sample_data and sample_data_type tables will look like this:

sample_data_type

| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |

sample_data

| sample_data_id | report_id | sample_data_type_id | sample_id | value |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | patient_1 |
| 1 | 0 | 1 | 0 | 10 |
| 2 | 1 | 0 (*) | 1 | patient_2 |
| 3 | 1 | 2 | 1 | 0.0001 |

* NOTE: sample_data_type_id=0 refers to data_section=Pipeline_A_Result-plot, even though this value actually came from Pipeline_B.
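
To make the consequence concrete, the sketch below loads the rows shown above into an in-memory SQLite database and queries them by data_section. The column types, the omission of the schema column, and the helper values_for_section are assumptions for illustration only; MegaQC's real schema and query layer may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample_data_type (
    sample_data_type_id INTEGER PRIMARY KEY,
    data_id TEXT, data_section TEXT, data_key TEXT
);
CREATE TABLE sample_data (
    sample_data_id INTEGER PRIMARY KEY,
    report_id INTEGER, sample_data_type_id INTEGER,
    sample_id INTEGER, value TEXT
);
""")

# Rows exactly as they appear in the tables above (buggy state).
conn.executemany("INSERT INTO sample_data_type VALUES (?, ?, ?, ?)", [
    (0, "patient_id", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__patient_id"),
    (1, "variant_count", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__variant_count"),
    (2, "pvalue", "Pipeline_B_Result-plot", "Pipeline_B_Result-plot__pvalue"),
])
conn.executemany("INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)", [
    (0, 0, 0, 0, "patient_1"),
    (1, 0, 1, 0, "10"),
    (2, 1, 0, 1, "patient_2"),  # Pipeline B value pointing at a Pipeline A type
    (3, 1, 2, 1, "0.0001"),
])

def values_for_section(section):
    """Return (data_id, value) pairs whose type belongs to the given section."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t
          ON d.sample_data_type_id = t.sample_data_type_id
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()

b_rows = values_for_section("Pipeline_B_Result-plot")
a_rows = values_for_section("Pipeline_A_Result-plot")
```

With this data, the Pipeline B query returns only the pvalue row, while patient_2 shows up under Pipeline A.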

Expected behavior

data_id='patient_id' should appear in two separate sample_data_type rows, once with data_section='Pipeline_A_Result-plot' and once with data_section='Pipeline_B_Result-plot':

| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | patient_id | Pipeline_B_Result-plot | Pipeline_B_Result-plot__patient_id | null |
| 3 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |
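
One possible way to get this behaviour is to look up existing types by the combined data_key (section plus field name) rather than by data_id alone. The sketch below is an assumption about how a fix could look, using a plain dict in place of the database; get_or_create_type is a hypothetical helper, not an existing MegaQC function.

```python
def get_or_create_type(types, section, d_key):
    """Look up a type by section AND field name, creating it if missing.

    `types` is a dict keyed by data_key, standing in for sample_data_type.
    """
    data_key = "{}__{}".format(section, d_key)
    if data_key not in types:
        types[data_key] = {
            "sample_data_type_id": len(types),  # next free id in this toy table
            "data_id": d_key,
            "data_section": section,
            "data_key": data_key,
        }
    return types[data_key]["sample_data_type_id"]


fixed_types = {}
ids = [
    get_or_create_type(fixed_types, "Pipeline_A_Result-plot", "patient_id"),
    get_or_create_type(fixed_types, "Pipeline_A_Result-plot", "variant_count"),
    get_or_create_type(fixed_types, "Pipeline_B_Result-plot", "patient_id"),
    get_or_create_type(fixed_types, "Pipeline_B_Result-plot", "pvalue"),
]
```

Under this lookup, patient_id gets one row per section, and resubmitting a report for the same section reuses the existing row instead of creating a duplicate.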

System