Two small bugfixes to 02-EvidenceQC

When running EvidenceQC.wdl on ~30k samples from the NIH AllOfUs cohort, I encountered two unrelated issues with the MakeQcTable task:

EvidenceQC.wdl supports optionally disabling VCF QC, but the read_all_outlier() function in make_evidence_qc_table.py exits with an error when there are exactly zero outlier samples:

Traceback (most recent call last):
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 269, in <module>
    main()
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 253, in main
    merge_evidence_qc_table(
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 178, in merge_evidence_qc_table
    df_total_high_outliers = read_all_outlier(df_manta_high_outlier, df_melt_high_outlier, df_wham_high_outlier, "high")
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 150, in read_all_outlier
    all_outliers_df.columns = [ID_COL, outlier_type + "_overall_outliers"]
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/generic.py", line 5588, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/generic.py", line 769, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 214, in set_axis
    self._validate_set_axis(axis, new_labels)
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/internals/base.py", line 69, in _validate_set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

I solved this by adding a check for zero outliers, in which case the function returns an empty dataframe with the expected headers. This allows the rest of the script to run successfully.
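For reviewers, here is a minimal sketch of the kind of guard described above. It is not the code in this PR: read_all_outlier_sketch is a hypothetical stand-in for read_all_outlier(), the value of ID_COL is assumed, and the counting logic is only my approximation of what the real function does.

```python
import pandas as pd

ID_COL = "#ID"  # assumed value; the real constant is defined in make_evidence_qc_table.py


def read_all_outlier_sketch(outlier_dfs, outlier_type):
    """Count, per sample, how many per-caller outlier tables list that sample."""
    count_col = outlier_type + "_overall_outliers"

    # Keep only non-empty tables that actually carry an ID column.
    id_series = [
        df[ID_COL] for df in outlier_dfs
        if ID_COL in df.columns and not df.empty
    ]

    if not id_series:
        # Zero outliers: return an empty frame that still has the expected
        # headers, so downstream merges on ID_COL keep working.
        return pd.DataFrame({
            ID_COL: pd.Series(dtype=object),
            count_col: pd.Series(dtype="int64"),
        })

    counts = pd.concat(id_series, ignore_index=True).value_counts()
    # Build the frame explicitly instead of assigning to .columns, so the
    # number of columns never depends on the shape of an intermediate result.
    return pd.DataFrame({ID_COL: counts.index, count_col: counts.values})
```

Calling this with three empty tables returns a zero-row dataframe with both headers instead of raising the length-mismatch ValueError.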

Dataframe merging in merge_evidence_qc_table() fails for cohorts where every sample has an integer ID. This seems to be because pandas assigns dtype object to some of the ID columns and dtype int64 to others, which leads to this error:

Traceback (most recent call last):
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 269, in <module>
    main()
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 253, in main
    merge_evidence_qc_table(
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 190, in merge_evidence_qc_table
    output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 190, in <lambda>
    output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 107, in merge
    op = _MergeOperation(
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 704, in __init__
    self._maybe_coerce_merge_keys()
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1257, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

I fixed this by casting all ID columns to dtype object prior to merging.
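As an illustration of the idea, here is a rough sketch of normalizing the ID dtype before the reduce/merge step. The helper names (normalize_id, merge_on_id) and the value of ID_COL are hypothetical and not taken from the script. The sketch also converts IDs to strings before casting to object so that 1001 and "1001" land on the same row, which is an assumption that goes slightly beyond the change described above.

```python
from functools import reduce

import pandas as pd

ID_COL = "#ID"  # assumed value; the real constant is defined in make_evidence_qc_table.py


def normalize_id(df):
    """Return a copy of df whose ID column holds strings stored as dtype object."""
    out = df.copy()
    # str first so integer and string IDs compare equal, then object so every
    # table presents the same key dtype to pd.merge.
    out[ID_COL] = out[ID_COL].astype(str).astype(object)
    return out


def merge_on_id(dfs):
    """Outer-merge all tables on ID_COL after normalizing the key dtype."""
    normalized = [normalize_id(df) for df in dfs]
    return reduce(
        lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"),
        normalized,
    )
```

With one table whose IDs were read as int64 and another whose IDs were read as strings, merge_on_id returns a single row per sample rather than raising the int64/object ValueError.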

(Both of these were encountered when using Docker image us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2024-03-04-v0.28.4-beta-f0ad3f0f, but based on the edit history of make_evidence_qc_table.py, my impression is that both issues are still present on the current main branch.)

Thanks! Ryan
