broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing

Two small bugfixes to 02-EvidenceQC #666

Closed by RCollins13 2 months ago

RCollins13 commented 2 months ago

When running EvidenceQC.wdl on ~30k samples from the NIH AllOfUs cohort, I encountered two unrelated issues with the MakeQcTable task in EvidenceQC.wdl:

  1. EvidenceQC.wdl supports optionally disabling VCF QC, but the read_all_outlier() function in make_evidence_qc_table.py exits with an error when there are exactly zero outlier samples:
    Traceback (most recent call last):
      File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 269, in <module>
        main()
      File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 253, in main
        merge_evidence_qc_table(
      File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 178, in merge_evidence_qc_table
        df_total_high_outliers = read_all_outlier(df_manta_high_outlier, df_melt_high_outlier, df_wham_high_outlier, "high")
      File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 150, in read_all_outlier
        all_outliers_df.columns = [ID_COL, outlier_type + "_overall_outliers"]
      File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/generic.py", line 5588, in __setattr__
        return object.__setattr__(self, name, value)
      File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
      File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/generic.py", line 769, in _set_axis
        self._mgr.set_axis(axis, labels)
      File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 214, in set_axis
        self._validate_set_axis(axis, new_labels)
      File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/internals/base.py", line 69, in _validate_set_axis
        raise ValueError(
    ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

I solved this by adding a check for zero outliers; in that case the function returns an empty dataframe with the expected headers, which allows the rest of the script to run successfully.
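For reference, here is a minimal sketch of that kind of guard. It is not the actual patch: the helper name `outlier_counts_to_df`, the input dict of per-sample counts, and the value `ID_COL = "ID"` are assumptions made for illustration; only the column naming (`outlier_type + "_overall_outliers"`) is taken from the traceback above.

```python
import pandas as pd

ID_COL = "ID"  # assumed value; the real script defines its own ID_COL


def outlier_counts_to_df(counts, outlier_type):
    """Hypothetical helper: turn {sample_id: n_outlier_calls} into a two-column table."""
    columns = [ID_COL, outlier_type + "_overall_outliers"]
    if len(counts) == 0:
        # With no outliers, from_dict(...).reset_index() produces a single
        # column, so assigning two column names fails with a length mismatch.
        # Return an empty frame that already carries the expected headers.
        return pd.DataFrame(columns=columns)
    df = pd.DataFrame.from_dict(counts, orient="index").reset_index()
    df.columns = columns
    return df


# Zero outliers: empty table, but with both expected columns present.
empty_df = outlier_counts_to_df({}, "high")

# Non-empty case behaves as before.
filled_df = outlier_counts_to_df({"sample_a": 2, "sample_b": 1}, "high")
```

Because the empty frame has the same headers as the non-empty one, downstream code that selects or merges on these columns does not need its own special case.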

  2. Dataframe merging in merge_evidence_qc_table() fails for cohorts where every sample has an integer ID. This seems to be because pandas stores some of the ID columns as dtype object and others as dtype int64, which leads to this error:
    Traceback (most recent call last):
      File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 269, in <module>
        main()
      File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 253, in main
        merge_evidence_qc_table(
      File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 190, in merge_evidence_qc_table
        output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
      File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 190, in <lambda>
        output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
      File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 107, in merge
        op = _MergeOperation(
      File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 704, in __init__
        self._maybe_coerce_merge_keys()
      File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1257, in _maybe_coerce_merge_keys
        raise ValueError(msg)
    ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

I fixed this by forcing all ID columns to dtype object prior to merging, which resolves this error.
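A small sketch of the idea, under stated assumptions: the two toy tables, their non-ID column names, and `ID_COL = "ID"` are made up for illustration, and only the `reduce`/`pd.merge` line mirrors the traceback above. Note that this sketch also converts the IDs to strings before storing them as object, so that `1001` and `"1001"` compare equal; that extra step is my assumption and may differ from the actual change.

```python
from functools import reduce

import pandas as pd

ID_COL = "ID"  # assumed value; the real script defines its own ID_COL

# Toy inputs: integer-only sample IDs end up as int64 in one table
# and as strings (dtype object) in another.
df_int = pd.DataFrame({ID_COL: [1001, 1002], "coverage": [30.1, 29.8]})
df_obj = pd.DataFrame({ID_COL: ["1001", "1002"], "high_overall_outliers": [1, 0]})

dfs = [df_int, df_obj]

# Normalize every ID column to string values held in an object column,
# so all merge keys share one dtype and one representation.
dfs = [
    df.assign(**{ID_COL: df[ID_COL].astype(str).astype(object)})
    for df in dfs
]

# Same merge pattern as in the traceback; it no longer mixes int64 and object keys.
output_df = reduce(
    lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs
)
```

Normalizing once, just before the `reduce`, keeps the cast in a single place instead of scattering it across each function that reads a table.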

(Both of these were encountered when using Docker image us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2024-03-04-v0.28.4-beta-f0ad3f0f, but based on the edit history of make_evidence_qc_table.py, my impression is that both issues are still present on the current main branch.)

Thanks! Ryan