Missing columns in final annotations.parquet lead to errors in seed gene pipeline. Failing safe_merge.

Jonas-B-Frank commented 2 months ago

Processing my input data with your preprocessing and annotation pipelines, I get the input files for running the seed gene discovery pipeline (except the phenotypes parquet, which I created).

For the seed gene discovery pipeline, I used the anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet" file as annotations.parquet (with #73 integrated into dense_gt.py to solve #70). I get the following error:

Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/dask/backends.py", line 135, in wrapper
    return func(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 593, in read_parquet
    meta, index, columns = set_index_columns(meta, index, columns, auto_index_allowed)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 1518, in set_index_columns
    raise ValueError(
ValueError: The following columns were not found in the dataset {'MAF_MB', 'CADD_PHRED', 'Absplice_DNA', 'MAF'}
The following columns were found Index(['DeepRipe_plus_QKI_clip_k5', 'alphamissense', 'DeepRipe_plus_TARDBP_parclip',
       'Consequence_inframe_insertion', 'DeepSEA_PC_4', 'combined_UKB_NFE_AF_MB', 'id', 'gene_id',
       'DeepSEA_PC_2', 'combined_UKB_NFE_MAF', 'Consequence_missense_variant', 'DeepRipe_plus_QKI_parclip',
       'Consequence_splice_donor_variant', 'combined_UKB_NFE_AF', 'condel_score', 'Consequence_start_lost',
       'CADD_raw', 'DeepRipe_plus_HNRNPD_parclip', 'Consequence_protein_altering_variant',
       'Consequence_stop_lost', 'Consequence_inframe_deletion', 'polyphen_score', 'PrimateAI_score', 'AF',
       'DeepRipe_plus_ELAVL1_parclip', 'DeepSEA_PC_5', 'DeepRipe_plus_KHDRBS1_clip_k5',
       'Consequence_splice_region_variant', 'DeepSEA_PC_1', 'Consequence_splice_acceptor_variant',
       'Consequence_stop_gained', 'DeepSEA_PC_3', 'DeepRipe_plus_QKI_lip_hg2', 'DeepSEA_PC_6',
       'DeepRipe_plus_MBNL1_parclip', 'Consequence_frameshift_variant', 'SpliceAI_delta_score', 'sift_score'],
      dtype='object')

The CADD_PHRED and Absplice_DNA columns are missing from anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet". The CADD_PHRED column was present in the anno_dir / (source_variant_file_pattern + "_vep_anno.tsv") files. The Absplice_DNA column was present in the score_file = anno_tmp_dir / "abSplice_score_file.parquet" file.
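The check that fails here can be reproduced before starting the pipeline. The following is a minimal sketch, not part of deeprvat: `missing_columns` is a hypothetical helper, and `available` is a shortened excerpt of the column list from the error above.

```python
def missing_columns(required, available):
    """Return the required column names that are absent from `available`, sorted."""
    return sorted(set(required) - set(available))


# Column names requested by the seed gene discovery config (taken from the error above).
required = ["MAF", "MAF_MB", "CADD_PHRED", "Absplice_DNA"]

# Shortened excerpt of the columns that the error reports as present.
available = [
    "id",
    "gene_id",
    "AF",
    "combined_UKB_NFE_AF",
    "combined_UKB_NFE_MAF",
    "combined_UKB_NFE_AF_MB",
    "CADD_raw",
    "sift_score",
]

print(missing_columns(required, available))
```

With a real file, the `available` list could be taken from the parquet schema, for example via `pyarrow.parquet.read_schema(path).names`, assuming pyarrow is installed in the environment.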

Out of curiosity, I also ran the pipeline again with CADD_PHRED and Absplice_DNA deleted from the config, specifying combined_UKB_NFE_MAF instead of MAF (also in the snakefile) and combined_UKB_NFE_AF_MB instead of MAF_MB (only in the data section of the yaml file, which I forgot for the first run). This run fails with:

Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deepRVAT_new/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 592, in make_dataset
    _, ds = make_dataset_(
  File "PATH/deepRVAT_new/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 543, in make_dataset_
    dataset = DenseGTDataset(
  File "PATH/deepRVAT_new/deeprvat/data/dense_gt.py", line 212, in __init__
    self.setup_variants(min_common_variant_count, min_common_af, variants)
  File "PATH/deepRVAT_new/deeprvat/data/dense_gt.py", line 614, in setup_variants
    variants_with_af = safe_merge(
  File "PATH/deepRVAT_new/deeprvat/utils.py", line 259, in safe_merge
    raise RuntimeError(
RuntimeError: Merged dataframe has 29960009 rows, left dataframe has 33883556

The input files anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet" and variants.parquet have 35814310 and 33883556 rows, respectively. I think something might go wrong when filtering the af_annotation before merging. Looking forward to your insights!
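One way to narrow down a row-count mismatch like this is to compare the ids on both sides of the merge. The sketch below assumes the merge is an inner join on the variant id, which the thread does not confirm. `merge_diagnostics` is a hypothetical helper and the ids are toy data, not values from the actual files.

```python
from collections import Counter


def merge_diagnostics(left_ids, right_ids):
    """Summarize why an inner merge on an id column can change the left row count."""
    right_counts = Counter(right_ids)
    # Left ids with no partner on the right are lost in an inner merge.
    unmatched = [i for i in left_ids if i not in right_counts]
    # Right ids that occur more than once would multiply left rows instead.
    duplicated = sorted(i for i, n in right_counts.items() if n > 1)
    # Each left row yields one merged row per matching right row.
    merged_rows = sum(right_counts[i] for i in left_ids)
    return {
        "left_rows": len(left_ids),
        "merged_rows": merged_rows,
        "left_ids_without_match": unmatched,
        "duplicated_right_ids": duplicated,
    }


# Toy data: variant ids on the left, annotation ids on the right.
variant_ids = [1, 2, 3, 4, 5]
annotation_ids = [1, 2, 2, 4]

report = merge_diagnostics(variant_ids, annotation_ids)
print(report)
```

Under that assumption, a merged frame with fewer rows than the left frame means that some variant ids have no matching row in the filtered annotations.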

Marcel-Mueck commented 2 months ago

Hey Jonas, thank you for your comment. Just to clarify, when you said you reran the pipeline, did you mean the annotation pipeline or one of the model training pipelines? The column names are set in the annotation pipeline using the pipelines/config/annotation_colnames_filling_values.yaml file, which has the syntax:

original_name:
   deeprvat_input_name:
      filling value (used if the value from the annotation model is NA)

for each of the annotation columns.

Jonas-B-Frank commented 2 months ago

Hey Marcel,

I reran the seed gene discovery pipeline after eliminating CADD_PHRED and Absplice_DNA. I did not rerun the annotations pipeline. I ran the annotations pipeline with the config which was introduced in #54. There, the CADD_PHRED and the Absplice_DNA columns are missing - if they are not specified for renaming / filling, are those columns dropped? So do I have to integrate them in the pipelines/config/annotation_colnames_filling_values.yaml? Thanks for your quick response, best

Marcel-Mueck commented 2 months ago

You are right, these columns are missing from pipelines/config/annotation_colnames_filling_values.yaml and are therefore dropped in the last step of the annotation pipeline. I will update the file to include them. The combined_UKB_NFE_MAF and combined_UKB_NFE_AF_MB columns should also be renamed in that yaml file to MAF and MAF_MB, respectively.
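The behaviour described in this thread (rename, fill NA values, drop everything that is not listed) can be illustrated with a toy sketch. It is not the annotation pipeline's code: `apply_column_config` is a hypothetical helper, rows are plain dicts with `None` standing in for NA, and `unlisted_score` is a made-up column name.

```python
def apply_column_config(rows, config):
    """Rename and fill the columns listed in `config`; drop all other columns.

    `config` maps original_name -> {deeprvat_input_name: filling_value}.
    """
    result = []
    for row in rows:
        new_row = {}
        for original_name, target in config.items():
            # Each entry holds exactly one new name with its filling value.
            new_name, fill_value = next(iter(target.items()))
            value = row.get(original_name)
            new_row[new_name] = fill_value if value is None else value
        result.append(new_row)
    return result


# Mapping in the original_name -> deeprvat_input_name -> filling value layout.
config = {
    "CADD_PHRED": {"CADD_PHRED": 0},
    "maf": {"MAF": 0},
    "maf_mb": {"MAF_MB": 10000},
}

# One toy row; "unlisted_score" is not in the config and is therefore dropped.
rows = [{"CADD_PHRED": 12.3, "maf": 0.001, "maf_mb": None, "unlisted_score": 1.5}]

result = apply_column_config(rows, config)
print(result)
```

In this sketch a column survives only if it has an entry in the mapping, which matches why CADD_PHRED and Absplice_DNA disappeared from the final parquet file.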

Jonas-B-Frank commented 2 months ago

I can confirm that adding

'CADD_PHRED' :
   'CADD_PHRED': 0
'AbSplice_DNA' :
   'AbSplice_DNA': 0

and adjusting

'maf_mb' :
   'MAF_MB' : 10000
'maf' :
   'MAF' : 0

in pipelines/config/annotation_colnames_filling_values.yaml solved the problem of the missing columns.

The merge, on the other hand, still throws the error. If you like, I can open a separate issue for that.

Marcel-Mueck commented 2 months ago

Thank you for testing this. The corresponding PR has already been created.

Marcel-Mueck commented 2 months ago

The PR has been merged into main, so I will close this issue :)

PMBio / deeprvat, issue #74