Closed Jonas-B-Frank closed 2 months ago
Hey Jonas, thank you for your comment. Just to clarify: when you said you reran the pipeline, did you mean the annotation pipeline or one of the model training pipelines? The column names are set in the annotation pipeline using the pipelines/config/annotation_colnames_filling_values.yaml file, which has the following syntax for each of the annotation columns:
original_name:
deeprvat_input_name:
filling value (used if the value of the annotation model is NA)
Hey Marcel,
I reran the seed gene discovery pipeline after eliminating CADD_PHRED and Absplice_DNA. I did not rerun the annotations pipeline.
I ran the annotations pipeline with the config which was introduced in #54. There, the CADD_PHRED and the Absplice_DNA columns are missing - if they are not specified for renaming / filling, are those columns dropped? So do I have to integrate them in the pipelines/config/annotation_colnames_filling_values.yaml?
Thanks for your quick response, best
You are right, these values are missing in pipelines/config/annotation_colnames_filling_values.yaml and are therefore dropped in the last step of the annotation pipeline. I will update the file to have the columns integrated. The combined_UKB_NFE_MAF and combined_UKB_NFE_AF_MB columns should also be renamed in that yaml file to MAF and MAF_MB, respectively.
I can confirm that adding
'CADD_PHRED' : 'CADD_PHRED': 0
'AbSplice_DNA' : 'AbSplice_DNA': 0
and adjusting
'maf_mb' : 'MAF_MB' : 10000
'maf' : 'MAF' : 0
in pipelines/config/annotation_colnames_filling_values.yaml solved the problem of the missing columns.
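For readers following along, here is a minimal sketch of how a rename-and-fill config of this kind could behave. It is not the annotation pipeline's actual code: the function name `apply_colname_config`, the nested-dict shape of the config, and the use of `None` for NA are all assumptions made for illustration. It uses the entries quoted above and shows why a column that is not listed ends up dropped.

```python
def apply_colname_config(rows, config):
    """Rename columns, fill missing values, and drop unlisted columns.

    rows   -- list of dicts, one per variant (hypothetical stand-in for a dataframe)
    config -- {original_name: {deeprvat_input_name: filling_value}} (assumed shape)
    """
    result = []
    for row in rows:
        new_row = {}
        for original_name, mapping in config.items():
            # Each entry is assumed to hold exactly one new name and one fill value.
            new_name, fill_value = next(iter(mapping.items()))
            value = row.get(original_name)
            new_row[new_name] = fill_value if value is None else value
        # Columns that are not in the config are never copied, so they are dropped.
        result.append(new_row)
    return result


config = {
    "CADD_PHRED": {"CADD_PHRED": 0},
    "AbSplice_DNA": {"AbSplice_DNA": 0},
    "maf_mb": {"MAF_MB": 10000},
    "maf": {"MAF": 0},
}
rows = [
    {"CADD_PHRED": None, "AbSplice_DNA": 0.3, "maf": 0.01, "maf_mb": None, "CADD_raw": 1.2},
]
filled = apply_colname_config(rows, config)
print(filled)
```

In this sketch `CADD_raw` disappears from the output because it has no entry in the config, which matches the behaviour described for CADD_PHRED and Absplice_DNA before they were added.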
The merge, on the other hand, throws an error regardless. If wanted, I can of course open a separate issue for that.
Thank you for testing this, the corresponding PR was already created.
The PR was integrated into main, therefore I will close this issue :)
Processing my input data with your preprocessing and annotation pipelines, I get the input files for running the seed gene discovery pipeline (except the phenotypes parquet, which I created).
To run the seed gene discovery pipeline (with #73 integrated in dense_gt.py to solve #70), I used the
anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet"
file as annotations.parquet. I get the following error:
Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/dask/backends.py", line 135, in wrapper
    return func(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 593, in read_parquet
    meta, index, columns = set_index_columns(meta, index, columns, auto_index_allowed)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 1518, in set_index_columns
    raise ValueError(
ValueError: The following columns were not found in the dataset {'MAF_MB', 'CADD_PHRED', 'Absplice_DNA', 'MAF'}
The following columns were found Index(['DeepRipe_plus_QKI_clip_k5', 'alphamissense', 'DeepRipe_plus_TARDBP_parclip', 'Consequence_inframe_insertion', 'DeepSEA_PC_4', 'combined_UKB_NFE_AF_MB', 'id', 'gene_id', 'DeepSEA_PC_2', 'combined_UKB_NFE_MAF', 'Consequence_missense_variant', 'DeepRipe_plus_QKI_parclip', 'Consequence_splice_donor_variant', 'combined_UKB_NFE_AF', 'condel_score', 'Consequence_start_lost', 'CADD_raw', 'DeepRipe_plus_HNRNPD_parclip', 'Consequence_protein_altering_variant', 'Consequence_stop_lost', 'Consequence_inframe_deletion', 'polyphen_score', 'PrimateAI_score', 'AF', 'DeepRipe_plus_ELAVL1_parclip', 'DeepSEA_PC_5', 'DeepRipe_plus_KHDRBS1_clip_k5', 'Consequence_splice_region_variant', 'DeepSEA_PC_1', 'Consequence_splice_acceptor_variant', 'Consequence_stop_gained', 'DeepSEA_PC_3', 'DeepRipe_plus_QKI_lip_hg2', 'DeepSEA_PC_6', 'DeepRipe_plus_MBNL1_parclip', 'Consequence_frameshift_variant', 'SpliceAI_delta_score', 'sift_score'], dtype='object')
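A quick way to see this kind of mismatch before starting the pipeline is to compare the column names the config asks for against the names actually present in the file. The sketch below is only an illustration: `missing_columns` is a hypothetical helper, and the `available` list is a shortened, hand-copied subset of the columns reported in the error above.

```python
def missing_columns(requested, available):
    """Return the requested column names that are absent, in request order."""
    available_set = set(available)
    return [col for col in requested if col not in available_set]


# Columns the seed gene discovery config asked for (from the ValueError above).
requested = ["MAF", "MAF_MB", "CADD_PHRED", "Absplice_DNA"]

# Shortened subset of the columns the error reported as found.
# With pyarrow installed, the full list could instead be read from the file
# schema, e.g. pyarrow.parquet.read_schema(path).names.
available = [
    "combined_UKB_NFE_MAF",
    "combined_UKB_NFE_AF_MB",
    "combined_UKB_NFE_AF",
    "CADD_raw",
    "AF",
    "id",
    "gene_id",
]

print(missing_columns(requested, available))
```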
The CADD_PHRED and Absplice_DNA columns are missing from
anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet".
The CADD_PHRED column was present in the anno_dir / (source_variant_file_pattern + "_vep_anno.tsv") files. The Absplice_DNA column was present in the score_file = anno_tmp_dir / "abSplice_score_file.parquet" file.
Just out of curiosity, I additionally ran the pipeline again with CADD_PHRED and Absplice_DNA deleted from the config, specifying combined_UKB_NFE_MAF instead of MAF (also in the snakefile) and combined_UKB_NFE_AF_MB instead of MAF_MB (only in the data section of the yaml file, which I forgot for the first run). This run fails with:
Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deepRVAT_new/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 592, in make_dataset
    _, ds = make_dataset_(
  File "PATH/deepRVAT_new/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 543, in make_dataset_
    dataset = DenseGTDataset(
  File "PATH/deepRVAT_new/deeprvat/data/dense_gt.py", line 212, in __init__
    self.setup_variants(min_common_variant_count, min_common_af, variants)
  File "PATH/deepRVAT_new/deeprvat/data/dense_gt.py", line 614, in setup_variants
    variants_with_af = safe_merge(
  File "PATH/deepRVAT_new/deeprvat/utils.py", line 259, in safe_merge
    raise RuntimeError(
RuntimeError: Merged dataframe has 29960009 rows, left dataframe has 33883556
The input files
anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet"
and variants.parquet have 35814310 and 33883556 rows, respectively. I think something might go wrong when filtering the af_annoatation before merging. Looking forward to your insights!
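One way to narrow down a row-count mismatch like this is to compare the variant ids on both sides of the merge. The sketch below is an assumption-laden illustration, not the `safe_merge` code from deeprvat/utils.py: `diagnose_merge` is a hypothetical helper, it assumes the merge is an inner join on a single id column, and the ids are toy values.

```python
from collections import Counter


def diagnose_merge(left_ids, right_ids):
    """Explain why an inner join on ids would not preserve the left row count."""
    right_counts = Counter(right_ids)
    # Left ids with no partner on the right are lost in an inner join.
    missing = [i for i in left_ids if right_counts[i] == 0]
    # Left ids with several partners on the right are multiplied.
    duplicated = sorted({i for i in left_ids if right_counts[i] > 1})
    # Each left row contributes one merged row per matching right row.
    merged_rows = sum(right_counts[i] for i in left_ids)
    return {
        "left_rows": len(left_ids),
        "merged_rows": merged_rows,
        "missing_on_right": missing,
        "duplicated_on_right": duplicated,
    }


# Toy ids standing in for the variant ids of variants.parquet (left)
# and of the filtered annotation table (right).
report = diagnose_merge(left_ids=[1, 2, 3, 4], right_ids=[1, 2, 2, 5])
print(report)
```

If `missing_on_right` is non-empty, variants were removed from the annotation side before the merge; if `duplicated_on_right` is non-empty, the annotation side has several rows per variant id. Either situation changes the merged row count relative to the left dataframe.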