PMBio / deeprvat

Other
15 stars 1 forks source link

Column naming in annotations pipeline #68

Closed Jonas-B-Frank closed 2 months ago

Jonas-B-Frank commented 2 months ago

In rule select_rename_fill_columns, I get the following error:

Traceback (most recent call last): File "PATH/deeprvat/annotations/annotations.py", line 2128, in <module> cli() File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__ return self.main(*args, **kwargs) File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main rv = self.invoke(ctx) File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke return ctx.invoke(self.callback, **ctx.params) File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "PATH/deeprvat/annotations/annotations.py", line 2119, in select_rename_fill_annotations anno_df = pd.read_parquet( File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/io/parquet.py", line 503, in read_parquet return impl.read( File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/io/parquet.py", line 251, in read result = self.api.parquet.read_table( File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2973, in read_table return dataset.read(columns=columns, use_threads=use_threads, File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2601, in read table = self._dataset.to_table( File "pyarrow/_dataset.pyx", line 369, in pyarrow._dataset.Dataset.to_table File "pyarrow/_dataset.pyx", line 336, in pyarrow._dataset.Dataset.scanner File "pyarrow/_dataset.pyx", line 2576, in pyarrow._dataset.Scanner.from_dataset File "pyarrow/_dataset.pyx", line 2484, in pyarrow._dataset.Scanner._make_scan_options File "pyarrow/_dataset.pyx", line 2385, in pyarrow._dataset._populate_builder File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(Condel) in Consequence_splice_acceptor_variant: int64 MBNL1_parclip: double Consequence_frameshift_variant: int64 Consequence_splice_region_variant: int64 Consequence_inframe_insertion: int64 SpliceAI_delta_score: double KHDRBS1_k5: double Consequence_inframe_deletion: int64 CADD_RAW: double QKI_k5: double HNRNPD_parclip: double PrimateAI: double PolyPhen: double Consequence_protein_altering_variant: int64 Consequence_stop_lost: int64 TARDBP_parclip: double Consequence_start_lost: int64 ELAVL1_parclip: double AF: double SIFT: double Consequence_missense_variant: int64 Consequence_splice_donor_variant: int64 QKI_hg2: double Consequence_stop_gained: int64 QKI_parclip: double chrom: string pos: int64 ref: string alt: string id: int64 gene_base: string DeepSEA_PC_2: double DeepSEA_PC_3: double DeepSEA_PC_1: double DeepSEA_PC_5: double DeepSEA_PC_4: double DeepSEA_PC_6: double AbSplice_DNA: double af: double maf: double maf_mb: double gene_id: double __index_level_0__: int64 __fragment_index: int32 __batch_index: int32 __last_in_fragment: bool __filename: string

I don't know if I messed up the Condel plugin initialization or it is a coding error. Greatly appreciate your help.

Marcel-Mueck commented 2 months ago

Hey Jonas, thank you for opening this issue, I could not recreate it so far. Do you have any Condel related columns in your vep output files? If so, how are they named?

Jonas-B-Frank commented 2 months ago

Hello Marcel,

I do not see any condel related columns in my vep output files (output files of rule vep with _vep_anno.tsv extension). I also however do not see any AlphaMissense related columns. CADD, SpliceAI and PrimateAI columns are present.

The command which is stated in the vep output files is: ## VEP command-line: vep --af --assembly GRCh38 --biotype --cache --canonical --database 0 --dir_cache [PATH]/cache --dir_plugins [PATH]/Plugins --fasta [PATH]/Homo_sapiens_assembly38.fasta --force_overwrite --fork 5 --format vcf --input_file [PATH]/merged_case_control_chr2_stripped.vcf.gz --no_escape --no_stats --offline --output_file [PATH]/merged_case_control_chr2_vep_anno.tsv --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin [PATH]/AlphaMissense_hg38.tsv.gz --polyphen s --protein --sift s --tab --total_length

To me it looks like the plugin command is not correctly expanded to include all plugins. Alphamissense.pm was missing from repo_dir/ensembl-vep/Plugins, though. VEP Plugins are specified in the annotations config file like this:

additional_vep_plugin_cmds:
  cadd : CADD,PATH/preprocessing_and_annotation/annotation_data/cadd/whole_genome_SNVs.tsv.gz,PATH/preprocessing_and_annotation/annotation_data/cadd/gnomad.genomes.r4.0.indel.tsv.gz
  spliceAI : SpliceAI,snv=PATH/preprocessing_and_annotation/annotation_data/spliceAI/spliceai_scores.raw.snv.hg38.vcf.gz,indel=PATH/preprocessing_and_annotation/annotation_data/spliceAI/spliceai_scores.raw.indel.hg38.vcf.gz
  primateAI : PrimateAI,PATH/preprocessing_and_annotation/annotation_data/primateAI/PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz
  condel: Condel,PATH/preprocessing_and_annotation/repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2
  alphamissense : AlphaMissense,file=PATH/references_hg38/AlphaMissense_hg38.tsv.gz
Marcel-Mueck commented 2 months ago

Do you have any files containing _vep_anno.tsv_warnings.txtin the output directory?

Marcel-Mueck commented 2 months ago

Also, when you rerun the snakemake pipeline with the n and p flags, it should perform a dryrun while printing out the shell commands. So when you run snakemake --snakefile path/to/deeprvat/pipelines/annotations.snakefile --configfile path/to/deeprvat/pipelines/config/deeprvat_annotation_config.yaml -np --forceall it should print out (among all other commands) the command it used for the vep rule without running anything.

Jonas-B-Frank commented 2 months ago

Thank you for your suggestions Marcel! I indeed had warning files. I forgot to tabix the AlphaMissense data (additionally to not including the .pm file for AlphaMissense) and adjust the condel config path, which is why both Plugins have been excluded from the vep run. I am running vep again and keep you posted, if the issue is resolved.

Marcel-Mueck commented 2 months ago

That is nice to hear, keep me updated :)

Jonas-B-Frank commented 2 months ago

Pipeline finished without error. Thanks! :)