Closed Jonas-B-Frank closed 2 months ago
Hi Jonas,
Sorry for getting back to you with such a delay but I missed your issue. Have you been able to resolve the issue in the meantime? Otherwise I will look into it this week.
Best,
Eva
Hello Eva,
I am afraid I haven't found a fix or the cause yet.
Best, Jonas
Hi Jonas,
I starting checking your issue. The problem is in the input data/the annotation and preprocessing pipeline.
Your error shows that some variants that are in the variants.parquet
file are missing in anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet"
Since the safe_merge
does an inner join, this leads to the merged data frame having fewer rows than the variants.parquet
/(i.e., the variants
data frame).
We are still investigating what part in the preprocessing/annotation pipeline causes this issue and hope to give you more insights early next week.
To just get things running, you could try removing all variants not in vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet
from variants.parquet
. I'm not 100% this will work though since it might into another error later in DenseGTDatasets
when the genotypes.h5
get used....
Best, Eva
Hey Jonas, I am currently looking into the issue. I could not recreate the issue so far, though. It would help a lot if I could find out in what step of the annotation process the variants get 'lost'. It would be great if you could help me with that. The intermediate files of the annotation pipeline are all saved in annotations_dir/output/annotations
. One file in this directory is vep_deepripe.parquet
. Could you maybe check if this contains all the variants that are also in the preprocess_workdir/norm/varaints/variants.parquet
file? You could do this by loading the files and comparing nunique()
of the id
column of the files.
Your help would be greatly appreciated.
Regards Marcel.
Hello Eva and Marcel,
sure, happy to help if I can! I checked the two files and your presumption is correct. The variants.parquet
has 3923547 unique id's to itself, which are not present in vep_deepripe.parquet
(the exact amount of rows which are reported as missing in the seed gene pipeline)
Best Jonas
FYI: I didn't execute the pipeline with dropping the variants. If I do, I ll keep you up to date if it works
Hey Jonas, we made and pushed some changes to the preprocessing pipeline as well as the annotation pipeline to assure that the variant.parquet
file matches the final annotation parquet file. I hope that this resolves the issue, however when you are able to rerun the pipelines I am looking forward to hear from you whether it worked or not.
Regards,
Marcel
Will let you know asap, but will probably get to it only next week.
Hello Marcel and Eva,
I updated the code and reran from scratch. I am getting an error in the preprocessing pipeline, rule preprocess
:
deeprvat_preprocess process-sparse-gt --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/allelic_imbalance --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/varmiss --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/duplicate_vars --exclude-calls preprocessing_and_annotation/preprocessing_workdir/qc/read_depth --exclude-samples preprocessing_and_annotation/preprocessing_workdir/qc/filtered_samples --chromosomes 4,8,6,15,20,3,Y,13,14,1,16,17,9,5,12,2,11,10,18,X,19,22,7,21 preprocessing_and_annotation/preprocessing_workdir/norm/variants/variants.parquet preprocessing_and_annotation/preprocessing_workdir/norm/samples_chr.csv preprocessing_and_annotation/preprocessing_workdir/norm/sparse preprocessing_and_annotation/preprocessing_workdir/preprocesed/genotypes Activating conda environment: deeprvat_preprocess Traceback (most recent call last): File "PATH/miniconda3/envs/deeprvat_preprocess/bin/deeprvat_preprocess", line 33, in <module> sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_preprocess')()) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1130, in __call__ return self.main(*args, **kwargs) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, **ctx.params) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 760, in invoke return __callback(*args, **kwargs) File "PATH/deeprvat/preprocessing/preprocess.py", line 280, in process_sparse_gt assert total_variants - len(variants) == len(variants_to_exclude) AssertionError
To get my input files, I am merging two multisample vcf files with bcftools (both from the same caller) to get one vcf file, which I than scatter by chr - might that lead to the error? If I can help in isolating the problem, let me know!
Have a nice weekend (and maybe holidays), best Jonas
Hello Marcel and Eva,
I updated the code and reran from scratch. I am getting an error in the preprocessing pipeline,
rule preprocess
:
deeprvat_preprocess process-sparse-gt --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/allelic_imbalance --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/varmiss --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/duplicate_vars --exclude-calls preprocessing_and_annotation/preprocessing_workdir/qc/read_depth --exclude-samples preprocessing_and_annotation/preprocessing_workdir/qc/filtered_samples --chromosomes 4,8,6,15,20,3,Y,13,14,1,16,17,9,5,12,2,11,10,18,X,19,22,7,21 preprocessing_and_annotation/preprocessing_workdir/norm/variants/variants.parquet preprocessing_and_annotation/preprocessing_workdir/norm/samples_chr.csv preprocessing_and_annotation/preprocessing_workdir/norm/sparse preprocessing_and_annotation/preprocessing_workdir/preprocesed/genotypes Activating conda environment: deeprvat_preprocess Traceback (most recent call last): File "PATH/miniconda3/envs/deeprvat_preprocess/bin/deeprvat_preprocess", line 33, in <module> sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_preprocess')()) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1130, in __call__ return self.main(*args, **kwargs) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, **ctx.params) File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 760, in invoke return __callback(*args, **kwargs) File "PATH/deeprvat/preprocessing/preprocess.py", line 280, in process_sparse_gt assert total_variants - len(variants) == len(variants_to_exclude) AssertionError
To get my input files, I am merging two multisample vcf files with bcftools (both from the same caller) to get one vcf file, which I than scatter by chr - might that lead to the error? If I can help in isolating the problem, let me know!
Have a nice weekend (and maybe holidays), best Jonas
Hi Jonas We updated that part of the preprocessing pipeline so it should not fail now even if you have extra variants in the exclude files. When you have time please test again with the code in main.
Hello Magnus,
seems to do the trick for this specific error, but I get one in the same rule a bit downstream:
deeprvat_preprocess process-sparse-gt --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/allelic_imbalance --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/varmiss --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/duplicate_vars --exclude-calls preprocessing_and_annotation/preprocessing_workdir/qc/read_depth --exclude-samples preprocessing_and_annotation/preprocessing_workdir/qc/filtered_samples --chromosomes 12,11,1,4,19,21,5,10,2,Y,6,16,14,3,15,8,7,22,17,9,13,20,18,X preprocessing_and_annotation/preprocessing_workdir/norm/variants/variants.parquet preprocessing_and_annotation/preprocessing_workdir/norm/samples_chr.csv preprocessing_and_annotation/preprocessing_workdir/norm/sparse preprocessing_and_annotation/preprocessing_workdir/preprocesed/genotypes Activating conda environment: deeprvat_preprocess Traceback (most recent call last): File "PATH/deeprvat_preprocess/bin/deeprvat_preprocess", line 33, in <module> sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_preprocess')()) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1130, in __call__ return self.main(*args, **kwargs) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, **ctx.params) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 760, in invoke return __callback(*args, **kwargs) File "PATH/deepRVAT_new/deeprvat/preprocessing/preprocess.py", line 400, in process_sparse_gt max_len = np.max(count_variants) File "<__array_function__ internals>", line 5, in amax File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2754, in amax return _wrapreduction(a, np.maximum, 'max', axis, None, out, File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction return ufunc.reduce(obj, axis, dtype, out, **passkwargs) ValueError: zero-size array to reduction operation maximum which has no identity
Best
Hello Magnus,
seems to do the trick for this specific error, but I get one in the same rule a bit downstream:
deeprvat_preprocess process-sparse-gt --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/allelic_imbalance --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/varmiss --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/duplicate_vars --exclude-calls preprocessing_and_annotation/preprocessing_workdir/qc/read_depth --exclude-samples preprocessing_and_annotation/preprocessing_workdir/qc/filtered_samples --chromosomes 12,11,1,4,19,21,5,10,2,Y,6,16,14,3,15,8,7,22,17,9,13,20,18,X preprocessing_and_annotation/preprocessing_workdir/norm/variants/variants.parquet preprocessing_and_annotation/preprocessing_workdir/norm/samples_chr.csv preprocessing_and_annotation/preprocessing_workdir/norm/sparse preprocessing_and_annotation/preprocessing_workdir/preprocesed/genotypes Activating conda environment: deeprvat_preprocess Traceback (most recent call last): File "PATH/deeprvat_preprocess/bin/deeprvat_preprocess", line 33, in <module> sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_preprocess')()) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1130, in __call__ return self.main(*args, **kwargs) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, **ctx.params) File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 760, in invoke return __callback(*args, **kwargs) File "PATH/deepRVAT_new/deeprvat/preprocessing/preprocess.py", line 400, in process_sparse_gt max_len = np.max(count_variants) File "<__array_function__ internals>", line 5, in amax File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2754, in amax return _wrapreduction(a, np.maximum, 'max', axis, None, out, File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction return ufunc.reduce(obj, axis, dtype, out, **passkwargs) ValueError: zero-size array to reduction operation maximum which has no identity
Best
Hi!
We were able to reproduce your error if the samples array is empty. So for some reason all the samples have been filtered out, could you check if the files in preprocessing_and_annotation/preprocessing_workdir/qc/filtered_samples
and compare them to the the samples in preprocessing_and_annotation/preprocessing_workdir/norm/samples_chr.csv
.
If the exclude CSVs contains the same samples then this will fail.
This was it. Running again. Will keep you posted.
Hi Jonas, hope your issue is resolved now, but just to let you know, we also have a pipeline (pipelines/preprocess_no_qc.snakefile
) that allows you to do your own QC, in case the default values we have set aren't right for your data.
Hello Brian,
thanks, I did already notice and use that, but did not readjust after pulling from scratch - my fault.
After setting up fresh and running again, I get an error in the rule merge_annotations
in annotations.py:
vep_file[float_vals_present] = ( vep_file[float_vals_present].replace("-", "NaN").astype(float) ).
It seems like there are comma separated values for the AF in the vep_anno
files for a few rows, which can not be converted to floats afterwards.
HEADER=$(grep -n '#Uploaded_variation' PATH/preprocessing_and_annotation/annotations/chrX_vep_anno.tsv| head | cut -f 1 -d ':') && deeprvat_annotations merge-annotations $(($HEADER-1)) PATH/preprocessing_and_annotation/annotations/chrX_vep_anno.tsv PATH/preprocessing_and_annotation/annotations/chrX_variants.parclip_deepripe.csv.gz PATH/preprocessing_and_annotation/annotations/chrX_variants.eclip_hg2_deepripe.csv.gz PATH/preprocessing_and_annotation/annotations/chrX_variants.eclip_k5_deepripe.csv.gz PATH/preprocessing_and_annotation/preprocessing_workdir/norm/variants/variants.parquet PATH/preprocessing_and_annotation/annotations/tmp/chrX_variants.vcf PATH/preprocessing_and_annotation/annotations/chrX_merged.parquet PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index) PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index) PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index) PATH/deeprvat/annotations/annotations.py:1647: DtypeWarning: Columns (16,22,23,31) have mixed types. Specify dtype option on import or set low_memory=False. vep_df = pd.read_csv(vep_file, header=vep_header_line, sep="\t", na_values="-") Traceback (most recent call last): File "PATH/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module> sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')()) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__ return self.main(*args, **kwargs) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main rv = self.invoke(ctx) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke return ctx.invoke(self.callback, **ctx.params) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "PATH/deeprvat/annotations/annotations.py", line 1650, in merge_annotations vep_df = process_vep( File "PATH/deeprvat/annotations/annotations.py", line 1799, in process_vep vep_file[float_vals_present].replace("-", "NaN").astype(float) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/generic.py", line 6240, in astype new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 448, in astype return self.apply("astype", dtype=dtype, copy=copy, errors=errors) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 352, in apply applied = getattr(b, f)(**kwargs) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 526, in astype new_values = astype_array_safe(values, dtype, copy=copy, errors=errors) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe new_values = astype_array(values, dtype, copy=copy) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 230, in astype_array values = astype_nansafe(values, dtype, copy=copy) File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 170, in astype_nansafe return arr.astype(dtype, copy=True) ValueError: could not convert string to float: '0.0008,0.3229'
The corresponding line in the vep file is:
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation IMPACT DISTANCE STRAND FLAGS BIOTYPE CANONICAL ENSP SIFT PolyPhen AF CLIN_SIG SOMATIC PHENO CADD_PHRED CADD_RAW SpliceAI_pred PrimateAI Condel am_class am_pathogenicity
chrX_3209948_T_TATC X:3209944-3209948 - - - - intergenic_variant - - - - - rs200703810 MODIFIER - - - - - - - - 0.0008,0.3229 - - -1.078 0.001562 - - - - -
I think the most viable solution would be to catch multiple values in the AF value beforehand by doing sth like
split_columns = vep_file['AF'].str.split(',', expand=True)
split_columns = split_columns.apply(pd.to_numeric)
min_values = split_columns.min(axis=1)
vep_file['AF'] = min_values
But it might be easier to catch this situation during the annotation process itself. Furthermore I don't think that just setting the min value is the best solution for all posititons with 2 or more AF.
If I should open a new issue from now on, just let me know. Best
Hello Jonas, from your output it looks like the shell command is erroneos, as it begins with HEADER=$(grep -n '#Uploaded_variation' 0.0008,0.3229| head | cut -f 1 -d ':')
.. instead of '#Uploaded_variation' PATH/preprocessing_and_annotation/annotations/chrX_vep_anno.tsv| head | cut -f 1 -d ':')
.. was that a copy/paste mistake or did the shell command look like this?
Hello Marcel, sorry, I managed to integrate a copy paste mistake. Did correct this!
No worries, thank you for bringing the issue up.
It is quite interesting that VEP outputs multiple values for the AF
column.
On the website the column is only described as AF - Frequency of existing variant in 1000 Genomes
.
I have so far never ran into the issue.
However, I also tested the pipeline mostly on chr1-22, without the sex chromosomes.
Thank you for your suggested solution, that sounds really reasonable.
About opening a new issue, or continue on this thread:
I think both is fine, so whatever you prefer.
Regards, Marcel
I did get this error for mostly all chr, so not only gonosomes are affected by this issue. As for why VEP does this, I don't have any clue.
I will test running it without.
One possible reason for the comma separated list could be that the AF is used twice in the vep command (--af), once using --af
and once --{af_mode}
.
Furthermore force_overwrite
is duplicated in the command.
Good catch with the duplicates.
If you run vep without the --af
flag, you will have to remove the
'AF' :
'AF' : 0
values from pipelines/config/annotation_colnames_filling_values.yaml
for the annotation pipeline to finish successfully :)
When dropping the AF as you described, the annotation pipeline runs without error :)
When running the seed gene discovery pipeline however, I get an error during association_dataset
:
conda info | grep "active environment" && seed_gene_pipeline make-dataset --pickled-dataset-file ALS/plof/association_dataset_pickled.pkl ALS/plof/config.yaml ALS/plof/association_dataset_full.pkl
Traceback (most recent call last):
File "PATH/deeprvat/utils.py", line 257, in safe_merge
assert len(merged) == len(left)
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 590, in make_dataset
_, ds = make_dataset_(
File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 541, in make_dataset_
dataset = DenseGTDataset(
File "PATH/deeprvat/data/dense_gt.py", line 212, in __init__
self.setup_variants(min_common_variant_count, min_common_af, variants)
File "PATH/deeprvat/data/dense_gt.py", line 616, in setup_variants
variants_with_af = safe_merge(
File "PATH/deeprvat/utils.py", line 259, in safe_merge
raise RuntimeError(
RuntimeError: Merged dataframe has 11610442 rows, left dataframe has 12733020
[Wed May 29 08:30:47 2024]
It seems like variants.parquet
and complete_annotations.parquet
have a different number of unique variants. I checked it manually and got 12733020 rows in variants.parquet
and 11610443 unique ids in complete_annotations.parquet
. I was wondering if this might have something to do with my altered preprocessing QC.
I eliminated
qc_hwe=expand(rules.qc_hwe.output, vcf_stem=vcf_stems),
qc_indmiss_samples=rules.process_individual_missingness.output,
from the input of rule preprocess and this from the shell section:
f"--exclude-variants {qc_hwe_dir}",
f"--exclude-samples {qc_filtered_samples_dir}",
commenting out the corresponding rules in the qc snakefile. I thought this would be fine before getting this error.
Greatful for your insights, best
Hey Jonas, could you check the 'vep_deepripe.parquet' file in the annotation output directory for how many unique variants there are to help pin down the issue?
It's 11610443 as well.
Hey Jonas, thank you for helping to pin it down.
One thing that might have gone wrong is that the vcf files that were used as an input for the annotation pipeline, do not match the normalized variants in the variants.parquet
file. To make sure that the annotation input is normalized the same way the variants are normalized, I would recommand using the BCF files in preprocessing/norm/bcf
as an input for the annotation pipeline( i.e. setting the source_variant_dir
variable in the annotation config to preprocessing_workdir/norm/bcf/
and the source_variant_file_type
to bcf
). One other thing to make sure is that the list in the included chromosomes
variable matches in the preprocessing and annotation config.
Using the normalized bcf files might be the solution for me. I am testing this and will get back to you Thank you for your help!
I once again get an error in the merge_annotations
rule, regarding the process_vep
function:
Traceback (most recent call last):
File "PATH/miniconda3/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module>
sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')())
File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "PATH/deeprvat/annotations/annotations.py", line 1650, in merge_annotations
vep_df = process_vep(
File "/hpc/gpfs2/scratch/g/bioinf-neuro/frajonas/deepRVAT_new/deeprvat/annotations/annotations.py", line 1754, in process_vep
assert (
AssertionError
This error occurs for all processed chromosomes. I checked the counts of unique values of the uploaded variation column (in the form of chr_pos_ref_alt
) in annotations/tmp/chr_variants.vcf
and annotations/chr_vep_anno.tsv
and they match.
I did notice however that multiallelic variants which have been normalized correctly into two entries with different alt in the variants.vcf
file, still have the same uploaded variations column, e.g.:
chr1_15274_A_G;chr1_15274_A_T
For this variant, in the annotations/chr_vep_anno.tsv
i only find one of the two variants in the uploaded variations column, with A to T not being listed in this column anymore (see below):
chr1_15274_A_G 1:15274 G
chr1_15274_A_G 1:15274 T
Maybe this leads to the unexpected behaviour?
Dear Jonas, I looked into the issue and found a way to circumvent it by regenerating the IDs for the VCFs that are used as input for VEP. This makes sure that each variant has a unique ID and is named accordingly in the VEP output. I created a PR which should be merged into main soon :)
Hey Marcel, the annotation pipeline runs through with your fix. Thanks!
Hello all,
after running the annotations pipeline without error, in the seed gene pipeline in regress_plof
, I get an error related to #70:
Traceback (most recent call last):
File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 641, in run_association
all_samples = np.array([int(i) for i in data_full["sample"]])
File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 641, in <listcomp>
all_samples = np.array([int(i) for i in data_full["sample"]])
ValueError: invalid literal for int() with base 10: 'str_sample_name'
Sample names are cast to integers, throwing an error of course for str sample names.
Furthermore, I adjusted the seed_gene_discovery.py to incorporate Variants with 0 <= AF<0.001:
rare_threshold_config[maf_column] = (
f"{maf_column} < {rare_maf} and {maf_column} >= 0"
)
Otherwise, all variants are filtered out. I checked AF and I only have AF = 0 and AF >= 0.05. I am currently running on a test dataset, which might cause this, but the 0.05 threshold seems at least suspicious.
Also, i commented out one line in the call of dataloader, since batch_size is passed twice to the function otherwise (because it is defined in the config file and has different values for different steps, which is why I didn't remove it from the config file).
dataloader = DataLoader(
dataset,
#batch_size=batch_size, # reading from dataloader config
collate_fn=dataset.collate_fn,
**data_config["dataloader_config"],
)
Hi @Jonas-B-Frank, Thank you for reporting this issue. @endast and me have addressed the issue related to casting sample IDs to integers in #108 and merged the changes into main. Sample ids are not casted into integers any more now.
Regarding your second issue, the removal of all variants, this seems to be related to something weird happening in your data set (as you said, maybe related to you using a test data set). I don't quite understand how you can have variants with AF = 0 since they should get filtered out during the preporcessing. Still, let us know if we can help with that!
Hi @Jonas-B-Frank, I just wanted to double check if there is anything related to this issue which needs to be resolved? Otherwise I will close this issue in the next weeks.
Processing my input data with your preprocessing and annotation pipelines, I get the input files for running the seed gene discovery pipeline (except the phenotypes parquet, which I created). In
rule association_dataset
, I get the following error:Traceback (most recent call last): File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module> sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')()) File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "PATH/deepRVAT_new/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 592, in make_dataset _, ds = make_dataset_( File "PATH/deepRVAT_new/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 543, in make_dataset_ dataset = DenseGTDataset( File "PATH/deepRVAT_new/deeprvat/data/dense_gt.py", line 212, in __init__ self.setup_variants(min_common_variant_count, min_common_af, variants) File "PATH/deepRVAT_new/deeprvat/data/dense_gt.py", line 614, in setup_variants variants_with_af = safe_merge( File "PATH/deepRVAT_new/deeprvat/utils.py", line 259, in safe_merge raise RuntimeError( RuntimeError: Merged dataframe has 29960009 rows, left dataframe has 33883556
The input files
anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet"
andvariants.parquet
have 35814310 resp. 33883556 rows. I think something might go wrong when filtering the af_annoatation before merging. Looking forward to your insights!