PMBio / deeprvat


Failing save_merge in association_dataset in seed gene discovery pipeline #76

Open Jonas-B-Frank opened 2 months ago

Jonas-B-Frank commented 2 months ago

Processing my input data with your preprocessing and annotation pipelines, I get the input files for running the seed gene discovery pipeline (except the phenotypes parquet, which I created). In rule association_dataset, I get the following error:

Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deepRVAT_new/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 592, in make_dataset
    _, ds = make_dataset_(
  File "PATH/deepRVAT_new/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 543, in make_dataset_
    dataset = DenseGTDataset(
  File "PATH/deepRVAT_new/deeprvat/data/dense_gt.py", line 212, in __init__
    self.setup_variants(min_common_variant_count, min_common_af, variants)
  File "PATH/deepRVAT_new/deeprvat/data/dense_gt.py", line 614, in setup_variants
    variants_with_af = safe_merge(
  File "PATH/deepRVAT_new/deeprvat/utils.py", line 259, in safe_merge
    raise RuntimeError(
RuntimeError: Merged dataframe has 29960009 rows, left dataframe has 33883556

The input files anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet" and variants.parquet have 35814310 and 33883556 rows, respectively. I think something might go wrong when filtering the af_annotation before merging. Looking forward to your insights!
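
For illustration, the row-count mismatch that safe_merge reports can be reproduced in miniature with pandas (toy data, not the actual deeprvat tables; the column name "id" is an assumption):

```python
import pandas as pd

# Toy stand-ins for variants.parquet (left) and the annotation file (right);
# the "id" column name is assumed for illustration only.
variants = pd.DataFrame({"id": [1, 2, 3, 4], "chrom": ["1", "1", "2", "2"]})
annotations = pd.DataFrame({"id": [1, 2, 4], "af": [0.01, 0.20, 0.05]})

# An inner merge keeps only ids present in BOTH frames, so rows missing from
# the annotations shrink the result below len(variants) - what safe_merge flags.
merged = variants.merge(annotations, on="id", how="inner")
print(len(merged), len(variants))  # 3 4

# The offending ids can be listed directly:
missing = set(variants["id"]) - set(annotations["id"])
print(sorted(missing))  # [3]
```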

HolEv commented 1 month ago

Hi Jonas,

Sorry for getting back to you with such a delay but I missed your issue. Have you been able to resolve the issue in the meantime? Otherwise I will look into it this week.

Best,

Eva

Jonas-B-Frank commented 1 month ago

Hello Eva,

I am afraid I haven't found a fix or the cause yet.

Best, Jonas

HolEv commented 1 month ago

Hi Jonas,

I started checking your issue. The problem is in the input data, i.e., in the annotation and preprocessing pipelines. Your error shows that some variants that are in the variants.parquet file are missing in anno_dir / "vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet". Since safe_merge does an inner join, this leads to the merged data frame having fewer rows than variants.parquet (i.e., the variants data frame). We are still investigating what part of the preprocessing/annotation pipeline causes this issue and hope to give you more insights early next week.

To just get things running, you could try removing all variants not in vep_deepripe_deepsea_absplice_maf_pIDs_filtered_filled.parquet from variants.parquet. I'm not 100% sure this will work, though, since it might run into another error later in DenseGTDataset when the genotypes.h5 files get used.
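
Sketched on toy frames, the workaround could look roughly like this (the "id" column name and the write-back step are assumptions; in real usage both frames would come from pd.read_parquet on the two files):

```python
import pandas as pd

# Toy frames in place of variants.parquet and the annotation parquet.
variants = pd.DataFrame({"id": [10, 11, 12, 13]})
annotations = pd.DataFrame({"id": [10, 12, 13]})

# Keep only variants whose id also appears in the annotations, so the
# later inner join can no longer lose rows.
filtered = variants[variants["id"].isin(annotations["id"])].reset_index(drop=True)
assert len(filtered) == len(filtered.merge(annotations, on="id", how="inner"))
# Real usage would finish with something like: filtered.to_parquet("variants.parquet")
```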

Best, Eva

Marcel-Mueck commented 1 month ago

Hey Jonas, I am currently looking into the issue, but I could not recreate it so far. It would help a lot to find out in which step of the annotation process the variants get 'lost'. It would be great if you could help me with that. The intermediate files of the annotation pipeline are all saved in annotations_dir/output/annotations. One file in this directory is vep_deepripe.parquet. Could you check whether it contains all the variants that are also in the preprocess_workdir/norm/variants/variants.parquet file? You could do this by loading both files and comparing nunique() of their id columns. Your help would be greatly appreciated. Regards, Marcel
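
The suggested check might look like this (a sketch; toy frames stand in for the two parquet files, and the "id" column name is an assumption):

```python
import pandas as pd

# In real usage these would be pd.read_parquet("vep_deepripe.parquet") and
# pd.read_parquet(".../variants.parquet"); toy frames are used here instead.
vep = pd.DataFrame({"id": [1, 1, 2, 3]})          # duplicate rows are fine
variants = pd.DataFrame({"id": [1, 2, 3, 4, 4]})

# nunique() counts distinct ids; a difference means variants got 'lost'.
n_vep, n_variants = vep["id"].nunique(), variants["id"].nunique()
lost = n_variants - n_vep
print(n_vep, n_variants, lost)  # 3 4 1
```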

Jonas-B-Frank commented 1 month ago

Hello Eva and Marcel,

sure, happy to help if I can! I checked the two files and your presumption is correct: variants.parquet has 3923547 unique ids that are not present in vep_deepripe.parquet (exactly the number of rows reported as missing in the seed gene pipeline).

Best Jonas

FYI: I didn't execute the pipeline with the variants dropped. If I do, I'll keep you up to date on whether it works.

Marcel-Mueck commented 1 month ago

Hey Jonas, we made and pushed some changes to the preprocessing pipeline as well as the annotation pipeline to ensure that the variants.parquet file matches the final annotation parquet file. I hope this resolves the issue; when you are able to rerun the pipelines, I am looking forward to hearing whether it worked. Regards, Marcel

Jonas-B-Frank commented 1 month ago

Will let you know asap, but will probably get to it only next week.

Jonas-B-Frank commented 1 month ago

Hello Marcel and Eva,

I updated the code and reran from scratch. I am getting an error in the preprocessing pipeline, rule preprocess:

deeprvat_preprocess process-sparse-gt --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/allelic_imbalance --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/varmiss --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/duplicate_vars --exclude-calls preprocessing_and_annotation/preprocessing_workdir/qc/read_depth --exclude-samples preprocessing_and_annotation/preprocessing_workdir/qc/filtered_samples --chromosomes 4,8,6,15,20,3,Y,13,14,1,16,17,9,5,12,2,11,10,18,X,19,22,7,21 preprocessing_and_annotation/preprocessing_workdir/norm/variants/variants.parquet preprocessing_and_annotation/preprocessing_workdir/norm/samples_chr.csv preprocessing_and_annotation/preprocessing_workdir/norm/sparse preprocessing_and_annotation/preprocessing_workdir/preprocesed/genotypes
Activating conda environment: deeprvat_preprocess
Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat_preprocess/bin/deeprvat_preprocess", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_preprocess')())
  File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/miniconda3/envs/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deeprvat/preprocessing/preprocess.py", line 280, in process_sparse_gt
    assert total_variants - len(variants) == len(variants_to_exclude)
AssertionError

To get my input files, I am merging two multisample VCF files with bcftools (both from the same caller) into one VCF file, which I then scatter by chromosome - might that lead to the error? If I can help in isolating the problem, let me know!

Have a nice weekend (and maybe holidays), best Jonas

endast commented 1 month ago

Hi Jonas, we updated that part of the preprocessing pipeline, so it should no longer fail even if you have extra variants in the exclude files. When you have time, please test again with the code in main.

Jonas-B-Frank commented 1 month ago

Hello Magnus,

seems to do the trick for this specific error, but I get one in the same rule a bit downstream:

deeprvat_preprocess process-sparse-gt --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/allelic_imbalance --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/varmiss --exclude-variants preprocessing_and_annotation/preprocessing_workdir/qc/duplicate_vars --exclude-calls preprocessing_and_annotation/preprocessing_workdir/qc/read_depth --exclude-samples preprocessing_and_annotation/preprocessing_workdir/qc/filtered_samples --chromosomes 12,11,1,4,19,21,5,10,2,Y,6,16,14,3,15,8,7,22,17,9,13,20,18,X preprocessing_and_annotation/preprocessing_workdir/norm/variants/variants.parquet preprocessing_and_annotation/preprocessing_workdir/norm/samples_chr.csv preprocessing_and_annotation/preprocessing_workdir/norm/sparse preprocessing_and_annotation/preprocessing_workdir/preprocesed/genotypes
Activating conda environment: deeprvat_preprocess
Traceback (most recent call last):
  File "PATH/deeprvat_preprocess/bin/deeprvat_preprocess", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_preprocess')())
  File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deepRVAT_new/deeprvat/preprocessing/preprocess.py", line 400, in process_sparse_gt
    max_len = np.max(count_variants)
  File "<__array_function__ internals>", line 5, in amax
  File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2754, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "PATH/deeprvat_preprocess/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity

Best

endast commented 1 month ago

Hi! We were able to reproduce your error when the samples array is empty. So for some reason all samples have been filtered out. Could you check the files in preprocessing_and_annotation/preprocessing_workdir/qc/filtered_samples and compare them to the samples in preprocessing_and_annotation/preprocessing_workdir/norm/samples_chr.csv?

If the exclude CSVs contain the same samples, then this will fail.
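
The failure mode is easy to reproduce in miniature (toy sample sets, names assumed): when the exclude files cover every sample, the remaining array is empty and np.max raises exactly the reported ValueError.

```python
import numpy as np

# Toy stand-ins for the sample lists: every sample in samples_chr.csv
# also appears in the qc/filtered_samples exclude files.
all_samples = {"S1", "S2", "S3"}
excluded = {"S1", "S2", "S3"}

remaining = np.array(sorted(all_samples - excluded))
try:
    np.max(remaining)  # empty array -> the reported ValueError
except ValueError as err:
    print(err)  # zero-size array to reduction operation maximum which has no identity
```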

Jonas-B-Frank commented 1 month ago

This was it. Running again. Will keep you posted.

bfclarke commented 1 month ago

Hi Jonas, hope your issue is resolved now, but just to let you know, we also have a pipeline (pipelines/preprocess_no_qc.snakefile) that allows you to do your own QC, in case the default values we have set aren't right for your data.

Jonas-B-Frank commented 1 month ago

Hello Brian,

thanks, I did already notice and use that, but did not readjust after pulling from scratch - my fault.

After setting up fresh and running again, I get an error in the rule merge_annotations in annotations.py: vep_file[float_vals_present] = ( vep_file[float_vals_present].replace("-", "NaN").astype(float) ). It seems like there are comma separated values for the AF in the vep_anno files for a few rows, which can not be converted to floats afterwards.

HEADER=$(grep -n '#Uploaded_variation' PATH/preprocessing_and_annotation/annotations/chrX_vep_anno.tsv| head | cut -f 1 -d ':') && deeprvat_annotations merge-annotations $(($HEADER-1)) PATH/preprocessing_and_annotation/annotations/chrX_vep_anno.tsv PATH/preprocessing_and_annotation/annotations/chrX_variants.parclip_deepripe.csv.gz PATH/preprocessing_and_annotation/annotations/chrX_variants.eclip_hg2_deepripe.csv.gz PATH/preprocessing_and_annotation/annotations/chrX_variants.eclip_k5_deepripe.csv.gz PATH/preprocessing_and_annotation/preprocessing_workdir/norm/variants/variants.parquet PATH/preprocessing_and_annotation/annotations/tmp/chrX_variants.vcf PATH/preprocessing_and_annotation/annotations/chrX_merged.parquet
PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
PATH/deeprvat/annotations/annotations.py:1647: DtypeWarning: Columns (16,22,23,31) have mixed types. Specify dtype option on import or set low_memory=False.
  vep_df = pd.read_csv(vep_file, header=vep_header_line, sep="\t", na_values="-")
Traceback (most recent call last):
  File "PATH/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')())
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deeprvat/annotations/annotations.py", line 1650, in merge_annotations
    vep_df = process_vep(
  File "PATH/deeprvat/annotations/annotations.py", line 1799, in process_vep
    vep_file[float_vals_present].replace("-", "NaN").astype(float)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 448, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "PATH/envs/deeprvat_annotations/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 170, in astype_nansafe
    return arr.astype(dtype, copy=True)
ValueError: could not convert string to float: '0.0008,0.3229'

The corresponding line in the vep file is:

#Uploaded_variation     Location        Allele  Gene    Feature Feature_type    Consequence     cDNA_position   CDS_position    Protein_position        Amino_acids     Codons  Existing_variation      IMPACT  DISTANCE        STRAND  FLAGS   BIOTYPE CANONICAL       ENSP         SIFT    PolyPhen        AF      CLIN_SIG        SOMATIC PHENO   CADD_PHRED      CADD_RAW        SpliceAI_pred   PrimateAI       Condel  am_class        am_pathogenicity

chrX_3209948_T_TATC     X:3209944-3209948       -       -       -       -       intergenic_variant      -       -       -       -       -       rs200703810     MODIFIER        -       -       -       -       -       -       -       -       0.0008,0.3229   -       -   -1.078   0.001562        -       -       -       -       -

I think the most viable solution would be to catch multiple values in the AF column beforehand by doing something like

split_columns = vep_file['AF'].str.split(',', expand=True)  # one column per comma-separated value
split_columns = split_columns.apply(pd.to_numeric)          # convert each part to a number
vep_file['AF'] = split_columns.min(axis=1)                  # keep the smallest AF per variant

But it might be easier to catch this situation during the annotation process itself. Furthermore, I don't think that just taking the minimum value is the best solution for all positions with two or more AFs.

If I should open a new issue from now on, just let me know. Best

Marcel-Mueck commented 1 month ago

Hello Jonas, from your output it looks like the shell command is erroneous, as it begins with HEADER=$(grep -n '#Uploaded_variation' 0.0008,0.3229| head | cut -f 1 -d ':').. instead of '#Uploaded_variation' PATH/preprocessing_and_annotation/annotations/chrX_vep_anno.tsv| head | cut -f 1 -d ':').. Was that a copy/paste mistake, or did the shell command look like this?

Jonas-B-Frank commented 1 month ago

Hello Marcel, sorry, I managed to introduce a copy/paste mistake. I have corrected it!

Marcel-Mueck commented 1 month ago

No worries, thank you for bringing the issue up. It is quite interesting that VEP outputs multiple values for the AF column. On the website the column is only described as "AF - Frequency of existing variant in 1000 Genomes". I have never run into this issue so far. However, I have also tested the pipeline mostly on chr1-22, without the sex chromosomes. Thank you for your suggested solution, that sounds really reasonable. About opening a new issue or continuing in this thread: either is fine, so whatever you prefer.

Regards, Marcel

Jonas-B-Frank commented 1 month ago

I got this error for almost all chromosomes, so not only the gonosomes are affected by this issue. As for why VEP does this, I don't have any clue.

Jonas-B-Frank commented 1 month ago

I will test running it without.

One possible reason for the comma-separated list could be that the AF is requested twice in the vep command, once using --af and once using --{af_mode}. Furthermore, force_overwrite is duplicated in the command.

Marcel-Mueck commented 1 month ago

Good catch with the duplicates. If you run vep without the --af flag, you will have to remove the

'AF' :
    'AF' : 0

entry from pipelines/config/annotation_colnames_filling_values.yaml for the annotation pipeline to finish successfully :)

Jonas-B-Frank commented 1 month ago

When dropping the AF as you described, the annotation pipeline runs without error :)

Jonas-B-Frank commented 1 month ago

When running the seed gene discovery pipeline however, I get an error during association_dataset:

conda info | grep "active environment" && seed_gene_pipeline make-dataset --pickled-dataset-file ALS/plof/association_dataset_pickled.pkl ALS/plof/config.yaml ALS/plof/association_dataset_full.pkl
Traceback (most recent call last):
  File "PATH/deeprvat/utils.py", line 257, in safe_merge
    assert len(merged) == len(left)
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 590, in make_dataset
    _, ds = make_dataset_(
  File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 541, in make_dataset_
    dataset = DenseGTDataset(
  File "PATH/deeprvat/data/dense_gt.py", line 212, in __init__
    self.setup_variants(min_common_variant_count, min_common_af, variants)
  File "PATH/deeprvat/data/dense_gt.py", line 616, in setup_variants
    variants_with_af = safe_merge(
  File "PATH/deeprvat/utils.py", line 259, in safe_merge
    raise RuntimeError(
RuntimeError: Merged dataframe has 11610442 rows, left dataframe has 12733020
[Wed May 29 08:30:47 2024]

It seems like variants.parquet and complete_annotations.parquet have a different number of unique variants. I checked it manually and got 12733020 rows in variants.parquet and 11610443 unique ids in complete_annotations.parquet. I was wondering if this might have something to do with my altered preprocessing QC. I eliminated

qc_hwe=expand(rules.qc_hwe.output, vcf_stem=vcf_stems),
qc_indmiss_samples=rules.process_individual_missingness.output,

from the input of rule preprocess and this from the shell section:

f"--exclude-variants {qc_hwe_dir}",
f"--exclude-samples {qc_filtered_samples_dir}",

commenting out the corresponding rules in the qc snakefile. I thought this would be fine before getting this error.

Grateful for your insights, best

Marcel-Mueck commented 1 month ago

Hey Jonas, could you check the 'vep_deepripe.parquet' file in the annotation output directory for how many unique variants there are to help pin down the issue?

Jonas-B-Frank commented 1 month ago

It's 11610443 as well.

Marcel-Mueck commented 1 month ago

Hey Jonas, thank you for helping to pin it down. One thing that might have gone wrong is that the VCF files used as input for the annotation pipeline do not match the normalized variants in the variants.parquet file. To make sure the annotation input is normalized the same way the variants are, I would recommend using the BCF files in preprocessing/norm/bcf as input for the annotation pipeline (i.e., setting the source_variant_dir variable in the annotation config to preprocessing_workdir/norm/bcf/ and source_variant_file_type to bcf). One other thing to make sure of is that the list in the included chromosomes variable matches between the preprocessing and annotation configs.
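
Based on the comment above, the relevant annotation config entries might look like this (a sketch; the exact placement of these keys in the config file is an assumption):

```yaml
# Point the annotation pipeline at the already-normalized BCFs
source_variant_dir: preprocessing_workdir/norm/bcf/
source_variant_file_type: bcf
# the included chromosomes list must match the preprocessing config
```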

Jonas-B-Frank commented 1 month ago

Using the normalized BCF files might be the solution for me. I am testing this and will get back to you. Thank you for your help!

Jonas-B-Frank commented 1 month ago

I once again get an error in the merge_annotations rule, this time regarding the process_vep function:

Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')())
  File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/miniconda3/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deeprvat/annotations/annotations.py", line 1650, in merge_annotations
    vep_df = process_vep(
  File "/hpc/gpfs2/scratch/g/bioinf-neuro/frajonas/deepRVAT_new/deeprvat/annotations/annotations.py", line 1754, in process_vep
    assert (
AssertionError

This error occurs for all processed chromosomes. I checked the counts of unique values of the uploaded variation column (in the form of chr_pos_ref_alt) in annotations/tmp/chr_variants.vcf and annotations/chr_vep_anno.tsv and they match.

I did notice, however, that multiallelic variants which have been correctly normalized into two entries with different alt alleles in the variants.vcf file still share the same uploaded variation column, e.g.: chr1_15274_A_G;chr1_15274_A_T

For this variant, in annotations/chr_vep_anno.tsv I only find one of the two variant ids in the uploaded variation column, with the A-to-T variant no longer being listed there (see below):

chr1_15274_A_G  1:15274 G
chr1_15274_A_G  1:15274 T

Maybe this leads to the unexpected behaviour?

Marcel-Mueck commented 1 month ago

Dear Jonas, I looked into the issue and found a way to circumvent it by regenerating the IDs for the VCFs that are used as input for VEP. This makes sure that each variant has a unique ID and is named accordingly in the VEP output. I created a PR which should be merged into main soon :)
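
The idea behind the fix can be sketched like this (a toy table, not the actual PR code; on VCF/BCF files the analogous step can be done with bcftools annotate --set-id):

```python
import pandas as pd

# Toy normalized records: one multiallelic site split into two ALTs that
# previously shared a single "Uploaded_variation" id.
records = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "pos": [15274, 15274],
    "ref": ["A", "A"],
    "alt": ["G", "T"],
})

# Regenerate a chrom_pos_ref_alt id per record so each variant is unique.
records["id"] = (
    records["chrom"] + "_" + records["pos"].astype(str)
    + "_" + records["ref"] + "_" + records["alt"]
)
print(records["id"].tolist())  # ['chr1_15274_A_G', 'chr1_15274_A_T']
assert records["id"].is_unique
```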

Jonas-B-Frank commented 1 month ago

Hey Marcel, the annotation pipeline runs through with your fix. Thanks!

Jonas-B-Frank commented 1 month ago

Hello all, after running the annotations pipeline without error, in the seed gene pipeline in regress_plof, I get an error related to #70:

Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 641, in run_association
    all_samples = np.array([int(i) for i in data_full["sample"]])
  File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 641, in <listcomp>
    all_samples = np.array([int(i) for i in data_full["sample"]])
ValueError: invalid literal for int() with base 10: 'str_sample_name'

Sample names are cast to integers, which of course throws an error for string sample names.

Furthermore, I adjusted seed_gene_discovery.py to incorporate variants with 0 <= AF < 0.001:

rare_threshold_config[maf_column] = (
    f"{maf_column} < {rare_maf} and {maf_column} >= 0"
)

Otherwise, all variants are filtered out. I checked the AF values and I only have AF = 0 and AF >= 0.05. I am currently running on a test dataset, which might cause this, but the 0.05 threshold seems at least suspicious.
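
For illustration, the effect of the adjusted lower bound on toy AF values (the original strict "> 0" bound is an assumption inferred from the description above; the query strings mirror the condition with maf_column/rare_maf filled in):

```python
import pandas as pd

# Toy AF values like the ones observed (only 0 and >= 0.05 present).
df = pd.DataFrame({"AF": [0.0, 0.0, 0.05, 0.31]})

strict = df.query("AF < 0.001 and AF > 0")     # assumed original-style bound
adjusted = df.query("AF < 0.001 and AF >= 0")  # keeps the AF == 0 variants
print(len(strict), len(adjusted))  # 0 2
```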

Also, I commented out one line in the DataLoader call, since batch_size is otherwise passed twice to the function (it is defined in the config file and has different values for different steps, which is why I didn't remove it from the config file).

dataloader = DataLoader(
    dataset,
    # batch_size=batch_size,  # commented out: batch_size comes from dataloader_config
    collate_fn=dataset.collate_fn,
    **data_config["dataloader_config"],
)

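
The duplicate-argument problem is generic Python behavior, sketched here with a hypothetical make_loader stand-in (not the actual torch DataLoader):

```python
# Hypothetical stand-in for DataLoader: any function that receives the same
# keyword both explicitly and via **config raises a TypeError.
def make_loader(dataset, batch_size=1, **kwargs):
    return {"dataset": dataset, "batch_size": batch_size, **kwargs}

config = {"batch_size": 64, "num_workers": 2}

try:
    make_loader("ds", batch_size=32, **config)
except TypeError as err:
    print(err)  # ... got multiple values for keyword argument 'batch_size'

# An alternative to commenting the argument out: drop it from a config copy,
# so the explicit argument wins.
cfg = {k: v for k, v in config.items() if k != "batch_size"}
loader = make_loader("ds", batch_size=32, **cfg)
print(loader["batch_size"])  # 32
```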
HolEv commented 3 weeks ago

Hi @Jonas-B-Frank, thank you for reporting this issue. @endast and I have addressed the issue related to casting sample IDs to integers in #108 and merged the changes into main. Sample IDs are no longer cast to integers.

Regarding your second issue, the removal of all variants: this seems to be related to something odd in your data set (as you said, maybe related to you using a test data set). I don't quite understand how you can have variants with AF = 0, since they should get filtered out during preprocessing. Still, let us know if we can help with that!