epi2me-labs / wf-human-variation

Valid SNPIDs missing in vcf file #221

Open matomol opened 1 month ago

matomol commented 1 month ago

Operating System

Ubuntu 22.04

Other Linux

No response

Workflow Version

latest

Workflow Execution

Command line (Local)

Other workflow execution

No response

EPI2ME Version

No response

CLI command run

    nextflow run epi2me-labs/wf-human-variation \
    --out_dir ${VAR_DIR} \
    -w ${WORK_DIR} \
    --bam $ALN_DIR \
    --ref ${REFERENCE[$ARG1]} \
    --sample_name ${ARG0} \
    --bed $BED_REFERENCE/${REF_TYPE[$ARG1]}/hg38bed.bed \
    --bam_min_coverage 5 \
    --snp \
    --sv \
    --mod \
    --phased \
    --cnv \
    --str \
    -profile standard

Workflow Execution - CLI Execution Profile

standard (default)

What happened?

I analysed the snp.vcf file and found that, although the variants are correctly annotated, the SNPID is missing. This is particularly true for SNPIDs with higher numbers, so I assume that an outdated SNP reference database is being used.

Relevant log output

Here are some examples:
The following SNPs are correctly annotated and have the correct SNPID attached:
snpid   alleles     reference   alternatives
5930    (A, G)      A           (G,)
5927    (A, G)      A           (G,)

The following variants are correctly annotated, but the proper SNPID is missing:
rs45508991,  rs72658861, rs11669576

Application activity log entry

There is nothing unusual. The output is of a normal basecalling.

Were you able to successfully run the latest version of the workflow with the demo data?

yes

Other demo data information

no

matomol commented 1 month ago

Sorry, a typo. Not basecalling but variant calling, of course.

vlshesketh commented 4 weeks ago

Hi @matomol, apologies for the delay in responding. Please can you provide a bit more information so I can assist you better - by 'snpid', do you mean the dbSNP identifier? We perform annotation with SnpEff as follows: first to add basic annotations, and then to annotate using ClinVar. The ClinVar VCF we use is out of date so we are in the process of updating that, but there won't be any dbSNP/rsIDs in the output VCFs as we are not using this dataset to annotate.
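Since the workflow output carries no dbSNP identifiers, rsIDs would have to be added in a separate post-processing step. Dedicated tools such as `bcftools annotate` are the usual route for this. Purely as an illustration of the idea, here is a minimal Python sketch that fills an empty ID column from a lookup keyed on chromosome, position, REF and ALT. The function name `fill_missing_ids`, the lookup structure, and the `chrTest`/`rsEXAMPLE1` records are hypothetical placeholders, not part of wf-human-variation and not real variants.

```python
def fill_missing_ids(vcf_lines, id_lookup):
    """Yield VCF lines, filling an empty ID column ('.') from id_lookup.

    id_lookup is a hypothetical dict mapping (chrom, pos, ref, alt) to an
    identifier string, e.g. built beforehand from a dbSNP VCF.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, var_id, ref, alt = fields[:5]
        if var_id == ".":
            # A record may list several ALT alleles; collect every match.
            hits = []
            for allele in alt.split(","):
                key = (chrom, int(pos), ref, allele)
                if key in id_lookup:
                    hits.append(id_lookup[key])
            if hits:
                fields[2] = ";".join(hits)
        yield "\t".join(fields) + "\n"


# Made-up records for illustration only (not real coordinates or rsIDs).
lookup = {("chrTest", 100, "A", "G"): "rsEXAMPLE1"}
records = [
    "##fileformat=VCFv4.2\n",
    "chrTest\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chrTest\t200\t.\tC\tT\t.\tPASS\t.\n",
]
annotated = list(fill_missing_ids(records, lookup))
```

Matching on REF and ALT as well as position avoids attaching an identifier to a different allele at the same site. The sketch assumes both files use the same reference build and chromosome naming.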

matomol commented 4 weeks ago

Please find below the summary that I prepared for just one gene, LDLR. Similar statistics will be prepared for all the genes on the Illumina panel once we have lifted it over successfully.

"The ClinVar VCF we use is out of date so we are in the process of updating that"

Well, maybe that will solve most of the problem.

Only the following two SNPs in the region tested are correctly annotated and have a SNPID attached:

snpid   alleles     reference   alternatives
5930    (A, G)      A           (G,)
5927    (A, G)      A           (G,)

The following variants were correctly found by both Illumina and Nanopore, but only Illumina attached the correct SNPID.

rs11669576 ANN = ('A|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_000527.5|protein_coding|8/18|c.1171G>A|p.Ala391Thr|1257/5173|1171/2583|391/860||', 'A|missense_variant|MODERATE|LDLR|LDLR|transcript|XM_011528010.2|protein_coding|8/17|c.1171G>A|p.Ala391Thr|1288/5126|1171/2505|391/834||', 'A|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_001195798.2|protein_coding|8/18|c.1171G>A|p.Ala391Thr|1257/5167|1171/2577|391/858||', 'A|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_001195799.2|protein_coding|7/17|c.1048G>A|p.Ala350Thr|1134/5050|1048/2460|350/819||', 'A|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_001195800.2|protein_coding|6/16|c.667G>A|p.Ala223Thr|753/4669|667/2079|223/692||', 'A|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_001195803.2|protein_coding|7/16|c.790G>A|p.Ala264Thr|876/4639|790/2049|264/682||', 'A|upstream_gene_variant|MODIFIER|MIR6886|MIR6886|transcript|NR_106946.1|pseudogene||n.-1850G>A|||||1850|', 'A|upstream_gene_variant|MODIFIER|MIR6886|MIR6886|transcript|unassigned_transcript_3212|miRNA||n.-1855G>A|||||1855|', 'A|upstream_gene_variant|MODIFIER|MIR6886|MIR6886|transcript|unassigned_transcript_3213|miRNA||n.-1887G>A|||||1887|', 'A|non_coding_transcript_exon_variant|MODIFIER|LDLR|LDLR|transcript|XR_001753685.2|pseudogene|8/18|n.1288G>A||||||', 'A|non_coding_transcript_exon_variant|MODIFIER|LDLR|LDLR|transcript|XR_001753686.2|pseudogene|8/17|n.1288G>A||||||')

rs72658861 ANN = ('C|splice_region_variant&intron_variant|LOW|LDLR|LDLR|transcript|NM_000527.5|protein_coding|7/17|c.1061-8T>C||||||', 'C|splice_region_variant&intron_variant|LOW|LDLR|LDLR|transcript|XM_011528010.2|protein_coding|7/16|c.1061-8T>C||||||', 'C|splice_region_variant&intron_variant|LOW|LDLR|LDLR|transcript|XR_001753685.2|pseudogene|7/17|n.1178-8T>C||||||', 'C|splice_region_variant&intron_variant|LOW|LDLR|LDLR|transcript|XR_001753686.2|pseudogene|7/16|n.1178-8T>C||||||', 'C|splice_region_variant&intron_variant|LOW|LDLR|LDLR|transcript|NM_001195798.2|protein_coding|7/17|c.1061-8T>C||||||', 'C|splice_region_variant&intron_variant|LOW|LDLR|LDLR|transcript|NM_001195799.2|protein_coding|6/16|c.938-8T>C||||||', 'C|splice_region_variant&intron_variant|LOW|LDLR|LDLR|transcript|NM_001195800.2|protein_coding|5/15|c.557-8T>C||||||', 'C|splice_region_variant&intron_variant|LOW|LDLR|LDLR|transcript|NM_001195803.2|protein_coding|6/15|c.680-8T>C||||||', 'C|upstream_gene_variant|MODIFIER|MIR6886|MIR6886|transcript|NR_106946.1|pseudogene||n.-1968T>C|||||1968|', 'C|upstream_gene_variant|MODIFIER|MIR6886|MIR6886|transcript|unassigned_transcript_3212|miRNA||n.-1973T>C|||||1973|', 'C|upstream_gene_variant|MODIFIER|MIR6886|MIR6886|transcript|unassigned_transcript_3213|miRNA||n.-2005T>C|||||2005|')

rs45508991 ANN = ('T|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_000527.5|protein_coding|15/18|c.2177C>T|p.Thr726Ile|2263/5173|2177/2583|726/860||', 'T|missense_variant|MODERATE|LDLR|LDLR|transcript|XM_011528010.2|protein_coding|15/17|c.2177C>T|p.Thr726Ile|2294/5126|2177/2505|726/834||', 'T|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_001195798.2|protein_coding|15/18|c.2177C>T|p.Thr726Ile|2263/5167|2177/2577|726/858||', 'T|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_001195799.2|protein_coding|14/17|c.2054C>T|p.Thr685Ile|2140/5050|2054/2460|685/819||', 'T|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_001195800.2|protein_coding|13/16|c.1673C>T|p.Thr558Ile|1759/4669|1673/2079|558/692||', 'T|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_001195803.2|protein_coding|13/16|c.1643C>T|p.Thr548Ile|1729/4639|1643/2049|548/682||', 'T|non_coding_transcript_exon_variant|MODIFIER|LDLR|LDLR|transcript|XR_001753685.2|pseudogene|15/18|n.2511C>T||||||', 'T|non_coding_transcript_exon_variant|MODIFIER|LDLR|LDLR|transcript|XR_001753686.2|pseudogene|14/17|n.2154C>T||||||')

Lastly, this SNP was not detected by Nanopore.

rs2738442 (NM_000527.5:c.1060+7C>A): not detected by Nanopore sequencing
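
The ANN values quoted above are pipe-delimited SnpEff annotation entries, one per affected transcript. For anyone comparing such entries across callers, a minimal Python sketch for splitting one entry into named fields follows. The field names are taken from the SnpEff ANN header as I understand it and should be checked against the `##INFO=<ID=ANN...>` line of the actual VCF. The helper name `parse_ann` is hypothetical.

```python
# Field order assumed from the standard SnpEff ANN header (16 sub-fields).
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank",
    "HGVS.c", "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]


def parse_ann(entry):
    """Split one pipe-delimited ANN entry into a dict keyed by field name."""
    values = entry.split("|")
    return dict(zip(ANN_FIELDS, values))


# First entry of the rs11669576 record quoted in this thread.
entry = (
    "A|missense_variant|MODERATE|LDLR|LDLR|transcript|NM_000527.5|"
    "protein_coding|8/18|c.1171G>A|p.Ala391Thr|1257/5173|1171/2583|391/860||"
)
record = parse_ann(entry)
print(record["Gene_Name"], record["Feature_ID"], record["HGVS.c"], record["HGVS.p"])
# prints: LDLR NM_000527.5 c.1171G>A p.Ala391Thr
```

Keying on `Feature_ID` plus `HGVS.c` gives a transcript-level description that can be compared between two call sets even when the ID column is empty in one of them.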