brentp / slivar

genetic variant expressions, annotation, and filtering for great good.
MIT License
252 stars 23 forks source link

Column labels in annotation files #131

Closed priyambial123 closed 2 years ago

priyambial123 commented 2 years ago

Hello

Using the slivar, the small variants in the whole genome sequence (from PacBio) was annotated. I have some doubts on the some of the column labels in the annotated file.

In the genotype column, the number 1 and 2 indicates the number of alternate alleles? In the gnomad_ad, what does the number -1 mean, is it unknown allele frequency? From which populations was the allele frequency was determined?

In the gnomad_homalt and in gnomad_ac columns, what is -1?. I read the paper (doi.org/10.1038/s41525-021-00227-3) and I understand that Gnomad_homalt values are from GRCh38 assembly, does the value denote here?

Thanks Priya

brentp commented 2 years ago

Hi, this depends on the commands that you used for slivar expr and (I assume) slivar tsv. If you can share those, then I can help explain the columns. I expect that gnomad_nhomalt is the number across all populations.

-1 indicates that the variant is not present (to differentiate from 0, where the variant was present and then filtered) in gnomAD

for genotypes, yes, 1 and 2 indicate the number of alternate alleles.

priyambial123 commented 2 years ago

The below rules was run in the snakemake workflow

slivar_filters = [
        f"""--info 'variant.FILTER==\"PASS\" \
                && INFO.gnomad_af <= {config['max_gnomad_af']} \
                && INFO.hprc_af <= {config['max_hprc_af']} \
                && INFO.gnomad_nhomalt <= {config['max_gnomad_nhomalt']} \
                && INFO.hprc_nhomalt <= {config['max_hprc_nhomalt']}'""",
        "--family-expr 'recessive:fam.every(segregating_recessive)'",
        "--family-expr 'x_recessive:(variant.CHROM == \"chrX\") && fam.every(segregating_recessive_x)'",
        f"""--family-expr 'dominant:fam.every(segregating_dominant) \
                       && INFO.gnomad_ac <= {config['max_gnomad_ac']} \
                       && INFO.hprc_ac <= {config['max_hprc_ac']}'""",
        f"""--family-expr 'x_dominant:(variant.CHROM == \"chrX\") \
                       && fam.every(segregating_dominant_x) \
                       && INFO.gnomad_ac <= {config['max_gnomad_ac']} \
                       && INFO.hprc_ac <= {config['max_hprc_ac']}'""",
]
if singleton:
    # singleton
    slivar_filters.append(f"--sample-expr 'comphet_side:sample.het && sample.GQ > {config['min_gq']}'")
else:
    # trio cohort
    slivar_filters.append("--trio 'comphet_side:comphet_side(kid, mom, dad) && kid.affected'")
rule slivar_small_variant:
    input:
        bcf = f"cohorts/{cohort}/slivar/{cohort}.{ref}.deepvariant.phased.norm.bcf",
        csi = f"cohorts/{cohort}/slivar/{cohort}.{ref}.deepvariant.phased.norm.bcf.csi",
        ped = f"cohorts/{cohort}/{cohort}.ped",
        gnomad_af = {config['ref']['gnomad_gnotate']},
        hprc_af = {config['ref']['hprc_dv_gnotate']},
        js = config['slivar_js'],
        gff = config['ref']['ensembl_gff'],
        ref = config['ref']['fasta']
    output: f"cohorts/{cohort}/slivar/{cohort}.{ref}.deepvariant.phased.slivar.vcf"
    log: f"cohorts/{cohort}/logs/slivar/filter/{cohort}.{ref}.deepvariant.phased.slivar.vcf.log"
    benchmark: f"cohorts/{cohort}/benchmarks/slivar/filter/{cohort}.{ref}.deepvariant.phased.slivar.tsv"
    params: filters = slivar_filters
    threads: 12
    conda: "envs/slivar.yaml"
    message: "Executing {rule}: Annotating {input.bcf} and applying filters."
    shell:
        """
        (pslivar --processes {threads} \
            --fasta {input.ref}\
            --pass-only \
            --js {input.js} \
            {params.filters} \
            --gnotate {input.gnomad_af} \
            --gnotate {input.hprc_af} \
            --vcf {input.bcf} \
            --ped {input.ped} \
            | bcftools csq -l -s - --ncsq 40 \
                -g {input.gff} -f {input.ref} - -o {output}) > {log} 2>&1

Above rule was taken from this link: https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake/blob/main/rules/cohort_svpack.smk

Thanks

brentp commented 2 years ago

Yes, so what I have described above should hold. The values from gnomad are simply from pooling all samples and reporting, e.g. allele frequency or number of homalts.

priyambial123 commented 2 years ago

Thank you. So, the value of 2 in gnomad_nhomalt means that number of alternate alleles is same in 2 different populations?

brentp commented 2 years ago

the value of 2 in gnomad_nhomalt means that 2 samples in gnomad were homozygous for the alternate allele at that site.

priyambial123 commented 2 years ago

Thank you