brentp / slivar

genetic variant expressions, annotation, and filtering for great good.
MIT License
248 stars 23 forks source link

Wildcards in group subsetting? #164

Open phillip-richmond-umoja opened 8 months ago

phillip-richmond-umoja commented 8 months ago

Question not bug:

Can you select wildcards for subsetting across samples for a genotype?

My use case: Single ancestor line, multiple clones derived from it. Alias file:

#ancestor   clone
NH50191 C3
NH50191 C56
NH50191 C3-1
NH50191 C56-1

Code chunk with subset expression:

## slivar filter 
    slivar expr \
    --vcf $vcf \
    --alias $alias_samplesheet \
    --pass-only \
    --info 'INFO.impactful' \
    --group-expr "Clonal_impact: INFO.gnomad_popmax_af <= 0.0001 && ancestor.AD[1] <= $max_ancestor_reads_for_clonal  && clone.AD[1] >= $min_clone_reads_for_clonal" \
       -o Clonal_impactful_snvindel.vcf.gz  

If I want any clone.AD[1] >= value; or all clone.AD[1] is that possible with the --group-expr function?

One workaround could be changing the alias file to

#ancestor   clone1    clone2    clone3    clone4
NH50191 C3    C3-1    C56    C56-1

and expanding the group expression to

--group-expr "Clonal_impact: INFO.gnomad_popmax_af <= 0.0001 && ancestor.AD[1] <= $max_ancestor_reads_for_clonal  && clone1.AD[1] >= $min_clone_reads_for_clonal || clone2.AD[1] >= $min_clone_reads_for_clonal || clone3.AD[1] >= $min_clone_reads_for_clonal || clone4.AD[1] >= $min_clone_reads_for_clonal ||" \

But that requires that || (or) is implemented in your subsetting code.

Any advice?

Thanks, Phil

For example, I want to access AD[1] or DP acro

brentp commented 8 months ago

Hi Phil, does your last expression not work as expected? The expressions are javascript so || should work fine.

phillip-richmond-umoja commented 8 months ago

Matter of fact it does. This works fine

    slivar expr \
    --vcf $vcf \
    --alias $alias_samplesheet_separate \
    --pass-only \
    --info 'INFO.impactful' \
    --group-expr "Clonal_impact: INFO.gnomad_popmax_af <= 0.0001 && ancestor.AD[1] <= $max_ancestor_reads_for_clonal && \
    ( clone1p1.AD[1] >= $min_clone_reads_for_clonal || clone1p2.AD[1] >= $min_clone_reads_for_clonal || clone2p1.AD[1] >= $min_clone_reads_for_clonal || clone2p2.AD[1] >= $min_clone_reads_for_clonal)" \
       -o Clonal_impactful_snvindel_withOR.vcf.gz   

But problem here is I have to know the names of each clone, as opposed to a more generalized "clone" label. Just curious if there is any such operation as clone.each or clone.any or clone.all when referring back to a sample from the column titled "clone". If not I can keep it hard coded in the pipeline...such is the challenge of pipeline flexibility to differing inputs.

If no such operation exists it's a bummer and you can close the thread...it's been tough to move away from GEMINI in python2...and I'm missing operations like https://gemini.readthedocs.io/en/latest/content/querying.html#gt-filter-wildcard-filtering-on-genotype-columns

Thanks for a great tool though! -Phil

brentp commented 8 months ago

I see. I think you can use:

ancestor.every(function(a) { a.AD[1] < 1}) && clone.some(function(c) { c.AD[1] < 0 })

that expression probably doesn't do anything useful, but shows how you'd use every and some

I'm still confused about exactly what you want to do. usually you'd have:

#ancestors    clones
A1,A2,A3        C1,C2,C3,C4...

then ancesors and clones will both be javascript arrays (because column ends in 's'.