brentp / slivar

genetic variant expressions, annotation, and filtering for great good.
MIT License
248 stars 23 forks source link

Using groups with multiple samples results in error #88

Closed edg1983 closed 3 years ago

edg1983 commented 3 years ago

Hi, I have a multi-sample VCF containing 2 groups of individuals: highrisk and lowrisk. I want to annotate variants that are present in highrisk samples and absent from lowrisk ones.

To do so I've prepared a groups.txt file as follows:

#lowrisk        highrisk
T10945,T10926,T11258,T10965     T10967,T10947,T10940

Then, following the example in group wiki, I've prepared the following functions.js file:

function proportion_with_alt(samples) {
    var n = 0
    for(i=0; i < samples.length; i++){
      if(samples[i].alts == 1 || samples[i].alts == 2 && samples[i].GQ >=10){
          n += 1
      }
    }
    return n / samples.length
}

function proportion_with_ref(samples) {
    var n = 0
    for(i=0; i < samples.length; i++){
      if(samples[i].alts == 0 && samples[i].AB < 0.01 && samples[i].GQ >=10){
          n += 1
      }
    }
    return n / samples.length
}

However, when I try to run the slivar command (slivar v.0.2.1):

slivar expr --vcf input.vcf.gz --js functions.js --alias groups.txt --pass-only --skip-non-variable --group-expr "highonly:proportion_with_alt(highrisk) == 1 && proportion_with_ref(lowrisk) == 1"

I got this error, complaining that multiple samples are present in column 0, which is exactly what I want and I suppose this should work based on your wiki example for groups - multi-generational pedigree

Error: unhandled exception: slivar/groups:got > 1 sample in line T10945,T10926,T11258,T10965    T10967,T10947,T10940, column 0. @[Sample(id:T10945, i:0, affected:false, dad: "", mom: ""), Sample(id:T10926, i:8, affected:false, dad: "", mom: ""), Sample(id:T11258, i:9, affected:false, dad: "", mom: ""), Sample(id:T10965, i:14, affected:false, dad: "", mom: "")] [ValueError]

Thanks for support!

brentp commented 3 years ago

Hi, glad you're trying out the groups. Columns labels with multiple samples must end in 's'. So you can use:

#lowrisks        highrisks
T10945,T10926,T11258,T10965     T10967,T10947,T10940

(note lowrisks instead of lowrisk and same for highrisks).

I'll improve this error message to give more help. let me know any more questions.

edg1983 commented 3 years ago

Great thanks! I completely missed the "s" was mandatory in column names.

I want also to report a couple of minor issues:

  1. When running pslivar I got progress only for chr1

    [slivar] 17 samples matched in VCF and PED to be evaluated
    [slivar] 10000 chr1:2989724 evaluated 10000 variants in 1.6 seconds (6355.4/second)
    [slivar] 20000 chr1:5044483 evaluated 10000 variants in 1.3 seconds (7864.9/second)
    [slivar] 100000 chr1:24159830 evaluated 100000 variants in 10.2 seconds (9771.0/second)
    [slivar] Finished. evaluated 103042 total variants and wrote 168 variants that passed your slivar expressions.
    [slivar] wrote summary table to:/tmp/pslivar118847801/0-0a.summary.tsv

    Then nothing else until the end of the process. This is not a real issue, but first time I used pslivar I was unsure if it completed successfully given the above output.

  2. I've found this "please report" warning (annotations from snpEff), so I'm pasting it here:

    unknown impact "non_canonical_start_codon" from csq "C|initiator_codon_variant&non_canonical_start_codon|LOW|VPS13A|ENSG00000197969|transcript|ENST00000467124.1|protein_coding|1/3|c.1T>C|p.Leu1?|3/324|1/123|1/40||"
brentp commented 3 years ago

I updated the error message to be more informative. For 1. I see what you mean. That is a big confusing. You did see something like: [slivar] writing final summary file to ... at the end of the run, right? I'll think about how to improve the output here. It was really to give an idea of how fast things are going but it's only reported for the first thread.

For 2. This was reported and will be out in next release. (see f55286a)

Thanks for reporting all of these.

edg1983 commented 3 years ago

Happy to help! One last question. Which is the best approach to extract variants using slivar tsv after the above analysis with group expression? I've tried providing a fake PED file, but I do not get the genotypes. Is there any way to allow slivar tsv to read groups from the group file and report genotypes per group?

brentp commented 3 years ago

slivar tsv currently only works on trios. I'll open a new issue for that. It could indeed be useful to expand it.