brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Suggestion: use groups info (same individual) to infer expected relatedness #26

Closed fgvieira closed 4 years ago

fgvieira commented 4 years ago

somalier relate has two command-line options that describe the expected relatedness between samples: --groups and --ped. From what I could see, the former assigns groups of samples that come from the same individual, while the latter specifies a pedigree from which relatedness between different individuals is inferred.

However, the information in --groups is not used when inferring relatedness from the pedigree. For example, if I have this --groups file:

normal0,tumor0

and this --ped file:

FAM001  normal0  normal1   normal2     2       -9
FAM001  normal1  0         0           1       -9
FAM001  normal2  0         0           2       -9

the inferred relatedness will be:

normal0 / normal1 = 0.5
normal0 / normal2 = 0.5
normal1 / normal2 = -1
tumor0 / normal1 = -1
tumor0 / normal2 = -1

However, it would be very nice if the last two could also be inferred to be 0.5, since tumor0 is the same individual as normal0.
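To make the request concrete, here is a minimal Python sketch of the propagation being asked for. It is not somalier's implementation; the function names are hypothetical, only parent-child pairs are handled, and the value 1.0 for samples in the same group is an assumption. Pairs with no declared relationship get -1, as in the example above.

```python
from itertools import combinations

UNKNOWN = -1.0  # value shown above for pairs with no declared relationship


def parse_groups(text):
    """Map every sample on a comma-separated group line to the first sample on that line."""
    rep = {}
    for line in text.strip().splitlines():
        members = [m.strip() for m in line.split(",") if m.strip()]
        for m in members:
            rep[m] = members[0]
    return rep


def parse_ped(text):
    """Return {sample: (father, mother)} from whitespace-delimited PED lines."""
    parents = {}
    for line in text.strip().splitlines():
        fields = line.split()
        parents[fields[1]] = (fields[2], fields[3])
    return parents


def expected_relatedness(ped_text, groups_text):
    """Expected relatedness for every sample pair, sharing pedigree info within groups."""
    rep = parse_groups(groups_text)
    raw = parse_ped(ped_text)

    def ident(sample):
        # Samples in the same group collapse to one individual.
        return rep.get(sample, sample)

    # Parents are recorded per individual, so every sample of that individual inherits them.
    parents = {}
    for child, (father, mother) in raw.items():
        known = {ident(p) for p in (father, mother) if p != "0"}
        parents.setdefault(ident(child), set()).update(known)

    out = {}
    for a, b in combinations(sorted(set(raw) | set(rep)), 2):
        ia, ib = ident(a), ident(b)
        if ia == ib:
            out[(a, b)] = 1.0  # assumption: same individual
        elif ib in parents.get(ia, ()) or ia in parents.get(ib, ()):
            out[(a, b)] = 0.5  # parent-child
        else:
            out[(a, b)] = UNKNOWN
    return out


ped = """
FAM001  normal0  normal1   normal2     2       -9
FAM001  normal1  0         0           1       -9
FAM001  normal2  0         0           2       -9
"""
rel = expected_relatedness(ped, "normal0,tumor0")
for pair in sorted(rel):
    print(pair, rel[pair])
```

Keys are sorted sample-name pairs, so the tumor0 / normal1 pair is stored as ("normal1", "tumor0").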

brentp commented 4 years ago

hi, thanks for the clear report. Please give the attached binary a try. It also relaxes the allele balance cutoff to 0.04 (and 0.96) from 0.02 (and 0.98). somalier.gz

fgvieira commented 4 years ago

Super, the groups seem to be working fine! :+1:

As for the thresholds, those seem to work a bit better (even for normal WGS where you remove PCR duplicates) but, for my specific case (cannot remove PCR duplicates), the proportion of other alleles also seems to be a bit too strict (0.04).

I made some quick checks, and removing only sites where proportion_other > 0.1 works better for me; but I can understand that under normal circumstances (i.e. when removing PCR duplicates) this might seem a bit too high.

Since sensible thresholds might depend on the specific library, I'd say the best would be to let the user define them on the command line (with sensible default values: ab_cutoff = 0.04; proportion_other = 0.04).
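As an illustration of how the two thresholds interact, here is a rough Python sketch of a per-site genotype call with both cutoffs exposed as parameters. It is not somalier's code: the function name, the way proportion_other is computed (other-allele reads over total depth), and the 0.2 to 0.8 heterozygous band are all assumptions made for the example.

```python
def call_genotype(n_ref, n_alt, n_other, ab_cutoff=0.04, max_other=0.04,
                  het_range=(0.2, 0.8)):
    """Return 0 (hom-ref), 1 (het), 2 (hom-alt) or -1 (unknown / site dropped)."""
    depth = n_ref + n_alt + n_other
    if depth == 0 or n_ref + n_alt == 0:
        return -1
    # Assumed definition: fraction of reads supporting neither ref nor alt.
    if n_other / depth > max_other:
        return -1
    ab = n_alt / (n_ref + n_alt)  # allele balance
    if ab < ab_cutoff:
        return 0
    if ab > 1 - ab_cutoff:
        return 2
    if het_range[0] <= ab <= het_range[1]:  # assumed het band
        return 1
    return -1


# 8% other-allele reads: dropped at 0.04, kept at 0.1
print(call_genotype(45, 47, 8))
print(call_genotype(45, 47, 8, max_other=0.1))
# allele balance 0.03: hom-ref at cutoff 0.04, unknown at the stricter 0.02
print(call_genotype(97, 3, 0))
print(call_genotype(97, 3, 0, ab_cutoff=0.02))
```

With parameters like these, a library where PCR duplicates cannot be removed could use a looser max_other without changing the defaults for everyone else.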

brentp commented 4 years ago

Thanks for verifying the groups+ped are correctly sharing information.

I'd rather not add more options unless necessary. I think setting proportion_other cutoff to 0.1 is reasonable. And I don't see any problems with that. Here is a binary with just that change. If that looks good to you, I'll prepare for next release.

It may become necessary in the future to expose these options in some way, but I think that's not needed for now. somalier.gz

fgvieira commented 4 years ago

Yep, looks much better now! :+1: thanks for your help!

brentp commented 4 years ago

this is fixed in v0.2.4