brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Feature request: output per site info (e.g. genotype, depth) #36

Open fgvieira opened 4 years ago

fgvieira commented 4 years ago

Dear all,

would it be possible to get more detailed per-site info for QC? Right now somalier only outputs per-sample and per-sample-pair info.

There is already something similar in depthview, but it is very broad and only available as HTML. Would it be possible to get that info in a TSV as well? Maybe reporting, for each site (rows) and each individual (columns), the coverage for each allele as well as somalier's called genotype.

thanks,

brentp commented 4 years ago

is this still needed? i am very hesitant to add this, but it could be a debug option.

fgvieira commented 4 years ago

I agree that it would be nice to have as a debug option.

asp8200 commented 3 years ago

I would also appreciate more detailed per-site info in the output from somalier.

Would it perhaps be possible for the extract function, in addition to the .somalier files, to output a TSV file (or some other text-format file) with genomic positions, read counts, REF, ALT and genotype calls, that is, something like the following:

chr   position   nref  nalt  nother  REF  ALT  GT
chr2  20616424   184   171   1       C    T    HET
chr4  165697039  0     328   0       G    T    HOM_ALT
chr4  190318079  290   0     0       C    G    HOM_REF
chr6  165045333  0     283   0       G    T    HOM_ALT
...

brentp commented 3 years ago

Hi, you can write this using a simple Python script that accepts the sites file and a somalier file (or many somalier files). Here is a function that will read the sites data into a Python structure for you: https://github.com/brentp/somalier/blob/master/scripts/ancestry-predict.py#L7

`sites` is an array with n_sites rows and 2 columns, where the first column is ref depth and the second is alt depth.
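
For anyone who wants a starting point, here is a minimal sketch of the kind of script described above. It does not read .somalier files itself. It assumes you already have the depths as a sequence of (ref depth, alt depth) pairs, for example the `sites` array from the linked function, plus a list of (chrom, pos, ref, alt) tuples parsed from the sites file in the same row order. That ordering is an assumption worth checking against your somalier version. The names `call_genotype`, `write_site_table` and `site_table_as_text` are hypothetical helpers, and the allele-balance cutoffs and minimum depth are illustrative values, not somalier's own. There is no `nother` column because only ref and alt depths are described above.

```python
import csv
import io

# Illustrative allele-balance cutoffs. These are assumptions for this sketch,
# NOT the thresholds somalier uses internally.
HOM_REF_MAX_AB = 0.1
HET_MIN_AB = 0.3
HET_MAX_AB = 0.7
HOM_ALT_MIN_AB = 0.9


def call_genotype(nref, nalt, min_depth=7):
    """Label a site from ref/alt depths using simple allele-balance cutoffs."""
    depth = nref + nalt
    if depth < min_depth:
        return "UNKNOWN"
    ab = nalt / depth  # fraction of reads supporting the alt allele
    if ab <= HOM_REF_MAX_AB:
        return "HOM_REF"
    if ab >= HOM_ALT_MIN_AB:
        return "HOM_ALT"
    if HET_MIN_AB <= ab <= HET_MAX_AB:
        return "HET"
    return "UNKNOWN"


def write_site_table(site_records, depths, out):
    """Write one TSV row per site to the file-like object `out`.

    site_records: iterable of (chrom, pos, ref, alt) tuples (hypothetical layout).
    depths: sequence of (nref, nalt) pairs, assumed to be in the same order.
    """
    site_records = list(site_records)
    if len(site_records) != len(depths):
        raise ValueError("site records and depths differ in length")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["chr", "position", "nref", "nalt", "REF", "ALT", "GT"])
    for (chrom, pos, ref, alt), (nref, nalt) in zip(site_records, depths):
        nref, nalt = int(nref), int(nalt)  # also accepts numpy integer types
        writer.writerow([chrom, pos, nref, nalt, ref, alt, call_genotype(nref, nalt)])


def site_table_as_text(site_records, depths):
    """Convenience wrapper that returns the TSV as a string."""
    buf = io.StringIO()
    write_site_table(site_records, depths, buf)
    return buf.getvalue()
```

To write a file instead of a string, pass an open file handle as `out`, e.g. `write_site_table(records, depths, open("sample.sites.tsv", "w", newline=""))`. For several samples you could call it once per .somalier file and write one TSV per sample.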