brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

known mouse sites #91

Open igordot opened 2 years ago

igordot commented 2 years ago

This is probably more of a question for other users than for the developers. Known polymorphic sites are provided for the human hg19 and hg38 genomes. Is there a version available for mouse? If not, as seems likely, is there somewhere one can get a mouse population VCF with all the info that find-sites needs?

brentp commented 2 years ago

Hi, did you try dbSNP? If that doesn't work, I can try to help update find-sites so that it will.

igordot commented 2 years ago

I tried dbSNP (ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz) and EVA RefSNP (ftp://ftp.ebi.ac.uk/pub/databases/eva/rs_releases/release_2/by_species/mus_musculus/GRCm38.p4/GCA_000001635.6_current_ids.vcf.gz).

It runs without errors, but also without results:

somalier version: 0.2.15
on chrom:1
on chrom:10
on chrom:11
on chrom:12
on chrom:13
on chrom:14
on chrom:15
on chrom:16
on chrom:17
on chrom:18
on chrom:19
on chrom:2
on chrom:3
on chrom:4
on chrom:5
on chrom:6
on chrom:7
on chrom:8
on chrom:9
on chrom:MT
on chrom:X
on chrom:Y
0 candidate variants
sorted and filtered to 0 variants. now dropping INFOs and writing
[somalier] wrote 0 variants to:sites.vcf.gz

brentp commented 2 years ago

Looks like find-sites requires AF in the INFO field right now. You could post-process the 00-All.vcf.gz to add AF from the bitfield, as described here: https://www.biostars.org/p/3877/#107953 The other VCF that you linked doesn't have AF, encoded or otherwise, so it's probably not a good one to use.

brentp commented 2 years ago

Maybe you could also use this: https://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/genotype/SC_MOUSE_GENOMES.genotype.vcf.gz and add an AF computed from the genotypes on each line.

The relatedness calculation won't work very well without heterozygotes, but you should still see clear clustering of samples based on IBS0 and IBS2.
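
To make the IBS0/IBS2 idea concrete, here is a minimal Python sketch that counts both for a pair of samples. It assumes the usual definitions (IBS0: one sample is hom-ref and the other hom-alt; IBS2: both samples have the same genotype) and diploid GT strings. The function names are made up for illustration, and this is not somalier's implementation.

```python
def alt_count(gt):
    """Number of non-reference alleles (0, 1 or 2) in a diploid GT string.

    Returns None for missing or non-diploid genotypes such as "./." or "1".
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def ibs_counts(gts_a, gts_b):
    """Count IBS0 and IBS2 sites for two samples genotyped at the same sites."""
    ibs0 = 0
    ibs2 = 0
    for gt_a, gt_b in zip(gts_a, gts_b):
        count_a, count_b = alt_count(gt_a), alt_count(gt_b)
        if count_a is None or count_b is None:
            continue  # skip sites where either genotype is missing
        if count_a == count_b:
            ibs2 += 1  # same genotype class in both samples
        elif {count_a, count_b} == {0, 2}:
            ibs0 += 1  # opposite homozygotes: no shared allele
    return ibs0, ibs2


# Hypothetical genotypes for two mostly homozygous (inbred-like) samples.
sample_a = ["0/0", "1/1", "0/0", "1/1"]
sample_b = ["0/0", "1/1", "1/1", "./."]
print(ibs_counts(sample_a, sample_b))
```

With few or no heterozygotes, nearly every informative site lands in one of these two bins, so samples from the same strain should show high IBS2 and low IBS0.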

igordot commented 2 years ago

Thank you for following up. That's an interesting idea about combining with the genotypes VCF.

I don't usually work with VCFs, let alone write them. Do you know if there is an easier way of doing that than parsing each line manually?
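
For reference, parsing each line manually is not much code. Below is a rough Python sketch (standard library only) that computes AF from the GT fields and appends it to INFO. The function names and file paths are placeholders, and it has not been checked against what find-sites actually accepts, so treat it as a starting point. The bcftools fill-tags plugin is an existing tool for the same job and is probably the easier route if bcftools is available.

```python
import gzip

AF_HEADER = ('##INFO=<ID=AF,Number=A,Type=Float,'
             'Description="Allele frequency computed from sample genotypes">')


def add_af(line):
    """Return one VCF data line with AF (one value per ALT) appended to INFO."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10 or "GT" not in fields[8].split(":"):
        return "\t".join(fields)  # no genotypes to count; leave unchanged
    gt_idx = fields[8].split(":").index("GT")
    alt_counts = [0] * (fields[4].count(",") + 1)
    total = 0
    for sample in fields[9:]:
        gt = sample.split(":")[gt_idx]
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing allele: not counted in the denominator
            total += 1
            if allele != "0":
                alt_counts[int(allele) - 1] += 1
    if total == 0:
        return "\t".join(fields)
    af = ",".join(f"{count / total:.4g}" for count in alt_counts)
    fields[7] = f"AF={af}" if fields[7] in (".", "") else f"{fields[7]};AF={af}"
    return "\t".join(fields)


def add_af_to_vcf(in_path, out_path):
    """Stream a gzipped VCF, declaring AF in the header and adding it per line."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if line.startswith("##"):
                dst.write(line)
            elif line.startswith("#CHROM"):
                dst.write(AF_HEADER + "\n")  # declare AF before the column row
                dst.write(line)
            else:
                dst.write(add_af(line) + "\n")


# Example call (paths are placeholders):
# add_af_to_vcf("SC_MOUSE_GENOMES.genotype.vcf.gz", "with_af.vcf.gz")
```

Note that Python's gzip module writes plain gzip, not bgzip, so the output may need to be recompressed with bgzip and indexed before other tools will read it. For inbred strains the resulting AF is really the fraction of strains carrying the allele, not a population frequency.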