brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

sites.vcf for chm13v2 (T2T) reference #107

Open kpalin opened 1 year ago

kpalin commented 1 year ago

Attached is a sites.chm13v2.vcf.gz file for the chm13v2 reference that is approximately compatible with the provided sites files. It will likely produce 5 conflicts, for the following variants whose location in the new reference is not obvious (i.e. liftover fails):

chr1    248522418       chr1_248522418_A_T
chrX    104019658       chrX_104019658_G_A
chrX    149715688       chrX_149715688_C_T
chrY    6246522         chrY_6246522_A_G
chrY    22513968        chrY_22513968_T_C

brentp commented 1 year ago

Hi Kimmo, thanks for creating this. What is in those 5 sites now? Are they included or removed? Since there's only 1 in the autosome and the chrX and Y are only used for depth, then I think this will be fine.

kpalin commented 1 year ago

They are "nearby" sites with the appropriate reference allele. The new coordinates are below.

chr1    247970375       chr1_248522418_A_T
chrX    102460255       chrX_104019658_G_A
chrX    147953346       chrX_149715688_C_T
chrY    10226948        chrY_6246522_A_G
chrY    22907941        chrY_22513968_T_C
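
As a small illustration of the table above, the following sketch parses each row, recovers the original coordinate and alleles encoded in the ID column (assumed here to follow a `chrom_pos_ref_alt` pattern with the GRCh38 position, as the IDs suggest), and prints the difference between the two coordinate values. The names `MovedSite`, `ROWS`, and `parse_site_row` are hypothetical and not part of somalier. The two positions refer to different assemblies, so the difference is only a bookkeeping number, not a physical distance.

```python
from typing import NamedTuple


class MovedSite(NamedTuple):
    chrom: str
    new_pos: int  # position used in the chm13v2 sites file
    old_pos: int  # position encoded in the ID (assumed GRCh38)
    ref: str
    alt: str


# Rows copied from the comment above: chrom, new position, ID.
ROWS = """\
chr1 247970375 chr1_248522418_A_T
chrX 102460255 chrX_104019658_G_A
chrX 147953346 chrX_149715688_C_T
chrY 10226948 chrY_6246522_A_G
chrY 22907941 chrY_22513968_T_C
"""


def parse_site_row(line: str) -> MovedSite:
    """Split one whitespace-separated row and decode its ID column."""
    chrom, pos, site_id = line.split()
    id_chrom, old_pos, ref, alt = site_id.split("_")
    if id_chrom != chrom:
        raise ValueError(f"chromosome mismatch in {line!r}")
    return MovedSite(chrom, int(pos), int(old_pos), ref, alt)


SITES = [parse_site_row(row) for row in ROWS.splitlines() if row.strip()]

for site in SITES:
    shift = site.new_pos - site.old_pos
    print(f"{site.chrom}\t{site.old_pos} -> {site.new_pos}\t{shift:+d}\t{site.ref}>{site.alt}")
```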

I got the following with 89 Gbp of Nanopore WGS after re-basecalling, comparing the alignments to chm13v2 and GRCh38:

#sample_a       sample_b        relatedness     hom_concordance hets_a  hets_b  shared_hets     hom_alts_a      hom_alts_b      shared_hom_alts ibs0    ibs2    n       x_ibs0  x_ibs2  expected_relatedness
My6606T4_19_1323        GRCh38:My6606T4_19_1323 0.941   0.856   6543    6638    6170    3243    4213    2787    5       11237   11345   0       161     -1.0
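
For readers who want to sanity-check a row like the one above, here is a minimal sketch that parses the header and the row and recomputes the two summary columns from the counts. The formulas are an assumption based on the description in the somalier publication as I understand it (shared counts minus twice `ibs0`, divided by the smaller per-sample count); they are not stated in this thread, and `parse_pair` is a hypothetical helper, not part of somalier.

```python
HEADER = (
    "#sample_a sample_b relatedness hom_concordance hets_a hets_b shared_hets "
    "hom_alts_a hom_alts_b shared_hom_alts ibs0 ibs2 n x_ibs0 x_ibs2 expected_relatedness"
)
ROW = (
    "My6606T4_19_1323 GRCh38:My6606T4_19_1323 0.941 0.856 6543 6638 6170 "
    "3243 4213 2787 5 11237 11345 0 161 -1.0"
)


def parse_pair(header: str, row: str) -> dict:
    """Zip whitespace-separated column names and values into a dict."""
    names = header.lstrip("#").split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    return dict(zip(names, values))


pair = parse_pair(HEADER, ROW)
ibs0 = int(pair["ibs0"])

# Assumed formula: (shared_hets - 2 * ibs0) / min(hets_a, hets_b)
relatedness = (int(pair["shared_hets"]) - 2 * ibs0) / min(
    int(pair["hets_a"]), int(pair["hets_b"])
)

# Assumed formula: (shared_hom_alts - 2 * ibs0) / min(hom_alts_a, hom_alts_b)
hom_concordance = (int(pair["shared_hom_alts"]) - 2 * ibs0) / min(
    int(pair["hom_alts_a"]), int(pair["hom_alts_b"])
)

print(f"relatedness     reported={pair['relatedness']} recomputed={relatedness:.3f}")
print(f"hom_concordance reported={pair['hom_concordance']} recomputed={hom_concordance:.3f}")
```

With the counts in this row, both recomputed values agree with the reported columns to three decimals, which is at least consistent with the assumed formulas.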

Is it somehow possible to compare digests extracted with different (subsets of) sites?

brentp commented 1 year ago

Is it somehow possible to compare digests extracted with different (subsets of) sites?

It is not. But I'd like to make your sites file available. I am thinking about updating the sites files to exclude these 5 variants, and to include your set on the downloads. Or, we could simply include your set with the knowledge that it's close enough. Perhaps a better way to do it would be to have the 5 sites in your file point to non-variant sites (but match the reference to avoid an error message). Then it will just ignore them, instead of having, e.g., het sites that should match between samples but do not.
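
The "match the reference" requirement in the suggestion above can be illustrated with a toy check. This is only a sketch under stated assumptions: the reference is a small in-memory dict rather than a real FASTA, positions are 1-based as in VCF, and `ref_allele_matches` and `TOY_REFERENCE` are hypothetical names, not somalier code.

```python
from typing import Mapping


def ref_allele_matches(reference: Mapping[str, str], chrom: str, pos: int, ref: str) -> bool:
    """Return True if `ref` equals the reference bases starting at 1-based `pos`."""
    seq = reference[chrom]
    start = pos - 1  # convert the 1-based VCF position to a 0-based index
    if start < 0 or start + len(ref) > len(seq):
        raise ValueError(f"{chrom}:{pos} is outside the reference sequence")
    return seq[start:start + len(ref)].upper() == ref.upper()


# Toy stand-in for a reference genome; a real check would read a FASTA file.
TOY_REFERENCE = {"chrT": "ACGTACGT"}

print(ref_allele_matches(TOY_REFERENCE, "chrT", 3, "G"))  # True: base 3 is G
print(ref_allele_matches(TOY_REFERENCE, "chrT", 3, "A"))  # False: REF would not match
```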

kpalin commented 1 year ago

Feel free to add "my" sites file to downloads. If you decide to remove those 5 sites, do make sure to version the new sites files clearly. My motivation for using "wrong" positions was to retain backward compatibility with my >4000 old digests.

brentp commented 1 year ago

Yes, I think I'll leave in those 5 sites for exactly the reason you state. Thanks very much!