brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
262 stars 35 forks source link

Clarification in relate command across genome builds #108

Open alexiswl opened 1 year ago

alexiswl commented 1 year ago

In the v0.2.16 release it states

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

When running

docker run \
  -w $PWD \
  -v $PWD:$PWD \
  docker.io/brentp/somalier:v0.2.16 \
    somalier relate \
      hg38/PTC_Tsqn221025.somalier \
      hg38_rna/PTC_Tsqn221025.somalier

I get

somalier version: 0.2.16
[somalier] starting read of 2 samples
fatal.nim(53)            sysFatal
Error: unhandled exception: relate.nim(431, 16) `nsites == last_nsites`  [AssertionDefect]

So assume this cross-reference does not apply for somalier files generated with the hg38 rna sites file?

alexiswl commented 1 year ago

Also in this release, in the example, the docker tag is set as 0.2.1.6 instead of 0.2.16

brentp commented 1 year ago

Hi, the releases page indicates that the RNA sites file is different coordinates from the others: https://github.com/brentp/somalier/releases/tag/v0.2.16

I updated that last week and made it more clear, I think.

I've now updated the docker tag. Thanks for reporting. We could also ask @kpalin if it's much trouble to generate a set of T2T sites corresponding to the RNA sites.

alexiswl commented 1 year ago

Thanks for clarifying @brentp, for me it wasn't clear from the release page that just because the sites file had a different number of sites, fingerprints generated with this sites file wouldn't be compatible with the rest of the references.

Would it be possible to have a hg19 RNA sites file? I can create a separate issue for this request. We're using the hg38 RNA sites file for our WGS and WTS workflows but cannot compare these to our ctDNA TSO500 samples (since Illumina's ctTSO pipeline uses a hg19 reference). We're not sure whether to: a) re-run somalier extract on our WGS and WTS samples with the standard hg38 reference so we can compare these somalier fingerprints to our ctTSO fingerprints or b) just realign the ctTSO samples to the hg38 reference so we can run them against the hg38 RNA sites file.

Your input would be very much appreciated!

Alexis

kpalin commented 1 year ago

It is a bit of a pain to adjust the few troublesome SNPs (12 at this case) but here it is sites.chm13v2.rna.vcf.gz

The troublesome sites (deleted in T2T assembly) are below, placed on (hopefully) correct side and near the closest correctly liftedover SNP. The chm13v2 coordinates are on columns 1 and 2.

chr1    153756  GRCh38:chr1:633209:T
chr1    1057895 GRCh38:chr1:1635543:T
chr2    110077423       GRCh38:chr2:110178290:A
chr5    675737  GRCh38:chr5:732263:G
chr5    675743  GRCh38:chr5:711356:T
chr5    729653  GRCh38:chr5:766918:A
chr6    32383328        GRCh38:chr6:32519491:A
chr13   31419757        GRCh38:chr13:31961927:T
chr13   113023764       GRCh38:chr13:113803204:A
chr15   82322678        GRCh38:chr15:84398863:G
chr15   82322680        GRCh38:chr15:84391954:A
chrX    117843796       GRCh38:chrX:119470473:T

Here's somalier relate I got for one sample of Nanopore WGS with GRCh38 and chm13v2 sites (same data, different basecall and reference).

relatedness                                    0.896
hom_concordance                                0.731
hets_a                                          5936
hets_b                                          5976
shared_hets                                     5328
hom_alts_a                                      3133
hom_alts_b                                      3110
shared_hom_alts                                 2281
ibs0                                               4
ibs2                                            9330
n                                               9480
x_ibs0                                             2
x_ibs2                                           296
expected_relatedness                              -1
brentp commented 1 year ago

thank you! @alexiswl does this work for you? if so I'll add to the sites in the releases.

alexiswl commented 1 year ago

@brentp we're after a hg19 RNA sites file. I think the file above generated by @kpalin is an RNA file for the chm13v2 assembly

brentp commented 1 year ago

@alexiswl , I think it's best to re-extract with a pair of matching sites.

alexiswl commented 1 year ago

@brentp would a slightly out of order vcf work?

For the 24 SNPs that couldn't be mapped, some are in locations that are in region lower / or higher than the next snp.

i.e merging the liftover and rejected we get

chr1    147905175       .       G       T       100     PASS    OriginalContig=chr1;OriginalStart=148433037;SwappedAlleles;AC=86217;AF=0.398
chr1    149029906       .       G       A       100     NoTarget        AC=105816;AF=0.773916
chr1    149818561       .       G       A       100     PASS    OriginalContig=chr1;OriginalStart=149846994;AC=14557;AF=0.101699

Where 147905175 and 149818561 are references in the hg19 reference and from hg38 positions 148433037, and 149846994 respectively. 149029906 is a hg38 position in the rejected vcf.

I generated a 200 bp fastq for each of the rejected variants, re aligned them to both hg38 and hg19 to find their corresponding mapping positions and for 149029906 this became 144854487 (on the complement strand of the hg19 reference) - see screenshot below. So a part of the genome that has been flipped between the hg19 and hg38 reference

image

If I update this new vcf as

chr1    147905175       .       G       T       100     PASS    OriginalContig=chr1;OriginalStart=148433037;SwappedAlleles;AC=86217;AF=0.398
chr1    144854487       .       T       G       100     Manual        OriginalContig=chr1,OriginalStart=149029906,AC=105816;AF=0.773916
chr1    149818561       .       G       A       100     PASS    OriginalContig=chr1;OriginalStart=149846994;AC=14557;AF=0.101699

The new vcfs technically now out of position order 147905175, 144854487, 149818561 but is of the same order as the original hg38 (necessary since the somalier binary does not store positional information as far as I know).

Would this cause an error when running the somalier extract command? Or is this a valid workaround?

I'd rather not have to re-extract the sites from both our WGS and TSO data

Alexis

alexiswl commented 1 year ago

Hi Brent,

I've built a hg19 rna sites file (that's slightly out of order, necessary to maintain positional memory from the hg38 sites file).

I've managed to test concordance across genome builds, with bam files from the same subject aligned to hg38 and hg19 references and it works great!

Attached is the notebook (in zip) used to generate the file AND the output sites file itself. 24 variants could NOT be matched with picard's liftover capability.

I was able to find 15 of these variants easily through vcf -> bed -> fastq remap -> manual IGV assessment.

For six of the remaining 9 variants, I was able to approximate the target by determining the expected distance between the adjacent successful targets and used a position in the hg19 WGS sites file that resided mostly closely to this expected position.

For the remaining three variants, the expected position of these on hg19 was N. I was able to leave these as 'N' without consequence.

sites.hg19.rna.vcf.gz Create HG19 RNA Sites.zip