Open alexiswl opened 2 years ago
Also in this release, in the example, the docker tag is set as 0.2.1.6 instead of 0.2.16
Hi, the releases page indicates that the RNA sites file is different coordinates from the others: https://github.com/brentp/somalier/releases/tag/v0.2.16
I updated that last week and made it more clear, I think.
I've now updated the docker tag. Thanks for reporting. We could also ask @kpalin if it's much trouble to generate a set of T2T sites corresponding to the RNA sites.
Thanks for clarifying @brentp, for me it wasn't clear from the release page that just because the sites file had a different number of sites, fingerprints generated with this sites file wouldn't be compatible with the rest of the references.
Would it be possible to have a hg19 RNA sites file? I can create a separate issue for this request. We're using the hg38 RNA sites file for our WGS and WTS workflows but cannot compare these to our ctDNA TSO500 samples (since Illumina's ctTSO pipeline uses a hg19 reference). We're not sure whether to: a) re-run somalier extract on our WGS and WTS samples with the standard hg38 reference so we can compare these somalier fingerprints to our ctTSO fingerprints or b) just realign the ctTSO samples to the hg38 reference so we can run them against the hg38 RNA sites file.
Your input would be very much appreciated!
Alexis
It is a bit of a pain to adjust the few troublesome SNPs (12 at this case) but here it is sites.chm13v2.rna.vcf.gz
The troublesome sites (deleted in T2T assembly) are below, placed on (hopefully) correct side and near the closest correctly liftedover SNP. The chm13v2 coordinates are on columns 1 and 2.
chr1 153756 GRCh38:chr1:633209:T
chr1 1057895 GRCh38:chr1:1635543:T
chr2 110077423 GRCh38:chr2:110178290:A
chr5 675737 GRCh38:chr5:732263:G
chr5 675743 GRCh38:chr5:711356:T
chr5 729653 GRCh38:chr5:766918:A
chr6 32383328 GRCh38:chr6:32519491:A
chr13 31419757 GRCh38:chr13:31961927:T
chr13 113023764 GRCh38:chr13:113803204:A
chr15 82322678 GRCh38:chr15:84398863:G
chr15 82322680 GRCh38:chr15:84391954:A
chrX 117843796 GRCh38:chrX:119470473:T
Here's somalier relate I got for one sample of Nanopore WGS with GRCh38 and chm13v2 sites (same data, different basecall and reference).
relatedness 0.896
hom_concordance 0.731
hets_a 5936
hets_b 5976
shared_hets 5328
hom_alts_a 3133
hom_alts_b 3110
shared_hom_alts 2281
ibs0 4
ibs2 9330
n 9480
x_ibs0 2
x_ibs2 296
expected_relatedness -1
thank you! @alexiswl does this work for you? if so I'll add to the sites in the releases.
@brentp we're after a hg19 RNA sites file. I think the file above generated by @kpalin is an RNA file for the chm13v2 assembly
@alexiswl , I think it's best to re-extract with a pair of matching sites.
@brentp would a slightly out of order vcf work?
For the 24 SNPs that couldn't be mapped, some are in locations that are in region lower / or higher than the next snp.
i.e merging the liftover and rejected we get
chr1 147905175 . G T 100 PASS OriginalContig=chr1;OriginalStart=148433037;SwappedAlleles;AC=86217;AF=0.398
chr1 149029906 . G A 100 NoTarget AC=105816;AF=0.773916
chr1 149818561 . G A 100 PASS OriginalContig=chr1;OriginalStart=149846994;AC=14557;AF=0.101699
Where 147905175 and 149818561 are references in the hg19 reference and from hg38 positions 148433037, and 149846994 respectively. 149029906 is a hg38 position in the rejected vcf.
I generated a 200 bp fastq for each of the rejected variants, re aligned them to both hg38 and hg19 to find their corresponding mapping positions and for 149029906 this became 144854487 (on the complement strand of the hg19 reference) - see screenshot below. So a part of the genome that has been flipped between the hg19 and hg38 reference
If I update this new vcf as
chr1 147905175 . G T 100 PASS OriginalContig=chr1;OriginalStart=148433037;SwappedAlleles;AC=86217;AF=0.398
chr1 144854487 . T G 100 Manual OriginalContig=chr1,OriginalStart=149029906,AC=105816;AF=0.773916
chr1 149818561 . G A 100 PASS OriginalContig=chr1;OriginalStart=149846994;AC=14557;AF=0.101699
The new vcfs technically now out of position order 147905175, 144854487, 149818561 but is of the same order as the original hg38 (necessary since the somalier binary does not store positional information as far as I know).
Would this cause an error when running the somalier extract command? Or is this a valid workaround?
I'd rather not have to re-extract the sites from both our WGS and TSO data
Alexis
Hi Brent,
I've built a hg19 rna sites file (that's slightly out of order, necessary to maintain positional memory from the hg38 sites file).
I've managed to test concordance across genome builds, with bam files from the same subject aligned to hg38 and hg19 references and it works great!
Attached is the notebook (in zip) used to generate the file AND the output sites file itself. 24 variants could NOT be matched with picard's liftover capability.
I was able to find 15 of these variants easily through vcf -> bed -> fastq remap -> manual IGV assessment.
For six of the remaining 9 variants, I was able to approximate the target by determining the expected distance between the adjacent successful targets and used a position in the hg19 WGS sites file that resided mostly closely to this expected position.
For the remaining three variants, the expected position of these on hg19 was N. I was able to leave these as 'N' without consequence.
In the v0.2.16 release it states
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
When running
I get
So assume this cross-reference does not apply for somalier files generated with the hg38 rna sites file?