brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

sm1.somalier and no other output #130

Closed methionine23 closed 9 months ago

methionine23 commented 9 months ago

Hello, I tried somalier version 0.2.19, which worked well on DNA data (thank you!). But when I ran it on RNA-seq data (with the GTEx/TOPMed reference and sites.hg38.rna.vcf.gz), somalier extract did not generate sample.somalier; instead it wrote a file named sm1.somalier to the folder.

sm1.somalier looks binary, and it begins with:

0209 4b47 4c37 3439 736d 3155 401a 0303 0001 0000 0042 0000 0002 0000 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0005 0000 0000 0000 0000 0000 0003 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0056 0000 0000 0000 0002 0000 0000 0000 0000 0000 00fb 0000 00bf 0000 0001 0000 0001 0000 0000 0000 0000 0000 0011 0000 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0006 0000 0000 0000 0000 0000 0000 0000 0002 0000 0000 0000 0000 0000 0000 0000

Any suggestions? Thanks again,

brentp commented 9 months ago

Hi, is your data in hg38 coordinates? You can see the sites in the somalier VCF, so I'd check the depth in your bam/cram at a few of those sites. What is the coverage?
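One way to run the depth check suggested here is to turn the sites VCF into a BED file and hand it to samtools depth. The following is only a sketch under stated assumptions: the function name vcf_to_bed and the file names sites.bed and sample.bam are hypothetical, and it assumes every VCF record is a single-base site.

```shell
# Convert VCF records on stdin to 0-based, half-open BED intervals on stdout.
# Assumes each record is a single-base site; header lines (starting with '#') are skipped.
vcf_to_bed() {
  awk 'BEGIN { OFS = "\t" } !/^#/ { print $1, $2 - 1, $2 }'
}

# Example usage (sites.bed and sample.bam are placeholder names):
#   gzip -dc sites.hg38.rna.vcf.gz | vcf_to_bed > sites.bed
#   samtools depth -a -b sites.bed sample.bam | head
```

The -a flag asks samtools depth to report positions with zero depth as well, so sites with no coverage still show up in the output.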

methionine23 commented 9 months ago

The BAM and FASTA are in hg38. Coverage at the sites in sites.hg38.rna.vcf.gz is shown below (the last column is the depth):

chr1 633208 633209 . 100 T C PASS AC=27151;AF=0.214074 2363
chr1 727241 727242 . 100 G A PASS AC=21192;AF=0.158043 9
chr1 819122 819123 . 100 G A PASS AC=97190;AF=0.680021 2
chr1 841741 841742 . 100 A T PASS AC=105236;AF=0.735772 89
chr1 852304 852305 . 100 G T PASS AC=91146;AF=0.642317 96
chr1 917583 917584 . 100 T G PASS AC=106319;AF=0.74269 1
chr1 946246 946247 . 100 G A PASS AC=67798;AF=0.473655 115
chr1 965349 965350 . 100 G A PASS AC=101063;AF=0.706794 4
chr1 975554 975555 . 100 G A PASS AC=35718;AF=0.249379 599
chr1 995981 995982 . 100 G A PASS AC=53714;AF=0.37589 25
chr1 1014227 1014228 . 100 G A PASS AC=55560;AF=0.387994 42
chr1 1048921 1048922 . 100 T C PASS AC=70488;AF=0.495243 1
chr1 1055425 1055426 . 100 G A PASS AC=116092;AF=0.810551 7
chr1 1082763 1082764 . 100 T C PASS AC=71477;AF=0.499616 3
chr1 1091689 1091690 . 100 G A PASS AC=23743;AF=0.174946 1
chr1 1203821 1203822 . 100 T C PASS AC=24860;AF=0.173659 1
chr1 1217732 1217733 . 100 G A PASS AC=15419;AF=0.107654 181
chr1 1251121 1251122 . 100 A T PASS AC=30317;AF=0.21206 2
chr1 1288004 1288005 . 100 G C PASS AC=22126;AF=0.154859 3
chr1 1313806 1313807 . 100 G A PASS AC=78748;AF=0.549955 218
chr1 1327981 1327982 . 100 G A PASS AC=108038;AF=0.754297 133
chr1 1334605 1334606 . 100 G A PASS AC=14481;AF=0.101096 8
chr1 1353442 1353443 . 100 A G PASS AC=115068;AF=0.803952 23
chr1 1361678 1361679 . 100 G A PASS AC=93961;AF=0.65629 1
chr1 1374504 1374505 . 100 A G PASS AC=26173;AF=0.182734 1075
chr1 1402456 1402457 . 100 A G PASS AC=62275;AF=0.435654 1044
chr1 1409622 1409623 . 100 T C PASS AC=28172;AF=0.197438 2
chr1 1421169 1421170 . 100 A G PASS AC=36059;AF=0.25177 3
chr1 1439453 1439454 . 100 A G PASS AC=60818;AF=0.427459 1
chr1 1469428 1469429 . 100 A G PASS AC=41464;AF=0.310545 15
chr1 1490319 1490320 . 100 T C PASS AC=46872;AF=0.32763 28
chr1 1534165 1534166 . 100 G A PASS AC=102116;AF=0.725184 41
chr1 1543952 1543953 . 100 A G PASS AC=70415;AF=0.491745 664
chr1 1561820 1561821 . 100 A C PASS AC=70568;AF=0.493724 716
chr1 1575863 1575864 . 100 G A PASS AC=69681;AF=0.487914 4
chr1 1615321 1615322 . 100 A G PASS AC=90046;AF=0.628497 34
chr1 1623411 1623412 . 100 T C PASS AC=123604;AF=0.863808 98
chr1 1635542 1635543 . 100 T C PASS AC=101133;AF=0.886494 8
chr1 1662854 1662855 . 100 T C PASS AC=27234;AF=0.202592 37
chr1 1671994 1671995 . 100 G T PASS AC=51582;AF=0.360572 37
chr1 1703967 1703968 . 100 A G PASS AC=14703;AF=0.103522 19
chr1 1719367 1719368 . 100 T C PASS AC=113709;AF=0.8014 22
chr1 1731962 1731963 . 100 A C PASS AC=50148;AF=0.361068 8
chr1 1752460 1752461 . 100 G A PASS AC=54321;AF=0.379289 29
chr1 1892482 1892483 . 100 A G PASS AC=119290;AF=0.833659 5
chr1 1918304 1918305 . 100 G A PASS AC=20662;AF=0.144431 645
chr1 2186899 2186900 . 100 A G PASS AC=56019;AF=0.391254 8
chr1 2193732 2193733 . 100 G A PASS AC=29602;AF=0.206796 62
chr1 2308566 2308567 . 100 T C PASS AC=118951;AF=0.831081 40
chr1 2321319 2321320 . 100 A G PASS AC=58531;AF=0.409142 0
chr1 2352456 2352457 . 100 A G PASS AC=75069;AF=0.524643 5
chr1 2377642 2377643 . 100 T G PASS AC=61192;AF=0.427557 3
chr1 2395372 2395373 . 100 A G PASS AC=58606;AF=0.40958 355
chr1 2408760 2408761 . 100 T C PASS AC=106097;AF=0.74147 31
chr1 2509918 2509919 . 100 T C PASS AC=53764;AF=0.375951 54
chr1 2521129 2521130 . 100 T C PASS AC=114175;AF=0.797523 100
chr1 2547071 2547072 . 100 T A PASS AC=83589;AF=0.587034 0
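A per-site listing like this is easier to judge once it is summarized. As a rough sketch, the helper below counts how many sites reach a chosen depth. It assumes the listing is stored one site per line with the depth as the final whitespace-separated field; the function name count_covered_sites, the file name coverage.txt, and the threshold in the usage line are all hypothetical choices, not somalier settings.

```shell
# Count how many rows on stdin have a last-column depth >= the threshold given as $1.
# Prints "<passing> <total>". Assumes whitespace-separated rows, depth in the final field.
count_covered_sites() {
  awk -v min="$1" 'NF > 0 { total++; if ($NF + 0 >= min + 0) pass++ }
                   END { print pass + 0, total + 0 }'
}

# Example usage (coverage.txt and the threshold 10 are placeholders):
#   count_covered_sites 10 < coverage.txt
```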

methionine23 commented 9 months ago

I also tried both the TOPMed FASTA and GRCh38; each run still left the same sm1.somalier in the output folder.

brentp commented 9 months ago

Well, your data is pretty sparse. I think somalier will save, but not use, sites with coverage below 7. Also, the output file is named after the read-group SM tag, so that must be sm1. You can check with:

samtools view -H "$bam" | grep '^@RG'

methionine23 commented 9 months ago

I hadn't realized the SM tag was already set in the BAMs I received (I was using --sample-prefix=). All solved now, and thank you again!
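Since the thread resolves to the output being named after the read-group SM tag, it can help to list the SM values in a BAM header before running somalier extract. A minimal sketch, assuming tab-separated @RG header lines as produced by samtools view -H; the function name rg_samples and the file name sample.bam are hypothetical.

```shell
# Print the unique SM (sample) values found in @RG header lines read from stdin.
rg_samples() {
  awk -F '\t' '/^@RG/ { for (i = 2; i <= NF; i++) if ($i ~ /^SM:/) print substr($i, 4) }' | sort -u
}

# Example usage (sample.bam is a placeholder name):
#   samtools view -H sample.bam | rg_samples
```

If this prints a name you do not expect, such as sm1, that is likely the name the .somalier file will carry.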