Open Han-Cao opened 3 weeks ago
Thanks for bringing this up. I'm hoping #1403 helps for the error handling at least. As for where this giant variant is coming, that seems like a bug somewhere else?
Hi @glennhickey ,
Thanks for the quick fix!
I have re-run the pipeline, and the same error occurs, so it should be a bug. I cannot share the whole gbz / vg files because of the research consent. Do you have any suggestions on what analysis I can do to debug this issue?
The error record can only be found in CHM13-based VCF, when I try to get the record by bcftools view -r chr13:72847648
, I got the error:
[E::vcf_format] Invalid BCF, the FILTER tag id=65 at chr13:72847648 not present in the header
[main_vcfview] Error: cannot write to (null)
However, I also searched for it in GRCh38-based VCF by chr13:73625000-73627000
(2kb flanking of liftover region), the error record cannot be found.
The code I used is :
# run cactus-graphmap-join
cactus-graphmap-join $jobpath \
--vg ${outpath}/chrom-alignments/*.vg \
--hal ${outpath}/chrom-alignments/*.hal \
--outDir ${outpath} \
--outName name \
--workDir $workdir \
--reference CHM13 GRCh38 \
--vcf --vcfReference CHM13 GRCh38 \
--vcfwave \
--viz --odgi \
--chrom-vg clip filter --chrom-og \
--gbz clip filter full --gfa clip full \
--haplo clip --giraffe clip \
--filter $n_filter \
--logFile log/cactus-pangenome/cactus-graphmap-join.log \
--maxCores 40 \
--indexCores 39
Thank you!
I cleaned up this part of the code in #1404, getting rid of the vcf-bubwave script altogether (and doing the logic directly in Toil). This should further improve error handling and resource allocation. If your issue is from a bug in vcfwave itself, it won't help you much but I'm hoping it at least fixes the Cactus end of things.
Thanks for the update. I have finished the whole analysis pipeline, but error record still exists in the raw CHM13-based VCF.
I would like to manually check the graph and assembly to see what is wrong with chr13. Is there any method can identify which haplotype harbors the variant >12942157>13476226
from the gbz graph?
Thank you!
By default the IDs in the VCF correspond to the GFA and not the GBZ which is confusing. As the link says, simplest to use --chop
to make sure they are the same.
Your best bet is probably running vg chunk
on the xg
output to look at your region.
I think if you have an invalid region in your VCF, then this is most likely a bug in vcfwave
. I'd suggest checking in the region in the non-vcfwave vcf and following up, if possible, on the vcflib github (though please continue to report anything suspicious you find in cactus here).
Thank for your suggestions. I will look into the xg output.
However, the error region should be generated by the MC pipeline:
.raw.vcf.gz
, which should be generated by vg deconstruct
. I didn't find any issue with all GRCh38-based VCFsvcfbub
will drop the error record, because its ref allele is larger than --max-ref-length 100000
. Therefore, everything is fine for VCFs that have been processed with vcfbub
vcf-bubwave.sh
previously has error is because the script first use bcftools view -r CONTIG raw.vcf.gz
to split the .raw.vcf.gz
into different chromosomes. The error record disrupts the VCF specification and make bcftools
produce truncated VCF.I will look into the error region in the graph and provide more information about it.
Oh, I see, thanks for clarifying. I'd understood the error was after vcfwave
. If it's in deconstruct
that's much more serious. It must be a size limit being exceeded somewhere that's somehow preventing the format tags from coming through or something.
I haven't heard of this before but, like you say, maybe it's happened and nobody noticed because they were only using the vcfbub
output which masks it. I just checked and the HPRC v1.1 graphs don't have this issue.
I guess you have an enormous site in your graph somehow and this is failing to deconstruct. I'd be curious if the site looks valid when you view it with gzip -dc
instead of bcftools
? Given that vg
can write it and vcfbub
can parse it, it seems most likely the limit is coming from bcftools
. But in any case, it sounds like vg deconstruct
should probably have a default limit to prevent these cases. And finally, I aggree you should double check your graph -- it does look like maybe there is a strange rearrangement happening...
Speaking of bcftools
. Do you know if there's a way to use it to quickly check that the VCF is invalid in this case. Maybe bcftools stats
? I can add that to the pipeline to make sure these errors are caught earlier, and even if vcfwave isnt requested.
thanks!
(Note: I made some mistakes in the previous analysis and have revised this comment multiple times, please ignore all the previous version of this comment if you already read it)
I did some analysis and I think there could be some issue with deconstruct
:
chr13.og
. There is a large bubble in this region. It seems many haplotypes cannot be aligned to the reference genome for this region. Moreover, as this error is only found in CHM13 VCF, I am wondering if it is possible that CHM13 and GRCh38 are in the different paths of the plot. Is there any method to check this?2, I extracted the alternative alleles of this record by zgrep $var_id vcf.gz | cut -f 5 | tr ',' '\n' | awk '{print ">alt_allele." FNR "\n" $0}' > error_alleles.fa
and mapped them to CHM13 by minigraph -t 40 -cxasm chm13.fa error_alleles.fa
. The first 12 columns of the paf
output is listed below. All the alleles are exactly mapped to the same region except for the last allele. However, as the last allele also start from the same position on chr13 and is much shorter than the others, I guess this is where the record is truncated
alt_allele.1 28775088 1 28775088 + chr13 113566686 72847648 101634483 28727232 28808098 60
alt_allele.2 28783652 1 28783652 + chr13 113566686 72847648 101634483 28724641 28817745 60
alt_allele.3 28787287 1 28787287 + chr13 113566686 72847648 101634483 28735578 28813045 60
alt_allele.4 28770863 1 28770863 + chr13 113566686 72847648 101634483 28724199 28809144 60
alt_allele.5 28781783 1 28781783 + chr13 113566686 72847648 101634483 28724377 28815735 60
alt_allele.6 28785067 1 28785067 + chr13 113566686 72847648 101634483 28730621 28815998 60
alt_allele.7 28781653 1 28781653 + chr13 113566686 72847648 101634483 28727304 28814033 60
alt_allele.8 28786875 1 28786875 + chr13 113566686 72847648 101634483 28736091 28811685 60
alt_allele.9 28783838 1 28783838 + chr13 113566686 72847648 101634483 28728763 28815085 60
alt_allele.10 28785444 1 28785444 + chr13 113566686 72847648 101634483 28729919 28815821 60
alt_allele.11 28791498 1 28791498 + chr13 113566686 72847648 101634483 28737503 28816330 60
alt_allele.12 28769034 1 28769034 + chr13 113566686 72847648 101634483 28724551 28806226 60
alt_allele.13 28762813 1 28762813 + chr13 113566686 72847648 101634483 28713289 28810449 60
alt_allele.14 28863551 1 28863551 + chr13 113566686 72847648 101634483 28732859 28893410 60
alt_allele.15 28779771 1 28779771 + chr13 113566686 72847648 101634483 28727122 28813498 60
alt_allele.16 28786090 1 28786090 + chr13 113566686 72847648 101634483 28733495 28813540 60
alt_allele.17 28747969 1 28747969 + chr13 113566686 72847648 101634483 28696149 28813859 60
alt_allele.18 28781828 1 28781828 + chr13 113566686 72847648 101634483 28731890 28812438 60
alt_allele.19 28775901 1 28775901 + chr13 113566686 72847648 101634483 28722923 28813487 60
alt_allele.20 28781342 1 28781342 + chr13 113566686 72847648 101634483 28728968 28813841 60
alt_allele.21 28776894 1 28776894 + chr13 113566686 72847648 101634483 28724942 28812698 60
alt_allele.22 28786983 1 28786983 + chr13 113566686 72847648 101634483 28731341 28815896 60
alt_allele.23 28785536 1 28785536 + chr13 113566686 72847648 101634483 28729816 28816808 60
alt_allele.24 28780467 1 28780467 + chr13 113566686 72847648 101634483 28726272 28813855 60
alt_allele.25 28775716 1 28775716 + chr13 113566686 72847648 101634483 28728814 28808355 60
alt_allele.26 28775338 1 28775338 + chr13 113566686 72847648 101634483 28719564 28814298 60
alt_allele.27 28779782 1 28779782 + chr13 113566686 72847648 101634483 28724134 28814814 60
alt_allele.28 28787742 1 28787742 + chr13 113566686 72847648 101634483 28731157 28816870 60
alt_allele.29 28773218 1 28773218 + chr13 113566686 72847648 101634483 28709266 28823654 60
alt_allele.30 28776311 1 28776311 + chr13 113566686 72847648 101634483 28724667 28811053 60
alt_allele.31 28782570 1 28782570 + chr13 113566686 72847648 101634483 28735326 28808466 60
alt_allele.32 28775617 1 28775617 + chr13 113566686 72847648 101634483 28719443 28816078 60
alt_allele.33 28779328 1 28779328 + chr13 113566686 72847648 101634483 28727660 28813250 60
alt_allele.34 28767089 1 28767089 + chr13 113566686 72847648 101634483 28722414 28806142 60
alt_allele.35 28777767 1 28777767 + chr13 113566686 72847648 101634483 28729586 28808926 60
alt_allele.36 28767276 1 28767276 + chr13 113566686 72847648 101634483 28714088 28815213 60
alt_allele.37 28776527 1 28776527 + chr13 113566686 72847648 101634483 28726793 28809746 60
alt_allele.38 28781219 1 28781219 + chr13 113566686 72847648 101634483 28728370 28811829 60
alt_allele.39 28857007 1 28857007 + chr13 113566686 72847648 101634483 28719018 28898203 60
alt_allele.40 28774567 1 28774567 + chr13 113566686 72847648 101634483 28722972 28811941 60
alt_allele.41 28793748 1 28793748 + chr13 113566686 72847648 101634483 28731981 28821469 60
alt_allele.42 28787854 1 28787854 + chr13 113566686 72847648 101634483 28737044 28812824 60
alt_allele.43 28776767 1 28776767 + chr13 113566686 72847648 101634483 28725788 28810700 60
alt_allele.44 28781969 1 28781969 + chr13 113566686 72847648 101634483 28727100 28815147 60
alt_allele.45 28776085 1 28776085 + chr13 113566686 72847648 101634483 28722839 28812953 60
alt_allele.46 28771838 1 28771838 + chr13 113566686 72847648 101634483 28725965 28806963 60
alt_allele.47 28764459 1 28764459 + chr13 113566686 72847648 101634483 28712803 28811349 60
alt_allele.48 28778129 1 28778129 + chr13 113566686 72847648 101634483 28733236 28806797 60
alt_allele.49 28772800 1 28772800 + chr13 113566686 72847648 101634483 28726128 28807854 60
alt_allele.50 28777441 1 28777441 + chr13 113566686 72847648 101634483 28724574 28811689 60
alt_allele.51 28778561 1 28778561 + chr13 113566686 72847648 101634483 28721830 28816024 60
alt_allele.52 28774594 1 28774594 + chr13 113566686 72847648 101634483 28728758 28807148 60
alt_allele.53 28779299 1 28779299 + chr13 113566686 72847648 101634483 28726324 28814334 60
alt_allele.54 28779005 1 28779005 + chr13 113566686 72847648 101634483 28714684 28825512 60
alt_allele.55 28774398 1 28774398 + chr13 113566686 72847648 101634483 28721435 28813909 60
alt_allele.56 28782799 1 28782799 + chr13 113566686 72847648 101634483 28729076 28813409 60
alt_allele.57 28779779 1 28779779 + chr13 113566686 72847648 101634483 28726687 28812691 60
alt_allele.58 17672661 1 17672635 + chr13 113566686 72847648 90523622 17639020 17694136 60
vg giraffe
, vg call
. I can share the sequences of the ref and alt alleles if you think they are helpful for debug.Regarding bcftools
, I just tested bcftools stats
(1.18) on the VCFs. When using raw.vcf.gz
, it will raise Command terminated by signal 11
, while using the vcf processed by vcfbub
can successfully output statistics.
Hi, thanks for the detailed follow-up. I agree that it looks like something is going wrong both in the graph (giant bubble) and in vg deconstruct
(invalid entry).
This does seem to be in a region that is more complex in CHM13.
This is presumably due to better representation in this complex region in the hprc assemblies and chm13 than grch38, which allows more alignment and complex graph structures.
But these visualizations are on the "full" graph. After the default minigraph-cactus clipping, the unaligned bits seem to disappear. Note that chr13.vg
is from here
vg chunk -x chr13.vg -S chr13.snarls -p CHM13#chr13:70000000-110000000 > chr13.region.vg
chr13.region.vg -b 10000000000 -t 64 > chr13.region.depth
cat chr13.region.depth | sort -nk 4 | head
HG00621#2#JAHBCC010000153.1#0 578 528957 88.3037 5.04057
HG01109#2#JAHEOZ010000195.1#0 2 526627 88.3183 6.08767
HG02145#1#JAHKSG010000325.1#0 189 526695 88.4065 4.8771
HG02572#2#JAHAOV010000392.1#0 1145874 1735671 88.4238 5.71329
NA20129#2#JAHEPD010000302.1#0 1 398903 88.4605 5.47383
HG02723#1#JAHEOU010000278.1#0 1 1513072 88.4677 5.7298
HG01243#2#JAHEOX010000207.1#0 1 525965 88.4713 4.37145
HG03486#2#JAHEOP010000252.1#0 2 525216 88.5117 4.31493
HG02630#1#JAHAOQ010000251.1#0 1 522955 88.5497 4.28433
HG03486#1#JAHEOQ010000218.1#0 1 525144 88.5553 3.76486
In other words there is no path going through this region that's aligned to fewer than 88 other paths.
Furthermore, it looks like most of the region aligns fine between grch38 and chm13
cat chr13.region.depth | grep GRC | awk '{sum += $3 - $2} END {print sum}'
39941468
cat chr13.region.depth | grep CHM | awk '{sum += $3 - $2} END {print sum}'
40000166
Perhaps you can use odgi viz
and / or vg depth
in this way to figure out which haplotypes are forming the bubble in your graph?
About bcftools stats
: thanks for checking. I will add this to the pipeline to make sure output VCFs are valid.
Thanks for the suggestions and help! I did similar analysis on my data.
There are some contigs in this region are not covered by many haplotypes (CHM13 and GRCh38 are listed at the end for reference). Here, I normalized the "depth" to (depth - N_haplotype) / N_haplotype. For HPRC1.1, the normalization equals to (depth - 88) / 88. Is there any method to check if these paths formed the giant bubble in the graph?
contig | start | end | length | (depth-N_hap)/N_hap |
---|---|---|---|---|
TAD472#2#h2tg000455l#0 | 1 | 370883 | 370882 | -8.7% |
GA23228#2#h2tg000324l#0 | 1 | 52904 | 52903 | -2.8% |
TAD573#2#h2tg000480l#0 | 1 | 52268 | 52267 | -1.7% |
GA22234#1#h1tg000235l#0 | 2 | 154518 | 154516 | -1.2% |
GA23207#1#h1tg000288l#0 | 1 | 151833 | 151832 | -1.0% |
CHM13#0#chr13 | 69999862 | 1.1E+08 | 40000154 | -0.1% |
GRCh38#0#chr13 | 1 | 24513716 | 24513715 | -0.1% |
GRCh38#0#chr13 | 52366383 | 67793650 | 15427267 | 0.1% |
I also calculate the number of bases aligned to this region for all haplotypes by cat chr13.region.depth | grep xxx | awk '{sum += $3 - $2} END {print sum}'
. However, the distribution looks normal, and I didn't find obvious outliers in this region.
Huh, I would have thought there'd be one that stood out with depth
. Another idea is to run vg chunk -S
(example in my previous message), convert the output to GFA (vg convert -fW
), and view the resulting graph in BandageNG. This should show you your bubble. It will allow you to click on nodes in the bubble, which you can use with vg find -n | vg view -
to see which path they are on. This should at least tell you once and for all the paths that are going through each side of your bubble.
As an aside, this patch #1421 resolves a related error. It seems to be different from what you're seeing, but similar enough that it may be worth trying out all the same. I will make a Cactus release for it soon.
Thank you so much for the great suggestion! It seems one haplotype causes this issue.
I generated the gfa using vg chunk -x chr13.vg -S chr13.snarls -p CHM13#0#chr13:70000000-102000000 -t 40
, vg convert -t 40 -fW
, the visualization at the error site is:
There is a giant loop in this region due to the link between 12942158 and 13476225 (in red). The position of the labeled nodes is listed below. The distance between these 2 nodes is ~28.7Mb, which is the size of the giant allele in VCF. And their position on chr13 also closed to the variant in VCF.
Segment | CHM13 | pos |
---|---|---|
12942157 | CHM13#0#chr13[69999861] | 2847758 |
12942158 | NA | NA |
13476225 | CHM13#0#chr13[69999861] | 31634372 |
13476226 | CHM13#0#chr13[69999861] | 31634622 |
I then generated a smaller gfa around 12942158 by vg find -n 12942158 -c 10 -x chr13.error_region.xg | vg view -
to look into this site. I found that only GA22163#1#h1tg000063l
harbors the link >12942158<13476225
. Moreover, most haplotypes have 13476225
while only GA22163#1#h1tg000063l
has 12942158
.
GA22163 1 h1tg000063l 13640679 13642707 >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942153<12942155>12942156>12942157>12942158<13476225<13476223<13476222<13476220<13476219<13476217<13476216<13476214<13476213<13476211
I used minigraph
to this sample to CHM13
. It seems there is a large inversion (72Mb - 101Mb on chr13) at this region, but this inversion is not fully covered by 1 contig: h1tg000063l
has 56Mb-72Mb-101Mb-99Mb, and h1tg000029l
has 99Mb-72Mb-101Mb-113Mb
h1tg000063l 18848877 301 16487557 + chr13 113566686 56364856 72847656 16445115 16510692 60
h1tg000063l 18848877 16487576 18848557 - chr13 113566686 99272532 101634483 2357061 2363431 60
h1tg000029l 37440300 11036220 37440275 + chr13 113566686 72847673 99261873 26359869 26435112 60
h1tg000029l 37440300 2274 11007249 - chr13 113566686 101663468 112675390 10961881 11041557 60
h1tg000029l 37440300 11007269 11036192 + chr13 113566686 101634494 101663431 28895 28942 60
I also found that some of the other haplotypes (including CHM13) are linked by >12942172>13476211
(in blue). But I cannot see the link in BandageNG. Interestingly, this can only be found in CHM13 but not GRCh38. I guess this is why only the CHM13-based VCF has error.
S 12942172 CGTGTTTCTGATTGCCTTTTAGTGGGCATGTGCTTAGACCTCAAAACTCACTCTATGTGCAGAGCTTCCTGTAGATTGGGATCTCCTTCACAGAAATGCTGTGTCTCTCTGCGGATGCAGCACACAGGAAGAAGTCACCCACTCGCTAC
S 13476211 T
W CHM13 0 chr13 69999861 70002649 >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942145>12942146>12942147>12942148>12942150>12942153>12942154>12942156>12942157>12942159>12942160>12942162>12942163>12942165>12942167>12942168>12942170>12942172>13476211>13476213>13476214>13476216>13476217>13476219>13476220>13476222>13476223>13476225>13476226>13476227>13476229>13476230>13476232>13476233>13476235>13476236>13476238
W GRCh38 0 chr13 0 1763 >13476211>13476213>13476214>13476216>13476217>13476219>13476220>13476222>13476223>13476225>13476226>13476227>13476229>13476230>13476232>13476233>13476235>13476236>13476238
W GRCh38 0 chr13 52366382 52367407 >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942145>12942146>12942147>12942148>12942150>12942153>12942154>12942156>12942157>12942159>12942160>12942162>12942163>12942165>12942167>12942168>12942170>12942172
My questions are:
>12942172>13476211
(blue nodes in the plot) is also an issue need to be fixed?Below is the gfa around 12942158, including CHM13, GRCh38, and the sample potentially with issue for your reference.
Thank you so much for your kind help and patience!
H VN:Z:1.1 RS:Z:GRCh38 CHM13
S 13476221 A
S 12942131 T
S 13476234 T
S 12942166 AACATCCCAGAGCTGCAGGTCAATGTCAAAGAGTCCCCAGACAGCCAAGGGTTGTGTCTAAGTCAGCATGGCTGAACCAAGTTCAAATCAACTGGATGAGCAGATAATCTCAAGTCTTGGAACTGATTAAAGGAATCCTTGTTTCTAGTGCCCCCAAAAAGGAAAAAAAAAAAAAAAAAAAAAACCTTTCCAGAGGAAGATAATGTTATCTTTTCCCTCAAAATATTTTTCAGATAACTTCATCACTTGTGTCTTCAGTGAGTAACATACTATGACAGACTACTGGAATATTTTCAGTAATTATAGCAGAATGGGCTGGGGAGGAGTATGGAGAAGCATACACCAGCTTTAAAGCCTTCCTCTTAGACGTGAAACACATCATTTCCACTCACACTACATTGCCTAAGGTAAATCTCATAACTGCACTTGAGCTTGATAGAGTTAGGAAATGCAATTTCTCCAATAAAATAGAAACCAGAAACGGATAAGCAGTCATGCAATTTTTCAGGGCCTGCCAAATATTTATTTTCCTCTTCTTTCTCAGTGCAAATCACACTCAACCTCTCCCCAAAGGACATATCCAAAATTTCCAACCAATCACTACTCTCAAAGTTCAGGATCTCCAAGTGACGCACAACAGCCTCTACATTGGGTCTAGATCTGACTTTTGATCTGGAGATTTGATAACTAGAGACAAGGTATCGGCCCTTTACACTCCAAAAAGGCAATGGGAAGGGATAACACCTGTACACTATCCTCTCCTGAAAGAGCAGAAATAGGAGAAGTCACTAGTAAATAACAATTCCAAAGTTACGCAGAGCATATGTGCTGAGGGGCTCGTATTCAGGGTGAGATGTGGTCTCCTCTCTGACATGACCTTCCCAGCCCATTGCTCTCTGTGCCCCTTTGCTCAGCTCTCTGAGGGGTCCTCCTTTTGTCCATTATCGTCCTTAACCACACCTTGAGAGAGCATTGAAGAAAACGCCTTCCTTGAGAGCTGAGCAGCTTTCTTAGCCAGCTTCCTTCTAGCAGAAGCATGGGTGCCTATGTGGCTTTTTAAGTACCAGCTACAGCCTCTTTTAGTCCAAGTGATGGCCCTTGGCACAACAAAATTCTCTCAAAAATCTAGATAGTTTATTACCTATTTTATTCCAGGCAATTTTCTTGTTCCATGTCCCACACCTAACATACATGCATGTATATTGTGTTTATGAGACAGAGATAGGAGAGTTACATGTTTGTCTGTTTTTCTTCAGCCATTTTATCCAACTAAAAGAATTTACTGGGAAAACCTTAATCTCTTCAGAAGTTTTAACAATGAATATGGCAACCATGCTCTTATTTTATCCTCATCTTAACGCTAAGTTTTGATTAGCCTCCACATTTATTTGTATTTTATTTATTTAAAACTCCCCCCTTTTTTCTTAGTCTAGCTGAGGGTTTCTCTATTTTTTCAAAAAACCAACGCTTATTTTTGTTGATTTTTTCTATAGATTTTCTAGTAGAAATAATTCTGCCTGATCTTTATTATTTCCTTCCTTCTGCTAACTTTGGATTTAGTCTGCTCTTCTTTTCCTAGTTCCTCGAGATGTAAAGTTAGGTTGTATTCCAAAAATTCCTCTAGGTTTCTGTGTCAGTGGATGATGGAGCAAAAAAAAGAAAAAAAAATTCCTCAGTTTTATCTTTTAATAATTAGGTTTAAGAAACAGTTCTCTTTTCAATCCTGCAAGTACTTGAATTTACATTACCGCTATTCTTTTCCATTCTTGCTTCTTGCTTGCAAATTAGTCAGTTTTGTTCTGATTTTGTCTCTTTTAAGATCTCTGTACCCACTTCACAGATGAAGAAAAATAACTATCCAGGCTGGGCACAGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTCAAGGCGGGCAGATCACTTGAGGTCAGGAGTTGGAGACTGGCCTAGCCAACATGGGGAGACCCCTGTCTCTGCTAAAAATAAAAATAAATTAGCTAGGCGAGGTGACGGGCACCTATAATCCCAGCTACTCAGAAGGCTGAGGCATGAGAATCGCTTGAACCCGGAAGGTAGAAGTTGCAGTGAGCTGAGATTGCACCACTCCATTCCAGCGTGGGCGACAGAGCGAGACTCCATCTCAAAAAAAAAAAAAATGAAACAAAAAAGGCCAGGCCCGGTGGCTCACCCCTGTAATCTCAGCACTCTGGGAGGCTGAGGCAGGTGGATCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACTGAAAATACAAAAATTAGCCAAGTGTGGTGGCGCATGCCTGTAATCCCAGCTACTCCGGAGGCTGAGGCAGGAGAATCGCTTGAACCCCGGAGGCAGAGGCTGCAGTGAGCTGAGATTGCATCAGCACACTCCAGCACTCCAGCCTGGGTGATAGAGTGAGACCCTGACTCAAAAAAAAAAAAAAAGAAAAGAAAAGAAAAGAAAGAAAGAAAAAGAACTCTCCAGCCACTCCAGAAGGCCCCCAGAATGTCCAGGACCAACCACAGCTCTCTACCTCCATGAGGAATTACTCTCACAGCTTTTAGAGGACTCATTTCCTTACATTATTACTGTATTGTTTTGTTATCCTAATTTACATCCCTAGATACTGTAGTTTAGCCTTCTCATTTCTTAAAATTGATACTGTAGCAGGACGAGCCGCAGACCAAACCTCTCAGACACCTAGTTGTAGAAGGAAGGGCTTTATTCAGCTGGGAGCATCGGCAAGTTACTGCCTTAAAATCCGAGCTTCCTGAATGCACAATTTCTGTCCCTTTTAAGGGCTCACAACACTAAAGATTTCACATGAAAGGGTCATGATTGATTTGAGCAAGCAGGTGGTACATGACAGAGGCTGCATGCACCAGTGGTCAGAGAGAACAAGAACAGGGCAGGGAGTTTCACAGTGTTCTTCTATATAATGTCTGGAATCTATGAATAACACTGGTTTCTAAGTTATGAGTTGATTTTTAACTACTGGGTTTAGGCCAGGCATGCTCAGACCTGGTTTCTGGCCTGGCGCCAGGCTGCCTGTCTTTGGTTTTACTTCCTTGCTGTTTTTTCTTAAAACAGGTACTGAGTATAAAACAATATGAGAGGGTCTCTGTCTTCCTTCAATACCTTTAACTTCCTTTCAATCTATCTTTTTTCTTCTCTCCTAACTCATTTGTTGGAAGAGCCTACACTTGTGAGCTGTAGTTTCCCGTTGCTTGGCTTTTGCTAAGTGAGTTCTCATGCTTCAGGGCAACATGTTGCTTTTTCCTGCATATTGGCAGCTGGATCCAGAGTCTTGATCAGAATCAATTTGATCCCTTTGGCAAGGTTAGAGTGGTGGTGTGTTCATTCATCATGAAGCACATTATGTCTTTCTCTTTCTTGATGTTAGCAGCCTTTGATGCTCAATGCTCAGGACCCTTTAATTCACAGGAGTTTACAAATGGTAATATTTTAATTCTATCATTTTGTTTTCACTTATTGGCTGTAATAATTTTGTAAAACGACACTTCTGCTCATCTGCTACTTGATTACCCAGGGATAGCGGCTATATAGAAAAGGCAGGATACATAGATTGATATTCTCTTGTTGATAGTCATTACAGACTCTTAAGTTTAAACATATTTGATAGTTTTTAACCCATTGAAATCTGTATCTTTATTAAGGTTCAAATTATCCTACCCTTGGTGAATACAAGCCTCTTTGAATTGACTCTTGTGTCCTTTTGACACAAACCTAGTAGCTTTTTATAGCTTCTATGCTATCTGGTATGTCAAAATGTCCCAGTTTCATTTTGTACACTTCCTGTCCACAATCAATGTTCCAAGAACCCCTCTTTTTAAAAAATGGAAATGATCTTTGAGACCATAGTCTGGATGCTGTGAATGTGTGTTCATAGTGAGCTGGTCATTGTTTCTTGACCTCTTTATGGATAGAGCTAGAAATTGCACACACACACACACACACACACACACTCCTTTTTCTTTTCCTAAACATGTCTTTCCAGCATGCCCCTTCTTTTCTCTTTGTTATAGCATGGGATTCATCACAATATGTTTTTATTGCTTCTTTTTATTTTCCCAAATTTTAGAAGAAAGATTGGTTAGCTAGCTCTACTAACTTTTTTTCAAGCTCACAAAATTTTCTCTTCTCTTGTTTTTATGAGATTTTAAAAATAGCACCTTGTTTTGGGAGTTTTTCTTACTTGTTTTCCCACCTCACTATAATCTTGACCTATTTTTTCCTGTGGGATTTTTTTCTATATCCTAATTTCTGACTTGAAATTGCACTGTCAAGAAAATAATTTACACGTAGTATGATTTTATAAGCAAAGATGCTCATTGCTGTGTTTTTAACTAGAAAAAGTTAAAATGTTTTTGTTATTATTTCTAAAAAGGGGGAAATTGTTAATTTGTGTTCAAAACCAGAGGAATAATTAGACAAATCATTGCTGTTATTAAAGACTTACCTGGAAAAAATTTGGAAGTATTTCTCCCTGTGTCAAATCTTAAAGACATGGCTCATTGAATCAAACAAGGAAACCACCCACATTGATTTATAAACATAAATTAGTATTACTGAATGGTAGTTTGGGAGAAAAATATTTTCTAATAAAGGATTAATGCTGCAAGTTACCTAGTACCTAGTTCTTCCTTACCTTCTGCTATTGGTATGGAAATATTAGATCTCCTTGAAATAGACTGGTTGCTATTCATCTTAACCTTTAATATATCAGACCCAAATTTCTTTATACTGACAGTTCAACAAGTGAAATTAAATAATTAAATTGACTTAATCTGATTAACATTTTTTTCTTTCTACATATCTTCAATAGCTTTACTTTAATGAACTCTCTATAGAAGCCCTCCAACATTTCTGGTGTCAGACATTACGGACAACTGAATATATCATGAGTAGGTCTCATAGGGAGAATGTGGAACCTGTTCTTCCCTTCTTTCAGGCTTTAGGGAAAGAGTGAAAAATAAGAATAAATTCTGACATACTAAACAAGACAGCATATAAAATAGAATATTCACAAATCATTTGGTTACCTTATTTTTTAAACCTCATATAGCCAAGACAAATTTTAAACTAAGAAGTACTAGTGGGGATTTGCAGGGATCGTGGCGGATGGGAGGCAGGACTAGATTGCAGCTCCGGACAGAGCAGCATGTGGAGGCTTGCAATGTGAATTTTAGCTCCAGATCAACTGCAGGAACAAACTAGCAATCCTGAGAGGAGCCACAGACCCTCTGAAGGAAGCTGACTGCTCCTGCAGGACCTCGAAGACACCCCAGATACTATGAGTAACCCCAACTACAGAAGTGGGAAAGGGAGGCCCTCCTCTCCTGAAAACACACCCCCACTGGAGAAGCTGAAGGTCTGTTTGTGGGAGAAGTTTCTGACTTTACCTGGAGTTGAGTCAAGTTAGAGAGCCGACCCCAGCGAAATACGGGAGTAGAGGAAGCAGCAGAAAGGCCCTGGGAGCTCGCTGGGTCCCCAAGCAGCCCATTCCTGCCTGGCACCACAGGGATCCATAGGGAGGGTGGCCAGAGGAACAGGGGGTAAAACTCCACAGGGGGTTTTTAGCTGGCCTTTGTAACAATTTGAACTGGGCAAGAAGCCTCCTGGCCAGAACTTGGGGGAGAGTGCGAA
S 12942134 A
S 12942148 T
S 13476238 TTGAGGTCAGGAGTTCGAGACCAACCTTACCAACATGATGAAACCCCATCTCCACTAAAAAAATACAAAATTAGCCAGGCATGATGGCAGGCACCTGTAATCTCAGCTACTTGGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGCAAAGGTTACAGTGAGCCGAGATCACGCCATTGCACTCCA
S 13476226 AAAGCATAGCCAAAGTGTTAAATGGGCAGCATTTTATATTATA
S 12942139 ATATTCACAACCAATACAGTGATAGGATCTTAAGGTTGTAGATGCATACTGTGAAGCCAGTCTCCCCTGACCCCTGCACATGATTCCTAAATCAGGTTACAATAAAATTGATTTGTAAACTATTTCCATAGTCCACTAATGGCAAATCTGGCATCGTTTGCTTTTTAGTGCTGTCTTTAAAGATACATAGGAGCAAGACACCTCTATGAAATGAGGTGCTACAAGAAGGAAATTTGGAGCTTGAAGAAAAGAAAGGTCTCCAGAAACTATAGCACGGACAAGACTGAAATTGTTTTAAAAATAGCTAATTCTTAAAAACTGAATCTAAACTCTGCTTGTGAAATGCTCCTTGGAGCAGTATAACAAGTAAAAGTCCAGTAAGACCACAAAGGTTCCAGCGAACACCACCAGGAAACTCC
S 13476212 G
S 12942165 C
S 13476222 TGTAAAACTTTTCCACCATTTAACAAGTGAACATTAGAAAAAAATCCAAGATTATTAGATTGTCTCATTTCTTACAACAGTTCTACAAATTGCAAATTGGTGATCCATAACATTTGTGCATATATACTAAATGATTTTAATAGTAATATCCATGACACTTTTGATAGTTTGCTTCAAAAACTGGAGTTACAGA
S 12942146 T
S 12942141 AGCTTGTTGAACCTTGGCAGCTCCTGTGAAGGAGATCCCAGTCTACAGGAAGCTCTGCACATAGAATGAGACGGGCTTCACTGCATCTCTTACTCTTCTTGGGGTCCACTGTGGCTTCTAAACTAAAGGTGCACCTGAAGTAAGTTCTATGTGTATTTTGATAGGGATTAATTTAAAATAAACTCCAGAAGCATTAAAACAACCAGCACAGAAAGTTTGTCAATATCCCCGTTAATATCTGATTTTAAAGATTCAATTTCTTTTCTATGTTTACAGGCCAATTTTTAAA
S 12942143 G
S 13476224 C
S 13476228 T
S 12942149 G
S 12942147 T
S 12942151 C
S 12942168 T
S 12942142 T
S 13476215 G
S 12942162 T
S 13476232 GGCCAAGAATATAAGCTCAGATTCTGTTAAAAATATTTCTCAAAGGCTGGGC
S 12942135 G
S 13476237 C
S 13476220 G
S 12942164 GA
S 12942158 TCCTACCTTTGGCAGGAAAATATTTTG
S 13476214 T
S 12942153 T
S 13476236 A
S 12942171 ACTTCACAGGAAGGGGAAGAACTAAAGCCCTTTTCTTTCGCAGCTGGAAGGAGGAAAGCCTCAGGAAAGTTTTCAAGCTCTTCTTTGCCCTCCACCTGGAAGCAGACTGTTGGGGGAGGCATGGTGGGAGTGAGACCAGCTCTTATGTCTTTGCATGGGAGCTGGGTGAGGCCTGTGACTGCCAGCTTTCCCCTATTTCCCTGACAATGTGCATGACTCAGCAGAGGCAGCCATAATCCTTCTGGGTACACAACTCCAGTGGCCTGGGAATCTCACCCCCATCTCCCACAGCAGCTGCAGCAAGACCCACCCAAGGATAGTCTGAGCTCAGACACGCCTAGCCCTACCCCCACCTGATGGTCCTTCACTATCCACCCTGGTAGCAGAAGACAAAGGGCATTTAATCTTGGGAGTTCTAGGGCCCTGCCCACCATCTGTGCCTCTCCACACTACTATAGCTGATGCTCTCTAGAAAGCACCACCTCCTGGCAGGAGGCCAATCGGCACAAAAATAGAGCATTCAACCACCAAAGCTAAGGACCCTCACAGAGTCCATTGTATCCTCTGCCACCTACACCAGAAGAGGCACTGGTATCCACGGCTGAGAGACCCATAGACGGTTCAAATCACAGGACTCTGTGCAGACAACCCCCAGTACCAGCCCAGAGCCGGGTAGACTCGCTGGGTGGTTAGACCCAGAAAAGAGACAACAATTATTGCAATTCGGCTGACAGGAAGCCATATCCATAGGAAAAGGAGGAGAGTACTACATCAAGGGAACATTCTGTGGGGGAAAAAACAAAACAAAACCTGAACAACAGCCTTCAGCCCTAGAACTTCCCTCTGACAGAGTCTACCCAAATGAGAAGGAACCAGAAAACCAACCCTTGTTATATGACAAAACAAAGCTCTTGAACACCCCCAAAAATCACACTAGTTCACCAGCAATGGATCCAAACCAAGATGAATTTCCTGATTTACCTGAAAAACAATTCAGGAGGTTAGTTATTAAGCTAATCAGGGAGGCACCAGAGAAAGGCAAAGCCCAATGCAAGGAAATCCAAAAAATGATACAAGAAGTGAAGGGAGAAATATTTGAGGAAATAGCTAATTTAAAGAAAAAAACAATCAAAACTTCAGGAAACATTGGACACACTTACAGAAATGCAAAATGCTCTAGAAACTGTCAGCAATAGAATTGAACAAGTAGAAGAAAGAAATTCAGAGCTCGAAGACAAGGACTTTGAATTAACCTAATCCAACAAAGACAGAGAAAAAAGAATAAGAAAATATGAACAAAGCCTCCAAGAAGTCTGGGATTATGTTAAATGACCAAACCTAAGAACAATCGGTGCTCCTGAGGAAGAAGGCAATTCTAAAAGCTTGGAAAACATATTTGGTGAATAAACGAGGAAAATTTCCCCAGCCTTGTTAGAGACCTAGACATGCAAATAAAAGAAGCACAAAGAACACCTGGGAAATACATCGCAAAAAGATCTTCGCCTAGGCACATTGTTATCAGGTTATCCAAAGTTAAAACGAAGGAAGGAATCTTAAGAGTTGTGAAACAGAAACACCAGGTAACCTATAAAGGAAAACTAATCAGATTAATAGCAGATCTCTCAGCAGAAACCCTACAAGCTGGAAGGGATTGGGCCCTATCTTCAGCCTCCTCAAACAAAACAATTATCAGCCAAGAATTCTGTATTCAGCAAAATTCAGTATCATGTATAAAGGAAAGACATAGTCGTTTTCAAACAAATACTGAGAGAAATAATTCACCACTATCAAGCCACCACTACAAGAACTGCTAAAAGGAGCTCTAAATCTTGAAACAAATCCTGGAAACACATCAAAACAGAAACTCTATAAAGCATAAATCACACAGAACCTATAAAACAAAACTAAAAGTTAAAAAGCAAAAACAAAAAACAAAAAAATCCAAAATTCCCAAACAACAAAGAGCATGATGAATGCAACCCTACCTCACATTTCAATAGTAACATTGATTGTAAATGGCCTAAGTGCTCCACTTAAAAGATGCAGAACCGGCTGGGTGCAGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTCAAGGTGGGTGGATCACCTGAGGTCAGGAGTTCGAGACTAGCCTGGTCCACATGGTGAAAACCTGTCTCTACTAAAAATATAGAAGTTAGCCAGGCATGGTGCAGGTACCTGTAATCCTAGCTACTTGGGAGGCTGAAGCAGGAGAATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGACCGCACCATTGCACTCCAGTCTGGGCAACAAGAGCAAAACTCTGTCAAAAAAACAAACAAAAAACAACAACAACAACAACAAACAGATACAGAACCACAGTATGGATAAGAACTCACCAATCAACTATCTGCAGCCTTCAGGAGACTCACTTAACACATAAGGACTCACACGAACTTAAAGTAAAGGGGTGGAAAAAGGCATTTCATGCAAATGGACACCCAAAGTGTGCATGAGTAGCTATTCTTATATCAGACAAAACAAACTTTAAAGCAACAGCCATTAAAAGAGACAAAGAGGGACATTATATAATGGTAAAGGCCTTATCCAACAGGAAAATTTCACAATCCTAAACATATATACATCTAACACCGGAGCTCCCAAATTTATGAAACAATTACTAATGGACCTAAGAAATGAGACAGACAGCAACACAATAATAGTGGAGGACTTCAATTCTCCACTGACAGCAATAGACAGCTCATGAAGCCAGAAAGTCAACAAAGAAACAATGGATCTAACTATACCTTTGAACAAATGGACTTAACAGATATATATGGAGCATTTCATCCAACAACTGCAGAATACACATTCTATTCAACAAGGCATGGAACTCTCTCCAAGATAGACCATGTGATAGGCCACAAAACAAGTCTCAATAAATTTAAGAAAATTGAAATTATATCAAGCACTCTCTCAGACCAGAGTGGAAAAAAAACTAGAAATCAACTGCAAAAGGGACCTTTAAAACCATGCAAATACATGGAAATTAAATTACCTGGTCCTGAATGAGCATTGGGTCAAAAACGAAATCAAGATAGAAATAAAAAATTATTTGAACTGAAAGACAATAATGACACAGCCTATCAAAACCTCTGGGATACAGATAAGAGGAAAGTTCATAGCCCTAAAGGCCTACGTCAAAAAGTCTGAAAGAGCTCAAACAGACAATCTAAGGTTACACCTCAAGGAACTAGAGAAACAAGAACAAACCAAACCCAAACCCAACAGAAGAAAAAAATAACGAAGATCAGAGCAGAACTAAATGAAATTGAAACAAAAAAATACAAAAGATAAATTTAAAACCTGGTTCTTTGAAAAGATAAATAAAATTGATAGACCATTAGCAAGATTGACCAAGGCAAGAAGAGAGAAAATCCAAATAACCTCACTAAGAAACAAAACAGGAGTTACTACAACTGACACCACTGAAATACAAAAGATCATTCAAGGATACTATGAACACCTGTATGCGCATAAACTAGAAAACGTGGAAGAGATGGATAAATTCCTGGAAAAATACAACCCTCCTAGCTTAAATCAGGAAGAATGAGATACCCTGAACAGACCAATAACAAGCAGTTAGAGTGAAATGCTAATTTAAAAATTACCAATAACAAAAAAAGTCCAGGACCAGACAGATTCACAGCAGAATTCTATCAGAGATTCAAAGCAGAATTGGTACCAATTCTTTTGACACTATTCCACAAGATAGAGAAAGAGGGAACCCTCCCTAATTCATTCTATGAAGTCAGCATCACTCTACTACCAAAACCAGGAAAGGACACAACCAAAAAAGAAAACTAGAGACCAATATCCTTGATGAACATATATGCTAAAATCCTTAACAAAATACTAGCTAAGTGAATCCAACAACATATCAAAAAGATAATCCACCATAATCAAGTTGGTTTCATACCAGGGATGAAGGGATGGTTTAACATATACAAGTCAATAAATATGATACACCGTATAAACAGAATTAAAAAGAAAAATCACATGATTATCTCAATAAGATGCAGAAAAAGCATTCAACAAAATCCAGCATCCCCTTATGATTAAAACTCCCAGCAAAATTAGCATACAAGGGACATACCTTAATGTAATAAAAGCCATCTATGACAAACCACAGCCAACATAATACTGAATGGGGAAAAGTTGAAAGCATTGTCTCTAAGAATGGGAACAAGACAAGGATGCCCACT
S 12942152 G
S 12942169 C
S 13476219 TTAACAGCATTTGTTGAGGTTTTTTTCTATAAAAATTGATTTGACTTTTAAGCAAAAGTATATTAGTTTTAAAATTGTGTTTTCCAAGTTAAATTTACTAAATCTTGTGTCAATAGAATATTATTAATATTTAATTCAAAGAGTGTTGCATAAACTGAGTGATTATTCTAAATTTATTAATTTCAAAAATAATTCTTTGAATTTTCCTTAAAGATTTTGAAGCCAGTTTACATAATGTTGTTAACAATTAACATATTCTTTATGCTGATTTATGCTAAAATATATGAAATAATTATTGCTCATTTGTATGATGAAAAGTTTTAATTTCAACATTTCAACATATATTTTTGATTTTGTCTAGTTAGGTCAGAAATTAGCTTTTCCTTTCCTTGGAACTTCAAATTTAGCTCATTAATGTGCAGTATAATATTAGTGAGAAAACAAGTTACTGTCATTTTTTGTCTTTGACTATTAAACATTTAGCAAGCATTTCTTTTGCTTTAAGAAAATACAGAATTAGATTCAAAAGTATAGTAAATCC
S 13476218 C
S 12942167 T
S 12942150 T
S 12942133 ACATAACAGCCTGAATTAGTTACATAATAAAAGTGAGAACAAATTTATAAA
S 13476225 ATTTTCAATATGTGTCATGCAGTAGAATGAAGCAATTAGAGAAATGTCAGGCCCACATTAAATATCAGTAAATCTTTATCATTGACCTAACATAGCTGGAACATTGACAACTGTGATAGAACAAATTTTTTATATCTATCTGAAATCATTCTTTGAAAAATATGAAAGTTCCAAAAATATTAATGACAGGAGTTAAATTTTTTAGAATGCACATTGAAATTTCTTTAAAAAAACTGAAATATTTTGAGAC
S 12942137 T
S 13476227 C
S 13476223 T
S 12942159 T
S 13476216 ATTAATTCATAAAATTACAGTTGAAGAAGCAGATTCATATACCCAAGATGTTCAAAATACACTTAAACACTTTCCAGTAACTTTAACTACTAAGAATCATTTTCCATTAATAATCTCACTCAC
S 12942144 CTC
S 12942161 TG
S 12942140 AGTGATAACCCCA
S 12942138 C
S 12942172 CGTGTTTCTGATTGCCTTTTAGTGGGCATGTGCTTAGACCTCAAAACTCACTCTATGTGCAGAGCTTCCTGTAGATTGGGATCTCCTTCACAGAAATGCTGTGTCTCTCTGCGGATGCAGCACACAGGAAGAAGTCACCCACTCGCTAC
S 12942154 T
S 12942156 T
S 12942145 T
S 13476229 AACTCATCTAAAGCTTAACTATAAACCTGAAATGTTTAAAATTTTGAATCAATTAATCATTGCTATTGTTAGAAACATCTTGTATTCTCCAGGTGATTTTTTGGTAGCTTAATTGAAGATCTTTCCACTTTTGTAAAATAATTCTTTGAGTATTTTTCTTTTAACTTTCTAACCAAATTTACATAAGTGAAATAGTGATTTATTTTGCCATCTCTTCATCACAATATTGTTTTGTTTGGGTATGTGTGTTCAAAAGTATAGGCTTCTTTATAGG
S 13476233 G
S 12942157 TTTTTTTTTTTTTTTTTGAAAAGCAGGTT
S 13476217 A
S 12942160 CC
S 13476230 G
S 12942136 CAATTTATGTTTGTTTATATTCTGGATGTAATTCACACTTA
S 12942170 TGGCATGCAG
S 12942155 G
S 13476211 T
S 13476235 CAGTAGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCTGAGGTGAGTTGATCA
S 12942132 G
S 12942163 AC
S 13476231 A
S 13476213 TTAAAATGCTAACTTTTTATCTGAACTATTTGATTTA
W CHM13 0 chr13 69999861 70002649 >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942145>12942146>12942147>12942148>12942150>12942153>12942154>12942156>12942157>12942159>12942160>12942162>12942163>12942165>12942167>12942168>12942170>12942172>13476211>13476213>13476214>13476216>13476217>13476219>13476220>13476222>13476223>13476225>13476226>13476227>13476229>13476230>13476232>13476233>13476235>13476236>13476238
W GRCh38 0 chr13 0 1763 >13476211>13476213>13476214>13476216>13476217>13476219>13476220>13476222>13476223>13476225>13476226>13476227>13476229>13476230>13476232>13476233>13476235>13476236>13476238
W GRCh38 0 chr13 52366382 52367407 >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942145>12942146>12942147>12942148>12942150>12942153>12942154>12942156>12942157>12942159>12942160>12942162>12942163>12942165>12942167>12942168>12942170>12942172
W GA22163 1 h1tg000029l 11026228 11036363 >12942159>12942161>12942162>12942164>12942165>12942166>12942167>12942169>12942170>12942171>12942172
W GA22163 1 h1tg000063l 13640679 13642707 >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942153<12942155>12942156>12942157>12942158<13476225<13476223<13476222<13476220<13476219<13476217<13476216<13476214<13476213<13476211
W GA22163 2 h2tg000014l 0 1020 <12942172<12942170<12942168<12942167<12942165<12942163<12942162<12942160<12942159<12942157<12942156>12942155<12942153<12942144<12942142<12942141<12942140<12942139<12942137<12942136<12942134<12942133<12942131
W GA22163 2 h2tg000028l 10682061 10683824 <13476238<13476236<13476235<13476233<13476232>13476231<13476229<13476227<13476226<13476225<13476223<13476222<13476220<13476219<13476217<13476216<13476214<13476213<13476211
L 13476221 + 13476222 + 0M
L 12942131 + 12942133 + 0M
L 13476234 - 13476235 + 0M
L 12942166 + 12942167 + 0M
L 12942134 + 12942136 + 0M
L 12942148 + 12942150 + 0M
L 12942148 + 12942152 - 0M
L 13476226 + 13476227 + 0M
L 13476226 + 13476228 + 0M
L 12942139 + 12942141 + 0M
L 12942139 + 12942140 + 0M
L 13476212 - 13476213 + 0M
L 12942165 + 12942166 + 0M
L 12942165 + 12942167 + 0M
L 13476222 + 13476223 + 0M
L 13476222 + 13476224 + 0M
L 12942146 + 12942147 + 0M
L 12942141 + 12942142 + 0M
L 12942141 + 12942143 - 0M
L 12942143 - 12942144 + 0M
L 13476224 + 13476225 + 0M
L 13476228 + 13476229 + 0M
L 12942149 - 12942150 + 0M
L 12942147 + 12942148 + 0M
L 12942147 + 12942149 - 0M
L 12942151 + 12942154 + 0M
L 12942168 + 12942170 + 0M
L 12942142 + 12942144 + 0M
L 13476215 - 13476216 + 0M
L 12942162 + 12942163 + 0M
L 12942162 + 12942164 + 0M
L 13476232 + 13476233 + 0M
L 13476232 + 13476234 - 0M
L 12942135 - 12942136 + 0M
L 13476237 - 13476238 + 0M
L 13476220 + 13476222 + 0M
L 12942164 + 12942165 + 0M
L 12942158 + 13476225 - 0M
L 13476214 + 13476216 + 0M
L 12942153 + 12942154 + 0M
L 12942153 + 12942155 - 0M
L 13476236 + 13476238 + 0M
L 12942171 + 12942172 + 0M
L 12942152 - 12942153 + 0M
L 12942169 + 12942170 + 0M
L 13476219 + 13476220 + 0M
L 13476219 + 13476221 + 0M
L 13476218 - 13476219 + 0M
L 12942167 + 12942168 + 0M
L 12942167 + 12942169 + 0M
L 12942150 + 12942151 + 0M
L 12942150 + 12942153 + 0M
L 12942133 + 12942134 + 0M
L 12942133 + 12942135 - 0M
L 13476225 + 13476226 + 0M
L 12942137 + 12942139 + 0M
L 13476227 + 13476229 + 0M
L 13476223 + 13476225 + 0M
L 12942159 + 12942160 + 0M
L 12942159 + 12942161 + 0M
L 13476216 + 13476217 + 0M
L 13476216 + 13476218 - 0M
L 12942144 + 12942157 + 0M
L 12942144 + 12942145 + 0M
L 12942144 + 12942146 + 0M
L 12942144 + 12942147 + 0M
L 12942144 + 12942148 + 0M
L 12942144 + 12942150 + 0M
L 12942144 + 12942153 + 0M
L 12942144 + 12942154 + 0M
L 12942144 + 12942156 + 0M
L 12942161 + 12942162 + 0M
L 12942140 + 12942141 + 0M
L 12942138 + 12942139 + 0M
L 12942154 + 12942156 + 0M
L 12942156 + 12942157 + 0M
L 12942145 + 12942146 + 0M
L 13476229 + 13476230 + 0M
L 13476229 + 13476231 - 0M
L 13476233 + 13476235 + 0M
L 12942157 + 12942158 + 0M
L 12942157 + 12942159 + 0M
L 13476217 + 13476219 + 0M
L 12942160 + 12942162 + 0M
L 13476230 + 13476232 + 0M
L 12942136 + 12942137 + 0M
L 12942136 + 12942138 + 0M
L 12942170 + 12942171 + 0M
L 12942170 + 12942172 + 0M
L 12942155 - 12942156 + 0M
L 13476211 + 13476213 + 0M
L 13476235 + 13476236 + 0M
L 13476235 + 13476237 - 0M
L 12942132 + 12942133 + 0M
L 12942163 + 12942165 + 0M
L 13476231 - 13476232 + 0M
L 13476213 + 13476214 + 0M
L 13476213 + 13476215 - 0M
Thanks for the detailed follow-up. I'm not sure about the blue nodes, but I definitely think it's fine for you to cut your input contigs at the inversion breakpoints as a way to patch this issue in your graph. It would probably be interesting to try to go back to your reads or sample to see if the inversion is an artifact or not. In either case, it is not being well-handled in the graph.
I still don't understand where the VCF error is coming from. If you are ever able to share the data I would be very interested. Over here, I made a little graph with CHM13 and HG002, but inverting a 30mb chunk in the latter. The VCF contains the 30mb inversion and seems perfectly valid. The inversion itself looks like a big loop (like yours) in Bandage, which isn't what I expect, but I need to figure out of that's an issue in the graph or just how the layout algorithm works.. .
I aligned hifi reads to the haplotype with inversion. The alignments at the breakpoint suggest it is a real heterozygous inversion: reads fully aligned to this region is from the inversion, while reads align to both this contig and h1tg000029l is from the haplotype without inversion.
For the error in VCF, what I observed is:
INFO/LV=1
. I am not sure how vg deconstruct
works but it seems that the presence of a large inversion in the graph causes it to generate 30mb alleles for all haplotypes. Other variants are then nested within these 30mb alleles. As more samples are added, the VCF record become longer. So, the VCF error may only occur when there are a lot of samples, likely due to the alleles are too long to store. I am wondering if you tried making a graph with CHM13, HG002 with 30mb inversion, and more samples without 30mb inversion. Is it expected for samples without the 30mb inversion also have 30mb alleles in the VCF?
I am sorry, but due to research ethics, I am unable to publicly share the individual level human genetic data.
Thank you again for your help.
OK, to resume
AT
records (also very big) combine to overflow a buffer somewhere. Here is a snapshot of the current master branch of vg:
http://public.gi.ucsc.edu/~hickey/vg-patch/vg.5bc8f8a55f682d826d5276ae502e4cf6743d68b3
It is patched to not write lines greater than 2Gi in length (you will get a warning and it will otherwise disappear). Using this, you should at least get a valid VCF (please confirm if you are able to test).
This version of vg also has a new option vg deconstruct -L
that merges alleles with sufficient similarity. If you run that on your graph, using say -L 0.9
, it will merge together alleles that are >= 90% identical. Presumably, this will vastly reduce the number of alleles at your big site, allowing it to be represented in your VCF. I'm still testing this option out, but I think it is the only sane way for dealing with megabase-scale variation at this point.
The next vg release is scheduled for July 1 and will contain these options...
Thank you so much! The latest vg
successfully generate a valid VCF.
I generated 3 VCFs:
vg 1.56.0
: vg_156.vcf.gz (invalid VCF)vg
: vg_patched.vcf.gz (valid VCF)vg
with -L 0.9
: vg_patched.cluster.vcf.gz (valid VCF)
vg.1.56 deconstruct graph.gbz -P CHM13 -C -a -t 128 -O -r /graph.snarls | bgzip -c > vg_156.vcf.gz
vg.patched deconstruct graph.gbz -P CHM13 -C -a -t 128 -O -r /graph.snarls | bgzip -c > vg_patched.vcf.gz
vg.patched deconstruct graph.gbz -P CHM13 -C -a -t 128 -O -r /graph.snarls -L 0.9 | bgzip -c > vg_patched.cluster.vcf.gz
I can see warnings only when generating vg_patched.vcf.gz
. It raised a warning exactly for the error record I reported in this issue chr13:72847648 with ID=>12942157>13476226
. It is also very interesting to see another 2 giant alleles in the VCF, which I didn't notice before.
Warning [vg deconstruct]: Skipping variant at chr6:140509080 with ID=>74056097>74056464 because its line length of 4782792297 exceeds vg's limit of 2000000000
Warning [vg deconstruct]: Skipping variant at chr13:72847648 with ID=>12942157>13476226 because its line length of 5981954706 exceeds vg's limit of 2000000000
Warning [vg deconstruct]: Skipping variant at chr3:164429312 with ID=>59766868>60188399 because its line length of 9946836003 exceeds vg's limit of 2000000000
I compared the file size and number of record of the 3 VCFs. Compared to vg 1.56.0
, the patched version exactly drops the 3 large lines. After applying -L 0.9
, the file size further decreased from 16G to 2.6G, while the number of records is very close, so it mainly merges similar alleles as expected.
vg_156.vcf.gz | vg_patched.vcf.gz | vg_patched.cluster.vcf.gz | |
---|---|---|---|
File size | 19G | 16G | 2.6G |
No. of records | 24982571 | 24982568 | 24982295 |
There is one minor issue with allele merging. Although vg deconstruct -L 0.9
didn't raise any warning, it doesn't keep the giant alleles in the VCF. I cannot find it by positions or ID. The largest alleles in those regions are much smaller.
bcftools query -f '%ALT' -r chr3:164400000-164500000 vg_patched.cluster.vcf.gz | awk '{print length}' | sort -n -r | head -n 1
879
bcftools query -f '%ALT' -r chr6:140500000-140600000 vg_patched.cluster.vcf.gz | awk '{print length}' | sort -n -r | head -n 1
638
bcftools query -f '%ALT' -r chr13:72800000-72900000 vg_patched.cluster.vcf.gz | awk '{print length}' | sort -n -r | head -n 1
5701
Even so, the patched vg
works well for my analysis. I can separately analyze the large SVs because they are very rare. I greatly appreciate your help and suggestions!
Hi,
I am running
cactus-graphmap-join
with--vcfwave
and found an empty output VCF for one chromosome in the working directory. It is due to the input VCF is truncated. However,vcf-bubwave
cannot raise the error at once and keep running for other chromosomes.For the error input VCF file: I found one record on chr13 only has 5 fields (i.e., chr, pos, ID, ref, alt). This single record is 1.6GB in plain text (very large alleles). I am not sure if this is caused by that
cactus-graphmap-join
has been killed by SLURM and restart several times due to time limit. I am now re-run this step on another cluster without time limit to see if it can be reproduced. Will provide more information if I have this error again.Part of the error record:
Related segments in GFA:
For
vcf-bubwave
: Because the error VCF file cannot be parsed bybcftools
, the per-chr VCF generated bybcftools view $INFILE -r {} | bgzip > $INFILE-{}.input.vcf.gz
is truncated. With this truncated input, vcfbub will produce nothing and finally result in an empty file. I can see the below error message when I manually run the code invcf-bubwave.sh
, butcactus-graphmap-join
will not raise this error.Besides, in job
make_vcf
,vcfbub
can read the input VCF with the error and will remove the error record with 5 fields in the output. So, the filtered VCF looks quite normal. It seems thatvcfbub
does not check if the input VCF follows VCF specification. Do you think this is also risky?vcf-bubwave.sh
: