ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
481 stars 106 forks source link

vcf-bubwave does not raise error when it fails #1402

Open Han-Cao opened 3 weeks ago

Han-Cao commented 3 weeks ago

Hi,

I am running cactus-graphmap-join with --vcfwave and found an empty output VCF for one chromosome in the working directory. It is due to the input VCF is truncated. However, vcf-bubwave cannot raise the error at once and keep running for other chromosomes.

For the error input VCF file: I found one record on chr13 only has 5 fields (i.e., chr, pos, ID, ref, alt). This single record is 1.6GB in plain text (very large alleles). I am not sure if this is caused by that cactus-graphmap-join has been killed by SLURM and restart several times due to time limit. I am now re-run this step on another cluster without time limit to see if it can be reproduced. Will provide more information if I have this error again.

Part of the error record:

chr13   72847648        >12942157>13476226      TTCCTACCTTTGGCATGCAGCGTGTTTCTGATTGCCTTTTAGTGGGCATGTGCTTAGACCTCAAAACTCACTCTATGTGCAGAGCTTCCTGTAGATTGGGATCTCCTTCACAGAAATGCTGTGTCTCTCTGCGGATGCAGCACACAGGAAGAAGTCACCCACTCGCTACATACATCGCTAATTAGG

Related segments in GFA:

S       12942157        TTTTTTTTTTTTTTTTTGAAAAGCAGGTT                   SN:Z:CHM13#0#chr13      SO:i:72847619   SR:i:0
S       13476226        AAAGCATAGCCAAAGTGTTAAATGGGCAGCATTTTATATTATA     SN:Z:CHM13#0#chr13      SO:i:101634483  SR:i:0

For vcf-bubwave: Because the error VCF file cannot be parsed by bcftools, the per-chr VCF generated by bcftools view $INFILE -r {} | bgzip > $INFILE-{}.input.vcf.gz is truncated. With this truncated input, vcfbub will produce nothing and finally result in an empty file. I can see the below error message when I manually run the code in vcf-bubwave.sh, but cactus-graphmap-join will not raise this error.

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Custom { kind: InvalidInput, error: "corrupt deflate stream" })', src/main.rs:38:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Besides, in job make_vcf, vcfbub can read the input VCF with the error and will remove the error record with 5 fields in the output. So, the filtered VCF looks quite normal. It seems that vcfbub does not check if the input VCF follows VCF specification. Do you think this is also risky?

vcf-bubwave.sh:

#!/bin/bash
INFILE=$1
OUTFILE=$2
JOBS=$3
MAXLEN=$4
WAVE_OPTS=$5
for chr in `bcftools view -h $INFILE | perl -ne 'if (/^##contig=<ID=([^,]+)/) { print "$1
" }'`; do echo $chr; done | parallel -j ${JOBS} "bcftools view $INFILE -r {} | bgzip > $INFILE-{}.input.vcf.gz ; tabix -fp vcf $INFILE-{}.input.vcf.gz ; vcfbub --input $INFILE-{}.input.vcf.gz -l 0 -a ${MAXLEN} | vcfwave ${WAVE_OPTS} | bgzip  > ${INFILE}-{}.vcf.gz; rm -f $INFILE-{}.input.vcf.gz $INFILE-{}.input.vcf.gz.tbi"
bcftools concat ${INFILE}-*.vcf.gz | bcftools sort | bgzip --threads ${JOBS} > ${OUTFILE}
tabix -fp vcf ${OUTFILE}
glennhickey commented 3 weeks ago

Thanks for bringing this up. I'm hoping #1403 helps for the error handling at least. As for where this giant variant is coming, that seems like a bug somewhere else?

Han-Cao commented 2 weeks ago

Hi @glennhickey ,

Thanks for the quick fix!

I have re-run the pipeline, and the same error occurs, so it should be a bug. I cannot share the whole gbz / vg files because of the research consent. Do you have any suggestions on what analysis I can do to debug this issue?

The error record can only be found in CHM13-based VCF, when I try to get the record by bcftools view -r chr13:72847648, I got the error:

[E::vcf_format] Invalid BCF, the FILTER tag id=65 at chr13:72847648 not present in the header
[main_vcfview] Error: cannot write to (null)

However, I also searched for it in GRCh38-based VCF by chr13:73625000-73627000 (2kb flanking of liftover region), the error record cannot be found.

The code I used is :

# run cactus-graphmap-join
cactus-graphmap-join $jobpath \
--vg ${outpath}/chrom-alignments/*.vg \
--hal ${outpath}/chrom-alignments/*.hal \
--outDir ${outpath} \
--outName name \
--workDir $workdir \
--reference CHM13 GRCh38 \
--vcf --vcfReference CHM13 GRCh38 \
--vcfwave \
--viz --odgi \
--chrom-vg clip filter --chrom-og \
--gbz clip filter full --gfa clip full \
--haplo clip --giraffe clip \
--filter $n_filter \
--logFile log/cactus-pangenome/cactus-graphmap-join.log \
--maxCores 40 \
--indexCores 39

Thank you!

glennhickey commented 2 weeks ago

I cleaned up this part of the code in #1404, getting rid of the vcf-bubwave script altogether (and doing the logic directly in Toil). This should further improve error handling and resource allocation. If your issue is from a bug in vcfwave itself, it won't help you much but I'm hoping it at least fixes the Cactus end of things.

Han-Cao commented 2 weeks ago

Thanks for the update. I have finished the whole analysis pipeline, but error record still exists in the raw CHM13-based VCF.

I would like to manually check the graph and assembly to see what is wrong with chr13. Is there any method can identify which haplotype harbors the variant >12942157>13476226 from the gbz graph?

Thank you!

glennhickey commented 2 weeks ago

By default the IDs in the VCF correspond to the GFA and not the GBZ which is confusing. As the link says, simplest to use --chop to make sure they are the same.

Your best bet is probably running vg chunk on the xg output to look at your region.

I think if you have an invalid region in your VCF, then this is most likely a bug in vcfwave. I'd suggest checking in the region in the non-vcfwave vcf and following up, if possible, on the vcflib github (though please continue to report anything suspicious you find in cactus here).

Han-Cao commented 2 weeks ago

Thank for your suggestions. I will look into the xg output.

However, the error region should be generated by the MC pipeline:

I will look into the error region in the graph and provide more information about it.

glennhickey commented 2 weeks ago

Oh, I see, thanks for clarifying. I'd understood the error was after vcfwave. If it's in deconstruct that's much more serious. It must be a size limit being exceeded somewhere that's somehow preventing the format tags from coming through or something.

I haven't heard of this before but, like you say, maybe it's happened and nobody noticed because they were only using the vcfbub output which masks it. I just checked and the HPRC v1.1 graphs don't have this issue.

I guess you have an enormous site in your graph somehow and this is failing to deconstruct. I'd be curious if the site looks valid when you view it with gzip -dc instead of bcftools? Given that vg can write it and vcfbub can parse it, it seems most likely the limit is coming from bcftools. But in any case, it sounds like vg deconstruct should probably have a default limit to prevent these cases. And finally, I aggree you should double check your graph -- it does look like maybe there is a strange rearrangement happening...

Speaking of bcftools. Do you know if there's a way to use it to quickly check that the VCF is invalid in this case. Maybe bcftools stats? I can add that to the pipeline to make sure these errors are caught earlier, and even if vcfwave isnt requested.

thanks!

Han-Cao commented 2 weeks ago

(Note: I made some mistakes in the previous analysis and have revised this comment multiple times, please ignore all the previous version of this comment if you already read it)

I did some analysis and I think there could be some issue with deconstruct:

  1. I draw the 2D plot of chr13:70,000,000-110,000,000 in the chr13.og. There is a large bubble in this region. It seems many haplotypes cannot be aligned to the reference genome for this region. Moreover, as this error is only found in CHM13 VCF, I am wondering if it is possible that CHM13 and GRCh38 are in the different paths of the plot. Is there any method to check this?

image

2, I extracted the alternative alleles of this record by zgrep $var_id vcf.gz | cut -f 5 | tr ',' '\n' | awk '{print ">alt_allele." FNR "\n" $0}' > error_alleles.fa and mapped them to CHM13 by minigraph -t 40 -cxasm chm13.fa error_alleles.fa. The first 12 columns of the paf output is listed below. All the alleles are exactly mapped to the same region except for the last allele. However, as the last allele also start from the same position on chr13 and is much shorter than the others, I guess this is where the record is truncated

alt_allele.1    28775088    1   28775088    +   chr13   113566686   72847648    101634483   28727232    28808098    60
alt_allele.2    28783652    1   28783652    +   chr13   113566686   72847648    101634483   28724641    28817745    60
alt_allele.3    28787287    1   28787287    +   chr13   113566686   72847648    101634483   28735578    28813045    60
alt_allele.4    28770863    1   28770863    +   chr13   113566686   72847648    101634483   28724199    28809144    60
alt_allele.5    28781783    1   28781783    +   chr13   113566686   72847648    101634483   28724377    28815735    60
alt_allele.6    28785067    1   28785067    +   chr13   113566686   72847648    101634483   28730621    28815998    60
alt_allele.7    28781653    1   28781653    +   chr13   113566686   72847648    101634483   28727304    28814033    60
alt_allele.8    28786875    1   28786875    +   chr13   113566686   72847648    101634483   28736091    28811685    60
alt_allele.9    28783838    1   28783838    +   chr13   113566686   72847648    101634483   28728763    28815085    60
alt_allele.10   28785444    1   28785444    +   chr13   113566686   72847648    101634483   28729919    28815821    60
alt_allele.11   28791498    1   28791498    +   chr13   113566686   72847648    101634483   28737503    28816330    60
alt_allele.12   28769034    1   28769034    +   chr13   113566686   72847648    101634483   28724551    28806226    60
alt_allele.13   28762813    1   28762813    +   chr13   113566686   72847648    101634483   28713289    28810449    60
alt_allele.14   28863551    1   28863551    +   chr13   113566686   72847648    101634483   28732859    28893410    60
alt_allele.15   28779771    1   28779771    +   chr13   113566686   72847648    101634483   28727122    28813498    60
alt_allele.16   28786090    1   28786090    +   chr13   113566686   72847648    101634483   28733495    28813540    60
alt_allele.17   28747969    1   28747969    +   chr13   113566686   72847648    101634483   28696149    28813859    60
alt_allele.18   28781828    1   28781828    +   chr13   113566686   72847648    101634483   28731890    28812438    60
alt_allele.19   28775901    1   28775901    +   chr13   113566686   72847648    101634483   28722923    28813487    60
alt_allele.20   28781342    1   28781342    +   chr13   113566686   72847648    101634483   28728968    28813841    60
alt_allele.21   28776894    1   28776894    +   chr13   113566686   72847648    101634483   28724942    28812698    60
alt_allele.22   28786983    1   28786983    +   chr13   113566686   72847648    101634483   28731341    28815896    60
alt_allele.23   28785536    1   28785536    +   chr13   113566686   72847648    101634483   28729816    28816808    60
alt_allele.24   28780467    1   28780467    +   chr13   113566686   72847648    101634483   28726272    28813855    60
alt_allele.25   28775716    1   28775716    +   chr13   113566686   72847648    101634483   28728814    28808355    60
alt_allele.26   28775338    1   28775338    +   chr13   113566686   72847648    101634483   28719564    28814298    60
alt_allele.27   28779782    1   28779782    +   chr13   113566686   72847648    101634483   28724134    28814814    60
alt_allele.28   28787742    1   28787742    +   chr13   113566686   72847648    101634483   28731157    28816870    60
alt_allele.29   28773218    1   28773218    +   chr13   113566686   72847648    101634483   28709266    28823654    60
alt_allele.30   28776311    1   28776311    +   chr13   113566686   72847648    101634483   28724667    28811053    60
alt_allele.31   28782570    1   28782570    +   chr13   113566686   72847648    101634483   28735326    28808466    60
alt_allele.32   28775617    1   28775617    +   chr13   113566686   72847648    101634483   28719443    28816078    60
alt_allele.33   28779328    1   28779328    +   chr13   113566686   72847648    101634483   28727660    28813250    60
alt_allele.34   28767089    1   28767089    +   chr13   113566686   72847648    101634483   28722414    28806142    60
alt_allele.35   28777767    1   28777767    +   chr13   113566686   72847648    101634483   28729586    28808926    60
alt_allele.36   28767276    1   28767276    +   chr13   113566686   72847648    101634483   28714088    28815213    60
alt_allele.37   28776527    1   28776527    +   chr13   113566686   72847648    101634483   28726793    28809746    60
alt_allele.38   28781219    1   28781219    +   chr13   113566686   72847648    101634483   28728370    28811829    60
alt_allele.39   28857007    1   28857007    +   chr13   113566686   72847648    101634483   28719018    28898203    60
alt_allele.40   28774567    1   28774567    +   chr13   113566686   72847648    101634483   28722972    28811941    60
alt_allele.41   28793748    1   28793748    +   chr13   113566686   72847648    101634483   28731981    28821469    60
alt_allele.42   28787854    1   28787854    +   chr13   113566686   72847648    101634483   28737044    28812824    60
alt_allele.43   28776767    1   28776767    +   chr13   113566686   72847648    101634483   28725788    28810700    60
alt_allele.44   28781969    1   28781969    +   chr13   113566686   72847648    101634483   28727100    28815147    60
alt_allele.45   28776085    1   28776085    +   chr13   113566686   72847648    101634483   28722839    28812953    60
alt_allele.46   28771838    1   28771838    +   chr13   113566686   72847648    101634483   28725965    28806963    60
alt_allele.47   28764459    1   28764459    +   chr13   113566686   72847648    101634483   28712803    28811349    60
alt_allele.48   28778129    1   28778129    +   chr13   113566686   72847648    101634483   28733236    28806797    60
alt_allele.49   28772800    1   28772800    +   chr13   113566686   72847648    101634483   28726128    28807854    60
alt_allele.50   28777441    1   28777441    +   chr13   113566686   72847648    101634483   28724574    28811689    60
alt_allele.51   28778561    1   28778561    +   chr13   113566686   72847648    101634483   28721830    28816024    60
alt_allele.52   28774594    1   28774594    +   chr13   113566686   72847648    101634483   28728758    28807148    60
alt_allele.53   28779299    1   28779299    +   chr13   113566686   72847648    101634483   28726324    28814334    60
alt_allele.54   28779005    1   28779005    +   chr13   113566686   72847648    101634483   28714684    28825512    60
alt_allele.55   28774398    1   28774398    +   chr13   113566686   72847648    101634483   28721435    28813909    60
alt_allele.56   28782799    1   28782799    +   chr13   113566686   72847648    101634483   28729076    28813409    60
alt_allele.57   28779779    1   28779779    +   chr13   113566686   72847648    101634483   28726687    28812691    60
alt_allele.58   17672661    1   17672635    +   chr13   113566686   72847648    90523622    17639020    17694136    60
  1. Do you think this is really a giant SV with very high prevalence or there is something wrong with the sequence alignment / pangenome construction? Will this issue affect the usage of the graph in downstream analysis, e.g., vg giraffe, vg call. I can share the sequences of the ref and alt alleles if you think they are helpful for debug.

Regarding bcftools, I just tested bcftools stats (1.18) on the VCFs. When using raw.vcf.gz, it will raise Command terminated by signal 11, while using the vcf processed by vcfbub can successfully output statistics.

glennhickey commented 1 week ago

Hi, thanks for the detailed follow-up. I agree that it looks like something is going wrong both in the graph (giant bubble) and in vg deconstruct (invalid entry).

This does seem to be in a region that is more complex in CHM13.

This is presumably due to better representation in this complex region in the hprc assemblies and chm13 than grch38, which allows more alignment and complex graph structures.

But these visualizations are on the "full" graph. After the default minigraph-cactus clipping, the unaligned bits seem to disappear. Note that chr13.vg is from here

vg chunk -x chr13.vg -S chr13.snarls -p CHM13#chr13:70000000-110000000 > chr13.region.vg
chr13.region.vg -b 10000000000 -t 64 > chr13.region.depth
cat chr13.region.depth | sort -nk 4 | head
HG00621#2#JAHBCC010000153.1#0   578 528957  88.3037 5.04057
HG01109#2#JAHEOZ010000195.1#0   2   526627  88.3183 6.08767
HG02145#1#JAHKSG010000325.1#0   189 526695  88.4065 4.8771
HG02572#2#JAHAOV010000392.1#0   1145874 1735671 88.4238 5.71329
NA20129#2#JAHEPD010000302.1#0   1   398903  88.4605 5.47383
HG02723#1#JAHEOU010000278.1#0   1   1513072 88.4677 5.7298
HG01243#2#JAHEOX010000207.1#0   1   525965  88.4713 4.37145
HG03486#2#JAHEOP010000252.1#0   2   525216  88.5117 4.31493
HG02630#1#JAHAOQ010000251.1#0   1   522955  88.5497 4.28433
HG03486#1#JAHEOQ010000218.1#0   1   525144  88.5553 3.76486

In other words there is no path going through this region that's aligned to fewer than 88 other paths.

Furthermore, it looks like most of the region aligns fine between grch38 and chm13

cat chr13.region.depth | grep GRC | awk '{sum += $3 - $2} END {print sum}'
39941468

cat chr13.region.depth | grep CHM | awk '{sum += $3 - $2} END {print sum}'
40000166

Perhaps you can use odgi viz and / or vg depth in this way to figure out which haplotypes are forming the bubble in your graph?

About bcftools stats : thanks for checking. I will add this to the pipeline to make sure output VCFs are valid.

Han-Cao commented 1 week ago

Thanks for the suggestions and help! I did similar analysis on my data.

There are some contigs in this region are not covered by many haplotypes (CHM13 and GRCh38 are listed at the end for reference). Here, I normalized the "depth" to (depth - N_haplotype) / N_haplotype. For HPRC1.1, the normalization equals to (depth - 88) / 88. Is there any method to check if these paths formed the giant bubble in the graph?

contig start end length (depth-N_hap)/N_hap
TAD472#2#h2tg000455l#0 1 370883 370882 -8.7%
GA23228#2#h2tg000324l#0 1 52904 52903 -2.8%
TAD573#2#h2tg000480l#0 1 52268 52267 -1.7%
GA22234#1#h1tg000235l#0 2 154518 154516 -1.2%
GA23207#1#h1tg000288l#0 1 151833 151832 -1.0%
CHM13#0#chr13 69999862 1.1E+08 40000154 -0.1%
GRCh38#0#chr13 1 24513716 24513715 -0.1%
GRCh38#0#chr13 52366383 67793650 15427267 0.1%

I also calculate the number of bases aligned to this region for all haplotypes by cat chr13.region.depth | grep xxx | awk '{sum += $3 - $2} END {print sum}'. However, the distribution looks normal, and I didn't find obvious outliers in this region.

image

glennhickey commented 1 week ago

Huh, I would have thought there'd be one that stood out with depth. Another idea is to run vg chunk -S (example in my previous message), convert the output to GFA (vg convert -fW), and view the resulting graph in BandageNG. This should show you your bubble. It will allow you to click on nodes in the bubble, which you can use with vg find -n | vg view - to see which path they are on. This should at least tell you once and for all the paths that are going through each side of your bubble.

As an aside, this patch #1421 resolves a related error. It seems to be different from what you're seeing, but similar enough that it may be worth trying out all the same. I will make a Cactus release for it soon.

Han-Cao commented 6 days ago

Thank you so much for the great suggestion! It seems one haplotype causes this issue.

I generated the gfa using vg chunk -x chr13.vg -S chr13.snarls -p CHM13#0#chr13:70000000-102000000 -t 40, vg convert -t 40 -fW, the visualization at the error site is: image

There is a giant loop in this region due to the link between 12942158 and 13476225 (in red). The position of the labeled nodes is listed below. The distance between these 2 nodes is ~28.7Mb, which is the size of the giant allele in VCF. And their position on chr13 also closed to the variant in VCF.

Segment CHM13 pos
12942157 CHM13#0#chr13[69999861] 2847758
12942158 NA NA
13476225 CHM13#0#chr13[69999861] 31634372
13476226 CHM13#0#chr13[69999861] 31634622

I then generated a smaller gfa around 12942158 by vg find -n 12942158 -c 10 -x chr13.error_region.xg | vg view - to look into this site. I found that only GA22163#1#h1tg000063l harbors the link >12942158<13476225. Moreover, most haplotypes have 13476225 while only GA22163#1#h1tg000063l has 12942158.

GA22163 1   h1tg000063l 13640679    13642707    >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942153<12942155>12942156>12942157>12942158<13476225<13476223<13476222<13476220<13476219<13476217<13476216<13476214<13476213<13476211

I used minigraph to this sample to CHM13. It seems there is a large inversion (72Mb - 101Mb on chr13) at this region, but this inversion is not fully covered by 1 contig: h1tg000063l has 56Mb-72Mb-101Mb-99Mb, and h1tg000029l has 99Mb-72Mb-101Mb-113Mb

h1tg000063l 18848877    301 16487557    +   chr13   113566686   56364856    72847656    16445115    16510692    60
h1tg000063l 18848877    16487576    18848557    -   chr13   113566686   99272532    101634483   2357061 2363431 60
h1tg000029l 37440300    11036220    37440275    +   chr13   113566686   72847673    99261873    26359869    26435112    60  
h1tg000029l 37440300    2274    11007249    -   chr13   113566686   101663468   112675390   10961881    11041557    60  
h1tg000029l 37440300    11007269    11036192    +   chr13   113566686   101634494   101663431   28895   28942   60

image

I also found that some of the other haplotypes (including CHM13) are linked by >12942172>13476211 (in blue). But I cannot see the link in BandageNG. Interestingly, this can only be found in CHM13 but not GRCh38. I guess this is why only the CHM13-based VCF has error.

S   12942172    CGTGTTTCTGATTGCCTTTTAGTGGGCATGTGCTTAGACCTCAAAACTCACTCTATGTGCAGAGCTTCCTGTAGATTGGGATCTCCTTCACAGAAATGCTGTGTCTCTCTGCGGATGCAGCACACAGGAAGAAGTCACCCACTCGCTAC
S   13476211    T
W   CHM13   0   chr13   69999861    70002649    >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942145>12942146>12942147>12942148>12942150>12942153>12942154>12942156>12942157>12942159>12942160>12942162>12942163>12942165>12942167>12942168>12942170>12942172>13476211>13476213>13476214>13476216>13476217>13476219>13476220>13476222>13476223>13476225>13476226>13476227>13476229>13476230>13476232>13476233>13476235>13476236>13476238
W   GRCh38  0   chr13   0   1763    >13476211>13476213>13476214>13476216>13476217>13476219>13476220>13476222>13476223>13476225>13476226>13476227>13476229>13476230>13476232>13476233>13476235>13476236>13476238
W   GRCh38  0   chr13   52366382    52367407    >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942145>12942146>12942147>12942148>12942150>12942153>12942154>12942156>12942157>12942159>12942160>12942162>12942163>12942165>12942167>12942168>12942170>12942172

My questions are:

  1. Can I fix this issue by manually split the affected contigs into 2 parts (like what you did for inter-chr misjoin in HPRC)?
  2. Do you think >12942172>13476211 (blue nodes in the plot) is also an issue need to be fixed?

Below is the gfa around 12942158, including CHM13, GRCh38, and the sample potentially with issue for your reference.

Thank you so much for your kind help and patience!

H   VN:Z:1.1    RS:Z:GRCh38 CHM13
S   13476221    A
S   12942131    T
S   13476234    T
S   12942166    AACATCCCAGAGCTGCAGGTCAATGTCAAAGAGTCCCCAGACAGCCAAGGGTTGTGTCTAAGTCAGCATGGCTGAACCAAGTTCAAATCAACTGGATGAGCAGATAATCTCAAGTCTTGGAACTGATTAAAGGAATCCTTGTTTCTAGTGCCCCCAAAAAGGAAAAAAAAAAAAAAAAAAAAAACCTTTCCAGAGGAAGATAATGTTATCTTTTCCCTCAAAATATTTTTCAGATAACTTCATCACTTGTGTCTTCAGTGAGTAACATACTATGACAGACTACTGGAATATTTTCAGTAATTATAGCAGAATGGGCTGGGGAGGAGTATGGAGAAGCATACACCAGCTTTAAAGCCTTCCTCTTAGACGTGAAACACATCATTTCCACTCACACTACATTGCCTAAGGTAAATCTCATAACTGCACTTGAGCTTGATAGAGTTAGGAAATGCAATTTCTCCAATAAAATAGAAACCAGAAACGGATAAGCAGTCATGCAATTTTTCAGGGCCTGCCAAATATTTATTTTCCTCTTCTTTCTCAGTGCAAATCACACTCAACCTCTCCCCAAAGGACATATCCAAAATTTCCAACCAATCACTACTCTCAAAGTTCAGGATCTCCAAGTGACGCACAACAGCCTCTACATTGGGTCTAGATCTGACTTTTGATCTGGAGATTTGATAACTAGAGACAAGGTATCGGCCCTTTACACTCCAAAAAGGCAATGGGAAGGGATAACACCTGTACACTATCCTCTCCTGAAAGAGCAGAAATAGGAGAAGTCACTAGTAAATAACAATTCCAAAGTTACGCAGAGCATATGTGCTGAGGGGCTCGTATTCAGGGTGAGATGTGGTCTCCTCTCTGACATGACCTTCCCAGCCCATTGCTCTCTGTGCCCCTTTGCTCAGCTCTCTGAGGGGTCCTCCTTTTGTCCATTATCGTCCTTAACCACACCTTGAGAGAGCATTGAAGAAAACGCCTTCCTTGAGAGCTGAGCAGCTTTCTTAGCCAGCTTCCTTCTAGCAGAAGCATGGGTGCCTATGTGGCTTTTTAAGTACCAGCTACAGCCTCTTTTAGTCCAAGTGATGGCCCTTGGCACAACAAAATTCTCTCAAAAATCTAGATAGTTTATTACCTATTTTATTCCAGGCAATTTTCTTGTTCCATGTCCCACACCTAACATACATGCATGTATATTGTGTTTATGAGACAGAGATAGGAGAGTTACATGTTTGTCTGTTTTTCTTCAGCCATTTTATCCAACTAAAAGAATTTACTGGGAAAACCTTAATCTCTTCAGAAGTTTTAACAATGAATATGGCAACCATGCTCTTATTTTATCCTCATCTTAACGCTAAGTTTTGATTAGCCTCCACATTTATTTGTATTTTATTTATTTAAAACTCCCCCCTTTTTTCTTAGTCTAGCTGAGGGTTTCTCTATTTTTTCAAAAAACCAACGCTTATTTTTGTTGATTTTTTCTATAGATTTTCTAGTAGAAATAATTCTGCCTGATCTTTATTATTTCCTTCCTTCTGCTAACTTTGGATTTAGTCTGCTCTTCTTTTCCTAGTTCCTCGAGATGTAAAGTTAGGTTGTATTCCAAAAATTCCTCTAGGTTTCTGTGTCAGTGGATGATGGAGCAAAAAAAAGAAAAAAAAATTCCTCAGTTTTATCTTTTAATAATTAGGTTTAAGAAACAGTTCTCTTTTCAATCCTGCAAGTACTTGAATTTACATTACCGCTATTCTTTTCCATTCTTGCTTCTTGCTTGCAAATTAGTCAGTTTTGTTCTGATTTTGTCTCTTTTAAGATCTCTGTACCCACTTCACAGATGAAGAAAAATAACTATCCAGGCTGGGCACAGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTCAAGGCGGGCAGATCACTTGAGGTCAGGAGTTGGAGACTGGCCTAGCCAACATGGGGAGACCCCTGTCTCTGCTAAAAATAAAAATAAATTAGCTAGGCGAGGTGACGGGCACCTATAATCCCAGCTACTCAGAAGGCTGAGGCATGAGAATCGCTTGAACCCGGAAGGTAGAAGTTGCAGTGAGCTGAGATTGCACCACTCCATTCCAGCGTGGGCGACAGAGCGAGACTCCATCTCAAAAAAAAAAAAAATGAAACAAAAAAGGCCAGGCCCGGTGGCTCACCCCTGTAATCTCAGCACTCTGGGAGGCTGAGGCAGGTGGATCACCTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCATCTCTACTGAAAATACAAAAATTAGCCAAGTGTGGTGGCGCATGCCTGTAATCCCAGCTACTCCGGAGGCTGAGGCAGGAGAATCGCTTGAACCCCGGAGGCAGAGGCTGCAGTGAGCTGAGATTGCATCAGCACACTCCAGCACTCCAGCCTGGGTGATAGAGTGAGACCCTGACTCAAAAAAAAAAAAAAAGAAAAGAAAAGAAAAGAAAGAAAGAAAAAGAACTCTCCAGCCACTCCAGAAGGCCCCCAGAATGTCCAGGACCAACCACAGCTCTCTACCTCCATGAGGAATTACTCTCACAGCTTTTAGAGGACTCATTTCCTTACATTATTACTGTATTGTTTTGTTATCCTAATTTACATCCCTAGATACTGTAGTTTAGCCTTCTCATTTCTTAAAATTGATACTGTAGCAGGACGAGCCGCAGACCAAACCTCTCAGACACCTAGTTGTAGAAGGAAGGGCTTTATTCAGCTGGGAGCATCGGCAAGTTACTGCCTTAAAATCCGAGCTTCCTGAATGCACAATTTCTGTCCCTTTTAAGGGCTCACAACACTAAAGATTTCACATGAAAGGGTCATGATTGATTTGAGCAAGCAGGTGGTACATGACAGAGGCTGCATGCACCAGTGGTCAGAGAGAACAAGAACAGGGCAGGGAGTTTCACAGTGTTCTTCTATATAATGTCTGGAATCTATGAATAACACTGGTTTCTAAGTTATGAGTTGATTTTTAACTACTGGGTTTAGGCCAGGCATGCTCAGACCTGGTTTCTGGCCTGGCGCCAGGCTGCCTGTCTTTGGTTTTACTTCCTTGCTGTTTTTTCTTAAAACAGGTACTGAGTATAAAACAATATGAGAGGGTCTCTGTCTTCCTTCAATACCTTTAACTTCCTTTCAATCTATCTTTTTTCTTCTCTCCTAACTCATTTGTTGGAAGAGCCTACACTTGTGAGCTGTAGTTTCCCGTTGCTTGGCTTTTGCTAAGTGAGTTCTCATGCTTCAGGGCAACATGTTGCTTTTTCCTGCATATTGGCAGCTGGATCCAGAGTCTTGATCAGAATCAATTTGATCCCTTTGGCAAGGTTAGAGTGGTGGTGTGTTCATTCATCATGAAGCACATTATGTCTTTCTCTTTCTTGATGTTAGCAGCCTTTGATGCTCAATGCTCAGGACCCTTTAATTCACAGGAGTTTACAAATGGTAATATTTTAATTCTATCATTTTGTTTTCACTTATTGGCTGTAATAATTTTGTAAAACGACACTTCTGCTCATCTGCTACTTGATTACCCAGGGATAGCGGCTATATAGAAAAGGCAGGATACATAGATTGATATTCTCTTGTTGATAGTCATTACAGACTCTTAAGTTTAAACATATTTGATAGTTTTTAACCCATTGAAATCTGTATCTTTATTAAGGTTCAAATTATCCTACCCTTGGTGAATACAAGCCTCTTTGAATTGACTCTTGTGTCCTTTTGACACAAACCTAGTAGCTTTTTATAGCTTCTATGCTATCTGGTATGTCAAAATGTCCCAGTTTCATTTTGTACACTTCCTGTCCACAATCAATGTTCCAAGAACCCCTCTTTTTAAAAAATGGAAATGATCTTTGAGACCATAGTCTGGATGCTGTGAATGTGTGTTCATAGTGAGCTGGTCATTGTTTCTTGACCTCTTTATGGATAGAGCTAGAAATTGCACACACACACACACACACACACACACTCCTTTTTCTTTTCCTAAACATGTCTTTCCAGCATGCCCCTTCTTTTCTCTTTGTTATAGCATGGGATTCATCACAATATGTTTTTATTGCTTCTTTTTATTTTCCCAAATTTTAGAAGAAAGATTGGTTAGCTAGCTCTACTAACTTTTTTTCAAGCTCACAAAATTTTCTCTTCTCTTGTTTTTATGAGATTTTAAAAATAGCACCTTGTTTTGGGAGTTTTTCTTACTTGTTTTCCCACCTCACTATAATCTTGACCTATTTTTTCCTGTGGGATTTTTTTCTATATCCTAATTTCTGACTTGAAATTGCACTGTCAAGAAAATAATTTACACGTAGTATGATTTTATAAGCAAAGATGCTCATTGCTGTGTTTTTAACTAGAAAAAGTTAAAATGTTTTTGTTATTATTTCTAAAAAGGGGGAAATTGTTAATTTGTGTTCAAAACCAGAGGAATAATTAGACAAATCATTGCTGTTATTAAAGACTTACCTGGAAAAAATTTGGAAGTATTTCTCCCTGTGTCAAATCTTAAAGACATGGCTCATTGAATCAAACAAGGAAACCACCCACATTGATTTATAAACATAAATTAGTATTACTGAATGGTAGTTTGGGAGAAAAATATTTTCTAATAAAGGATTAATGCTGCAAGTTACCTAGTACCTAGTTCTTCCTTACCTTCTGCTATTGGTATGGAAATATTAGATCTCCTTGAAATAGACTGGTTGCTATTCATCTTAACCTTTAATATATCAGACCCAAATTTCTTTATACTGACAGTTCAACAAGTGAAATTAAATAATTAAATTGACTTAATCTGATTAACATTTTTTTCTTTCTACATATCTTCAATAGCTTTACTTTAATGAACTCTCTATAGAAGCCCTCCAACATTTCTGGTGTCAGACATTACGGACAACTGAATATATCATGAGTAGGTCTCATAGGGAGAATGTGGAACCTGTTCTTCCCTTCTTTCAGGCTTTAGGGAAAGAGTGAAAAATAAGAATAAATTCTGACATACTAAACAAGACAGCATATAAAATAGAATATTCACAAATCATTTGGTTACCTTATTTTTTAAACCTCATATAGCCAAGACAAATTTTAAACTAAGAAGTACTAGTGGGGATTTGCAGGGATCGTGGCGGATGGGAGGCAGGACTAGATTGCAGCTCCGGACAGAGCAGCATGTGGAGGCTTGCAATGTGAATTTTAGCTCCAGATCAACTGCAGGAACAAACTAGCAATCCTGAGAGGAGCCACAGACCCTCTGAAGGAAGCTGACTGCTCCTGCAGGACCTCGAAGACACCCCAGATACTATGAGTAACCCCAACTACAGAAGTGGGAAAGGGAGGCCCTCCTCTCCTGAAAACACACCCCCACTGGAGAAGCTGAAGGTCTGTTTGTGGGAGAAGTTTCTGACTTTACCTGGAGTTGAGTCAAGTTAGAGAGCCGACCCCAGCGAAATACGGGAGTAGAGGAAGCAGCAGAAAGGCCCTGGGAGCTCGCTGGGTCCCCAAGCAGCCCATTCCTGCCTGGCACCACAGGGATCCATAGGGAGGGTGGCCAGAGGAACAGGGGGTAAAACTCCACAGGGGGTTTTTAGCTGGCCTTTGTAACAATTTGAACTGGGCAAGAAGCCTCCTGGCCAGAACTTGGGGGAGAGTGCGAA
S   12942134    A
S   12942148    T
S   13476238    TTGAGGTCAGGAGTTCGAGACCAACCTTACCAACATGATGAAACCCCATCTCCACTAAAAAAATACAAAATTAGCCAGGCATGATGGCAGGCACCTGTAATCTCAGCTACTTGGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGCAAAGGTTACAGTGAGCCGAGATCACGCCATTGCACTCCA
S   13476226    AAAGCATAGCCAAAGTGTTAAATGGGCAGCATTTTATATTATA
S   12942139    ATATTCACAACCAATACAGTGATAGGATCTTAAGGTTGTAGATGCATACTGTGAAGCCAGTCTCCCCTGACCCCTGCACATGATTCCTAAATCAGGTTACAATAAAATTGATTTGTAAACTATTTCCATAGTCCACTAATGGCAAATCTGGCATCGTTTGCTTTTTAGTGCTGTCTTTAAAGATACATAGGAGCAAGACACCTCTATGAAATGAGGTGCTACAAGAAGGAAATTTGGAGCTTGAAGAAAAGAAAGGTCTCCAGAAACTATAGCACGGACAAGACTGAAATTGTTTTAAAAATAGCTAATTCTTAAAAACTGAATCTAAACTCTGCTTGTGAAATGCTCCTTGGAGCAGTATAACAAGTAAAAGTCCAGTAAGACCACAAAGGTTCCAGCGAACACCACCAGGAAACTCC
S   13476212    G
S   12942165    C
S   13476222    TGTAAAACTTTTCCACCATTTAACAAGTGAACATTAGAAAAAAATCCAAGATTATTAGATTGTCTCATTTCTTACAACAGTTCTACAAATTGCAAATTGGTGATCCATAACATTTGTGCATATATACTAAATGATTTTAATAGTAATATCCATGACACTTTTGATAGTTTGCTTCAAAAACTGGAGTTACAGA
S   12942146    T
S   12942141    AGCTTGTTGAACCTTGGCAGCTCCTGTGAAGGAGATCCCAGTCTACAGGAAGCTCTGCACATAGAATGAGACGGGCTTCACTGCATCTCTTACTCTTCTTGGGGTCCACTGTGGCTTCTAAACTAAAGGTGCACCTGAAGTAAGTTCTATGTGTATTTTGATAGGGATTAATTTAAAATAAACTCCAGAAGCATTAAAACAACCAGCACAGAAAGTTTGTCAATATCCCCGTTAATATCTGATTTTAAAGATTCAATTTCTTTTCTATGTTTACAGGCCAATTTTTAAA
S   12942143    G
S   13476224    C
S   13476228    T
S   12942149    G
S   12942147    T
S   12942151    C
S   12942168    T
S   12942142    T
S   13476215    G
S   12942162    T
S   13476232    GGCCAAGAATATAAGCTCAGATTCTGTTAAAAATATTTCTCAAAGGCTGGGC
S   12942135    G
S   13476237    C
S   13476220    G
S   12942164    GA
S   12942158    TCCTACCTTTGGCAGGAAAATATTTTG
S   13476214    T
S   12942153    T
S   13476236    A
S   12942171    ACTTCACAGGAAGGGGAAGAACTAAAGCCCTTTTCTTTCGCAGCTGGAAGGAGGAAAGCCTCAGGAAAGTTTTCAAGCTCTTCTTTGCCCTCCACCTGGAAGCAGACTGTTGGGGGAGGCATGGTGGGAGTGAGACCAGCTCTTATGTCTTTGCATGGGAGCTGGGTGAGGCCTGTGACTGCCAGCTTTCCCCTATTTCCCTGACAATGTGCATGACTCAGCAGAGGCAGCCATAATCCTTCTGGGTACACAACTCCAGTGGCCTGGGAATCTCACCCCCATCTCCCACAGCAGCTGCAGCAAGACCCACCCAAGGATAGTCTGAGCTCAGACACGCCTAGCCCTACCCCCACCTGATGGTCCTTCACTATCCACCCTGGTAGCAGAAGACAAAGGGCATTTAATCTTGGGAGTTCTAGGGCCCTGCCCACCATCTGTGCCTCTCCACACTACTATAGCTGATGCTCTCTAGAAAGCACCACCTCCTGGCAGGAGGCCAATCGGCACAAAAATAGAGCATTCAACCACCAAAGCTAAGGACCCTCACAGAGTCCATTGTATCCTCTGCCACCTACACCAGAAGAGGCACTGGTATCCACGGCTGAGAGACCCATAGACGGTTCAAATCACAGGACTCTGTGCAGACAACCCCCAGTACCAGCCCAGAGCCGGGTAGACTCGCTGGGTGGTTAGACCCAGAAAAGAGACAACAATTATTGCAATTCGGCTGACAGGAAGCCATATCCATAGGAAAAGGAGGAGAGTACTACATCAAGGGAACATTCTGTGGGGGAAAAAACAAAACAAAACCTGAACAACAGCCTTCAGCCCTAGAACTTCCCTCTGACAGAGTCTACCCAAATGAGAAGGAACCAGAAAACCAACCCTTGTTATATGACAAAACAAAGCTCTTGAACACCCCCAAAAATCACACTAGTTCACCAGCAATGGATCCAAACCAAGATGAATTTCCTGATTTACCTGAAAAACAATTCAGGAGGTTAGTTATTAAGCTAATCAGGGAGGCACCAGAGAAAGGCAAAGCCCAATGCAAGGAAATCCAAAAAATGATACAAGAAGTGAAGGGAGAAATATTTGAGGAAATAGCTAATTTAAAGAAAAAAACAATCAAAACTTCAGGAAACATTGGACACACTTACAGAAATGCAAAATGCTCTAGAAACTGTCAGCAATAGAATTGAACAAGTAGAAGAAAGAAATTCAGAGCTCGAAGACAAGGACTTTGAATTAACCTAATCCAACAAAGACAGAGAAAAAAGAATAAGAAAATATGAACAAAGCCTCCAAGAAGTCTGGGATTATGTTAAATGACCAAACCTAAGAACAATCGGTGCTCCTGAGGAAGAAGGCAATTCTAAAAGCTTGGAAAACATATTTGGTGAATAAACGAGGAAAATTTCCCCAGCCTTGTTAGAGACCTAGACATGCAAATAAAAGAAGCACAAAGAACACCTGGGAAATACATCGCAAAAAGATCTTCGCCTAGGCACATTGTTATCAGGTTATCCAAAGTTAAAACGAAGGAAGGAATCTTAAGAGTTGTGAAACAGAAACACCAGGTAACCTATAAAGGAAAACTAATCAGATTAATAGCAGATCTCTCAGCAGAAACCCTACAAGCTGGAAGGGATTGGGCCCTATCTTCAGCCTCCTCAAACAAAACAATTATCAGCCAAGAATTCTGTATTCAGCAAAATTCAGTATCATGTATAAAGGAAAGACATAGTCGTTTTCAAACAAATACTGAGAGAAATAATTCACCACTATCAAGCCACCACTACAAGAACTGCTAAAAGGAGCTCTAAATCTTGAAACAAATCCTGGAAACACATCAAAACAGAAACTCTATAAAGCATAAATCACACAGAACCTATAAAACAAAACTAAAAGTTAAAAAGCAAAAACAAAAAACAAAAAAATCCAAAATTCCCAAACAACAAAGAGCATGATGAATGCAACCCTACCTCACATTTCAATAGTAACATTGATTGTAAATGGCCTAAGTGCTCCACTTAAAAGATGCAGAACCGGCTGGGTGCAGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTCAAGGTGGGTGGATCACCTGAGGTCAGGAGTTCGAGACTAGCCTGGTCCACATGGTGAAAACCTGTCTCTACTAAAAATATAGAAGTTAGCCAGGCATGGTGCAGGTACCTGTAATCCTAGCTACTTGGGAGGCTGAAGCAGGAGAATTGCTTGAACCCGGGAGGCAGAGGTTGCAGTGAGCCAAGACCGCACCATTGCACTCCAGTCTGGGCAACAAGAGCAAAACTCTGTCAAAAAAACAAACAAAAAACAACAACAACAACAACAAACAGATACAGAACCACAGTATGGATAAGAACTCACCAATCAACTATCTGCAGCCTTCAGGAGACTCACTTAACACATAAGGACTCACACGAACTTAAAGTAAAGGGGTGGAAAAAGGCATTTCATGCAAATGGACACCCAAAGTGTGCATGAGTAGCTATTCTTATATCAGACAAAACAAACTTTAAAGCAACAGCCATTAAAAGAGACAAAGAGGGACATTATATAATGGTAAAGGCCTTATCCAACAGGAAAATTTCACAATCCTAAACATATATACATCTAACACCGGAGCTCCCAAATTTATGAAACAATTACTAATGGACCTAAGAAATGAGACAGACAGCAACACAATAATAGTGGAGGACTTCAATTCTCCACTGACAGCAATAGACAGCTCATGAAGCCAGAAAGTCAACAAAGAAACAATGGATCTAACTATACCTTTGAACAAATGGACTTAACAGATATATATGGAGCATTTCATCCAACAACTGCAGAATACACATTCTATTCAACAAGGCATGGAACTCTCTCCAAGATAGACCATGTGATAGGCCACAAAACAAGTCTCAATAAATTTAAGAAAATTGAAATTATATCAAGCACTCTCTCAGACCAGAGTGGAAAAAAAACTAGAAATCAACTGCAAAAGGGACCTTTAAAACCATGCAAATACATGGAAATTAAATTACCTGGTCCTGAATGAGCATTGGGTCAAAAACGAAATCAAGATAGAAATAAAAAATTATTTGAACTGAAAGACAATAATGACACAGCCTATCAAAACCTCTGGGATACAGATAAGAGGAAAGTTCATAGCCCTAAAGGCCTACGTCAAAAAGTCTGAAAGAGCTCAAACAGACAATCTAAGGTTACACCTCAAGGAACTAGAGAAACAAGAACAAACCAAACCCAAACCCAACAGAAGAAAAAAATAACGAAGATCAGAGCAGAACTAAATGAAATTGAAACAAAAAAATACAAAAGATAAATTTAAAACCTGGTTCTTTGAAAAGATAAATAAAATTGATAGACCATTAGCAAGATTGACCAAGGCAAGAAGAGAGAAAATCCAAATAACCTCACTAAGAAACAAAACAGGAGTTACTACAACTGACACCACTGAAATACAAAAGATCATTCAAGGATACTATGAACACCTGTATGCGCATAAACTAGAAAACGTGGAAGAGATGGATAAATTCCTGGAAAAATACAACCCTCCTAGCTTAAATCAGGAAGAATGAGATACCCTGAACAGACCAATAACAAGCAGTTAGAGTGAAATGCTAATTTAAAAATTACCAATAACAAAAAAAGTCCAGGACCAGACAGATTCACAGCAGAATTCTATCAGAGATTCAAAGCAGAATTGGTACCAATTCTTTTGACACTATTCCACAAGATAGAGAAAGAGGGAACCCTCCCTAATTCATTCTATGAAGTCAGCATCACTCTACTACCAAAACCAGGAAAGGACACAACCAAAAAAGAAAACTAGAGACCAATATCCTTGATGAACATATATGCTAAAATCCTTAACAAAATACTAGCTAAGTGAATCCAACAACATATCAAAAAGATAATCCACCATAATCAAGTTGGTTTCATACCAGGGATGAAGGGATGGTTTAACATATACAAGTCAATAAATATGATACACCGTATAAACAGAATTAAAAAGAAAAATCACATGATTATCTCAATAAGATGCAGAAAAAGCATTCAACAAAATCCAGCATCCCCTTATGATTAAAACTCCCAGCAAAATTAGCATACAAGGGACATACCTTAATGTAATAAAAGCCATCTATGACAAACCACAGCCAACATAATACTGAATGGGGAAAAGTTGAAAGCATTGTCTCTAAGAATGGGAACAAGACAAGGATGCCCACT
S   12942152    G
S   12942169    C
S   13476219    TTAACAGCATTTGTTGAGGTTTTTTTCTATAAAAATTGATTTGACTTTTAAGCAAAAGTATATTAGTTTTAAAATTGTGTTTTCCAAGTTAAATTTACTAAATCTTGTGTCAATAGAATATTATTAATATTTAATTCAAAGAGTGTTGCATAAACTGAGTGATTATTCTAAATTTATTAATTTCAAAAATAATTCTTTGAATTTTCCTTAAAGATTTTGAAGCCAGTTTACATAATGTTGTTAACAATTAACATATTCTTTATGCTGATTTATGCTAAAATATATGAAATAATTATTGCTCATTTGTATGATGAAAAGTTTTAATTTCAACATTTCAACATATATTTTTGATTTTGTCTAGTTAGGTCAGAAATTAGCTTTTCCTTTCCTTGGAACTTCAAATTTAGCTCATTAATGTGCAGTATAATATTAGTGAGAAAACAAGTTACTGTCATTTTTTGTCTTTGACTATTAAACATTTAGCAAGCATTTCTTTTGCTTTAAGAAAATACAGAATTAGATTCAAAAGTATAGTAAATCC
S   13476218    C
S   12942167    T
S   12942150    T
S   12942133    ACATAACAGCCTGAATTAGTTACATAATAAAAGTGAGAACAAATTTATAAA
S   13476225    ATTTTCAATATGTGTCATGCAGTAGAATGAAGCAATTAGAGAAATGTCAGGCCCACATTAAATATCAGTAAATCTTTATCATTGACCTAACATAGCTGGAACATTGACAACTGTGATAGAACAAATTTTTTATATCTATCTGAAATCATTCTTTGAAAAATATGAAAGTTCCAAAAATATTAATGACAGGAGTTAAATTTTTTAGAATGCACATTGAAATTTCTTTAAAAAAACTGAAATATTTTGAGAC
S   12942137    T
S   13476227    C
S   13476223    T
S   12942159    T
S   13476216    ATTAATTCATAAAATTACAGTTGAAGAAGCAGATTCATATACCCAAGATGTTCAAAATACACTTAAACACTTTCCAGTAACTTTAACTACTAAGAATCATTTTCCATTAATAATCTCACTCAC
S   12942144    CTC
S   12942161    TG
S   12942140    AGTGATAACCCCA
S   12942138    C
S   12942172    CGTGTTTCTGATTGCCTTTTAGTGGGCATGTGCTTAGACCTCAAAACTCACTCTATGTGCAGAGCTTCCTGTAGATTGGGATCTCCTTCACAGAAATGCTGTGTCTCTCTGCGGATGCAGCACACAGGAAGAAGTCACCCACTCGCTAC
S   12942154    T
S   12942156    T
S   12942145    T
S   13476229    AACTCATCTAAAGCTTAACTATAAACCTGAAATGTTTAAAATTTTGAATCAATTAATCATTGCTATTGTTAGAAACATCTTGTATTCTCCAGGTGATTTTTTGGTAGCTTAATTGAAGATCTTTCCACTTTTGTAAAATAATTCTTTGAGTATTTTTCTTTTAACTTTCTAACCAAATTTACATAAGTGAAATAGTGATTTATTTTGCCATCTCTTCATCACAATATTGTTTTGTTTGGGTATGTGTGTTCAAAAGTATAGGCTTCTTTATAGG
S   13476233    G
S   12942157    TTTTTTTTTTTTTTTTTGAAAAGCAGGTT
S   13476217    A
S   12942160    CC
S   13476230    G
S   12942136    CAATTTATGTTTGTTTATATTCTGGATGTAATTCACACTTA
S   12942170    TGGCATGCAG
S   12942155    G
S   13476211    T
S   13476235    CAGTAGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCTGAGGTGAGTTGATCA
S   12942132    G
S   12942163    AC
S   13476231    A
S   13476213    TTAAAATGCTAACTTTTTATCTGAACTATTTGATTTA
W   CHM13   0   chr13   69999861    70002649    >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942145>12942146>12942147>12942148>12942150>12942153>12942154>12942156>12942157>12942159>12942160>12942162>12942163>12942165>12942167>12942168>12942170>12942172>13476211>13476213>13476214>13476216>13476217>13476219>13476220>13476222>13476223>13476225>13476226>13476227>13476229>13476230>13476232>13476233>13476235>13476236>13476238
W   GRCh38  0   chr13   0   1763    >13476211>13476213>13476214>13476216>13476217>13476219>13476220>13476222>13476223>13476225>13476226>13476227>13476229>13476230>13476232>13476233>13476235>13476236>13476238
W   GRCh38  0   chr13   52366382    52367407    >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942145>12942146>12942147>12942148>12942150>12942153>12942154>12942156>12942157>12942159>12942160>12942162>12942163>12942165>12942167>12942168>12942170>12942172
W   GA22163 1   h1tg000029l 11026228    11036363    >12942159>12942161>12942162>12942164>12942165>12942166>12942167>12942169>12942170>12942171>12942172
W   GA22163 1   h1tg000063l 13640679    13642707    >12942131>12942133>12942134>12942136>12942137>12942139>12942140>12942141>12942142>12942144>12942153<12942155>12942156>12942157>12942158<13476225<13476223<13476222<13476220<13476219<13476217<13476216<13476214<13476213<13476211
W   GA22163 2   h2tg000014l 0   1020    <12942172<12942170<12942168<12942167<12942165<12942163<12942162<12942160<12942159<12942157<12942156>12942155<12942153<12942144<12942142<12942141<12942140<12942139<12942137<12942136<12942134<12942133<12942131
W   GA22163 2   h2tg000028l 10682061    10683824    <13476238<13476236<13476235<13476233<13476232>13476231<13476229<13476227<13476226<13476225<13476223<13476222<13476220<13476219<13476217<13476216<13476214<13476213<13476211
L   13476221    +   13476222    +   0M
L   12942131    +   12942133    +   0M
L   13476234    -   13476235    +   0M
L   12942166    +   12942167    +   0M
L   12942134    +   12942136    +   0M
L   12942148    +   12942150    +   0M
L   12942148    +   12942152    -   0M
L   13476226    +   13476227    +   0M
L   13476226    +   13476228    +   0M
L   12942139    +   12942141    +   0M
L   12942139    +   12942140    +   0M
L   13476212    -   13476213    +   0M
L   12942165    +   12942166    +   0M
L   12942165    +   12942167    +   0M
L   13476222    +   13476223    +   0M
L   13476222    +   13476224    +   0M
L   12942146    +   12942147    +   0M
L   12942141    +   12942142    +   0M
L   12942141    +   12942143    -   0M
L   12942143    -   12942144    +   0M
L   13476224    +   13476225    +   0M
L   13476228    +   13476229    +   0M
L   12942149    -   12942150    +   0M
L   12942147    +   12942148    +   0M
L   12942147    +   12942149    -   0M
L   12942151    +   12942154    +   0M
L   12942168    +   12942170    +   0M
L   12942142    +   12942144    +   0M
L   13476215    -   13476216    +   0M
L   12942162    +   12942163    +   0M
L   12942162    +   12942164    +   0M
L   13476232    +   13476233    +   0M
L   13476232    +   13476234    -   0M
L   12942135    -   12942136    +   0M
L   13476237    -   13476238    +   0M
L   13476220    +   13476222    +   0M
L   12942164    +   12942165    +   0M
L   12942158    +   13476225    -   0M
L   13476214    +   13476216    +   0M
L   12942153    +   12942154    +   0M
L   12942153    +   12942155    -   0M
L   13476236    +   13476238    +   0M
L   12942171    +   12942172    +   0M
L   12942152    -   12942153    +   0M
L   12942169    +   12942170    +   0M
L   13476219    +   13476220    +   0M
L   13476219    +   13476221    +   0M
L   13476218    -   13476219    +   0M
L   12942167    +   12942168    +   0M
L   12942167    +   12942169    +   0M
L   12942150    +   12942151    +   0M
L   12942150    +   12942153    +   0M
L   12942133    +   12942134    +   0M
L   12942133    +   12942135    -   0M
L   13476225    +   13476226    +   0M
L   12942137    +   12942139    +   0M
L   13476227    +   13476229    +   0M
L   13476223    +   13476225    +   0M
L   12942159    +   12942160    +   0M
L   12942159    +   12942161    +   0M
L   13476216    +   13476217    +   0M
L   13476216    +   13476218    -   0M
L   12942144    +   12942157    +   0M
L   12942144    +   12942145    +   0M
L   12942144    +   12942146    +   0M
L   12942144    +   12942147    +   0M
L   12942144    +   12942148    +   0M
L   12942144    +   12942150    +   0M
L   12942144    +   12942153    +   0M
L   12942144    +   12942154    +   0M
L   12942144    +   12942156    +   0M
L   12942161    +   12942162    +   0M
L   12942140    +   12942141    +   0M
L   12942138    +   12942139    +   0M
L   12942154    +   12942156    +   0M
L   12942156    +   12942157    +   0M
L   12942145    +   12942146    +   0M
L   13476229    +   13476230    +   0M
L   13476229    +   13476231    -   0M
L   13476233    +   13476235    +   0M
L   12942157    +   12942158    +   0M
L   12942157    +   12942159    +   0M
L   13476217    +   13476219    +   0M
L   12942160    +   12942162    +   0M
L   13476230    +   13476232    +   0M
L   12942136    +   12942137    +   0M
L   12942136    +   12942138    +   0M
L   12942170    +   12942171    +   0M
L   12942170    +   12942172    +   0M
L   12942155    -   12942156    +   0M
L   13476211    +   13476213    +   0M
L   13476235    +   13476236    +   0M
L   13476235    +   13476237    -   0M
L   12942132    +   12942133    +   0M
L   12942163    +   12942165    +   0M
L   13476231    -   13476232    +   0M
L   13476213    +   13476214    +   0M
L   13476213    +   13476215    -   0M
glennhickey commented 3 days ago

Thanks for the detailed follow-up. I'm not sure about the blue nodes, but I definitely think it's fine for you to cut your input contigs at the inversion breakpoints as a way to patch this issue in your graph. It would probably be interesting to try to go back to your reads or sample to see if the inversion is an artifact or not. In either case, it is not being well-handled in the graph.

I still don't understand where the VCF error is coming from. If you are ever able to share the data I would be very interested. Over here, I made a little graph with CHM13 and HG002, but inverting a 30mb chunk in the latter. The VCF contains the 30mb inversion and seems perfectly valid. The inversion itself looks like a big loop (like yours) in Bandage, which isn't what I expect, but I need to figure out of that's an issue in the graph or just how the layout algorithm works.. .

Han-Cao commented 3 days ago

I aligned hifi reads to the haplotype with inversion. The alignments at the breakpoint suggest it is a real heterozygous inversion: reads fully aligned to this region is from the inversion, while reads align to both this contig and h1tg000029l is from the haplotype without inversion. image

For the error in VCF, what I observed is:

  1. The error record of the inversion has a lot of alternative alleles, with each allele in size of ~30mb
  2. All other variants within the 30mb region have INFO/LV=1.

I am not sure how vg deconstruct works but it seems that the presence of a large inversion in the graph causes it to generate 30mb alleles for all haplotypes. Other variants are then nested within these 30mb alleles. As more samples are added, the VCF record become longer. So, the VCF error may only occur when there are a lot of samples, likely due to the alleles are too long to store. I am wondering if you tried making a graph with CHM13, HG002 with 30mb inversion, and more samples without 30mb inversion. Is it expected for samples without the 30mb inversion also have 30mb alleles in the VCF?

I am sorry, but due to research ethics, I am unable to publicly share the individual level human genetic data.

Thank you again for your help.

glennhickey commented 1 day ago

OK, to resume

Here is a snapshot of the current master branch of vg:

http://public.gi.ucsc.edu/~hickey/vg-patch/vg.5bc8f8a55f682d826d5276ae502e4cf6743d68b3

It is patched to not write lines greater than 2Gi in length (you will get a warning and it will otherwise disappear). Using this, you should at least get a valid VCF (please confirm if you are able to test).

This version of vg also has a new option vg deconstruct -L that merges alleles with sufficient similarity. If you run that on your graph, using say -L 0.9, it will merge together alleles that are >= 90% identical. Presumably, this will vastly reduce the number of alleles at your big site, allowing it to be represented in your VCF. I'm still testing this option out, but I think it is the only sane way for dealing with megabase-scale variation at this point.

The next vg release is scheduled for July 1 and will contain these options...

Han-Cao commented 22 hours ago

Thank you so much! The latest vg successfully generate a valid VCF.

I generated 3 VCFs:

  1. vg 1.56.0: vg_156.vcf.gz (invalid VCF)
  2. patched vg: vg_patched.vcf.gz (valid VCF)
  3. patched vg with -L 0.9: vg_patched.cluster.vcf.gz (valid VCF)
    vg.1.56 deconstruct graph.gbz -P CHM13 -C -a -t 128 -O -r /graph.snarls | bgzip -c > vg_156.vcf.gz
    vg.patched deconstruct graph.gbz -P CHM13 -C -a -t 128 -O -r /graph.snarls | bgzip -c > vg_patched.vcf.gz
    vg.patched deconstruct graph.gbz -P CHM13 -C -a -t 128 -O -r /graph.snarls -L 0.9 | bgzip -c > vg_patched.cluster.vcf.gz

    I can see warnings only when generating vg_patched.vcf.gz. It raised a warning exactly for the error record I reported in this issue chr13:72847648 with ID=>12942157>13476226. It is also very interesting to see another 2 giant alleles in the VCF, which I didn't notice before.

    Warning [vg deconstruct]: Skipping variant at chr6:140509080 with ID=>74056097>74056464 because its line length of 4782792297 exceeds vg's limit of 2000000000
    Warning [vg deconstruct]: Skipping variant at chr13:72847648 with ID=>12942157>13476226 because its line length of 5981954706 exceeds vg's limit of 2000000000
    Warning [vg deconstruct]: Skipping variant at chr3:164429312 with ID=>59766868>60188399 because its line length of 9946836003 exceeds vg's limit of 2000000000

I compared the file size and number of record of the 3 VCFs. Compared to vg 1.56.0, the patched version exactly drops the 3 large lines. After applying -L 0.9, the file size further decreased from 16G to 2.6G, while the number of records is very close, so it mainly merges similar alleles as expected.

vg_156.vcf.gz vg_patched.vcf.gz vg_patched.cluster.vcf.gz
File size 19G 16G 2.6G
No. of records 24982571 24982568 24982295

There is one minor issue with allele merging. Although vg deconstruct -L 0.9 didn't raise any warning, it doesn't keep the giant alleles in the VCF. I cannot find it by positions or ID. The largest alleles in those regions are much smaller.

bcftools query -f '%ALT' -r chr3:164400000-164500000 vg_patched.cluster.vcf.gz | awk '{print length}' | sort -n -r | head -n 1 
879
bcftools query -f '%ALT' -r chr6:140500000-140600000 vg_patched.cluster.vcf.gz | awk '{print length}' | sort -n -r | head -n 1 
638
bcftools query -f '%ALT' -r chr13:72800000-72900000 vg_patched.cluster.vcf.gz | awk '{print length}' | sort -n -r | head -n 1 
5701

Even so, the patched vg works well for my analysis. I can separately analyze the large SVs because they are very rare. I greatly appreciate your help and suggestions!