ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs
Other
481 stars 106 forks source link

diploid coverage drop-off #1353

Open glennhickey opened 2 months ago

glennhickey commented 2 months ago

On the apes, sex chromosome coverage is about 5% lower when aligning diploid (instead of haploid) versions of the same samples. Autosomes are about the same.

I'm trying to extract a small region that illustrates this from the real data. But in the meantime, here's a hack to reproduce the same effect using the test data, by adding a yeast chromosome as a dummy haplotype. The coverage drops >10% here, but it seems like an easier case -- because the other haplotypes are effectively unaligned, they shouldn't get int the way by fragmenting outgroups, breaking chains etc.

# download yeast
wget -q https://github.com/ComparativeGenomicsToolkit/cactusTestData/raw/master/yeast/S288C.fa.gz
samtools faidx S288C.fa.gz chrI | bgzip > yeast.fa.gz

#make the diploid seqfile
printf  "((simHuman_chr6:0.144018,((simMouse_chr6:0.001,simMouse_chrY:0.01):0.084509,(simRat_chr6:0.001,simRat_chrY:0.01):0.091589)mr:0.271974):0.020593,((simCow_chr6:0.001,simCow_chrY:0.01):0.18908,(simDog_chr6:0.001,simDog_chrY:0.01):0.16303):0.032898);\n \
simCow_chr6 https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simCow.chr6\n \
simCow_chrY yeast.fa.gz\n \
simDog_chr6 https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simDog.chr6\n \
simDog_chrY yeast.fa.gz\n \
simHuman_chr6 https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simHuman.chr6\n \
simMouse_chr6 https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simMouse.chr6\n \
simMouse_chrY yeast.fa.gz\n \
simRat_chr6 https://raw.githubusercontent.com/ComparativeGenomicsToolkit/cactusTestData/master/evolver/mammals/loci1/simRat.chr6\n \
simRat_chrY yeast.fa.gz\n \
" > evolverDiploid.txt

# make sure only leaf outgroups (this seems to help a bit)
cat src/cactus/cactus_progressive_config.xml | sed 's/greedyPreference/greedyLeaves/g' > config-leaves.xml

# do the diploid and haploid alignments
rm -rf js diploid.log ; cactus ./js evolverDiploid.txt diploid.hal  --logFile diploid.log --configFile config-leaves.xml 
rm -rf js haploid.log ; cactus ./js examples/evolverMammals.txt haploid.hal --logFile haploid.log --configFile config-leaves.xml

# compare the coverage
halStats haploid.hal --coverage simHuman_chr6  > haploid_coverage.txt
halStats diploid.hal --coverage simHuman_chr6  > diploid_coverage.txt

haploid

Genome, sitesCovered1Times, sitesCovered2Times
simHuman_chr6, 601863, 0
simCow_chr6, 463171, 0
simDog_chr6, 472185, 611
simMouse_chr6, 437295, 592
simRat_chr6, 431918, 573

diploid

Genome, sitesCovered1Times
simHuman_chr6, 601863
simCow_chr6, 401204
simDog_chr6, 414970
simMouse_chr6, 402774
simRat_chr6, 402741