ComparativeGenomicsToolkit / hal

Hierarchical Alignment Format
Other
160 stars 40 forks source link

corrupted hal files?, gaps in Snake visualization #242

Open tdlong opened 2 years ago

tdlong commented 2 years ago

Matthew:

Thanks as always!

I think it useful to have the discussion in the Google group (I find I use it quite a lot). So I will follow up here and submit a bug report.

Version = singularity run /dfs6/singularity-images/cactus-2.0.3.simg

Is there any way to look at the fly.hal file itself in order to determine if the problem is the Snake track's representation of the hal file or the hal file itself? HAL files are this beautiful thing (there is a lot of information in there!), but the tools for interacting with these files were perhaps never fully developed nor documented. As a user it is difficult to go in and look at them in a meaningful way. I get this as perhaps only a dozen or so people in the world need ever interact with these files directly.

#########

I have gone back and rerun the progressive cactus pipeline all the way through assembly hub creation to make sure I can replicate the problem. The problem happens whether or not in include the --lod switch in the hal2assemblyhub. So perhaps it is the hal file itself.

module load singularity/3.7.2 CAC="singularity run /dfs6/singularity-images/cactus-2.0.3.simg"

Preprocessor

$CAC cactus-preprocess jobstore_prepare_2/0 genomes/fly.txt steps-output_2/fly.txt --inputNames dm6 A1 A2 --realTimeLogging --logInfo --retryCount 0 $CAC cactus-preprocess jobstore_prepare_2/1 genomes/fly.txt steps-output_2/fly.txt --inputNames A3 A4 A5 --realTimeLogging --logInfo --retryCount 0 $CAC cactus-preprocess jobstore_prepare_2/2 genomes/fly.txt steps-output_2/fly.txt --inputNames A6 A7 AB8 --realTimeLogging --logInfo --retryCount 0 $CAC cactus-preprocess jobstore_prepare_2/3 genomes/fly.txt steps-output_2/fly.txt --inputNames B1 B2 B3 --realTimeLogging --logInfo --retryCount 0 $CAC cactus-preprocess jobstore_prepare_2/4 genomes/fly.txt steps-output_2/fly.txt --inputNames B4 B6 B7 --realTimeLogging --logInfo --retryCount 0 $CAC cactus-preprocess jobstore_prepare_2/5 genomes/fly.txt steps-output_2/fly.txt --inputNames OreR --realTimeLogging --logInfo --retryCount 0

Alignment

Round 0

$CAC cactus-blast jobstore_prepare_2/6 steps-output_2/fly.txt steps-output_2/Anc0.cigar --root Anc0 --realTimeLogging --logInfo --retryCount 0 --maxCores 32 $CAC cactus-align jobstore_prepare_2/7 steps-output_2/fly.txt steps-output_2/Anc0.cigar steps-output_2/fly.hal --root Anc0 --realTimeLogging --logInfo --retryCount 0 --maxCores 8 $CAC hal2fasta steps-output_2/fly.hal Anc0 --hdf5InMemory > steps-output_2/Anc0.fa $CAC hal2assemblyHub.py jobstore3 steps-output_2/fly.hal flyAssemblyHub_nolod --maxCores 16 $CAC hal2assemblyHub.py jobstore3 steps-output_2/fly.hal flyAssemblyHub_wlod --lod --maxCores 16

#########

Here is a UCSC browser session I'd like to share with you: http://genome.ucsc.edu/s/tdlong%40uci.edu/bug_report

This illustrates the missing information

There is a direct link to the hal here:

http://wfitch.bio.uci.edu/~tdlong/SantaCruzTracks/flyAssemblyHub_nolod/fly.hal

######### reply from help desk ##########

On Apr 13, 2022, at 2:26 PM, Matthew Speir mspeir@ucsc.edu wrote:

Hi, Tony.

Thank you for your question about snake tracks in the UCSC Genome Browser.

From talking with one of our collaborators on the Cactus project, it sounds like this is an issue with the LODs. The LODs work by skipping small alignment blocks as you zoom out, which can lead to gaps in the display. They recommended not using the LOD file and specifying the HAL file directly as well as filing a bug ticket here: https://github.com/ComparativeGenomicsToolkit/hal/.

I hope this is helpful. If you have any further questions, please reply to genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Group. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

Training videos & resources: http://genome.ucsc.edu/training/index.html

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

Matthew Speir UCSC Cell Browser, Quality Assurance and Data Wrangler Human Cell Atlas, User Experience Researcher UCSC Genome Browser, User Support UC Santa Cruz Genomics Institute Revealing life’s code.

On Fri, Apr 8, 2022 at 3:05 PM Tony [tdlong@uci.edu](mailto:tdlong@uci.edu) wrote: All:

I built an assembly hub using something like:

hal2assemblyHub.py jobstore xx.hal AssemblyHub --lod --maxCores 8

There are big gaps in my browser, but my assemblies are excellent (and this happens throughout the genome). Do I have to manually interpolate the bins differently??

Thanks.

tdlong commented 2 years ago

the fly.txt file pointing to the assemblies looks like this:

*dm6 genomes/dm6.fa A1 genomes/a1.scaffold.fasta A2 genomes/a2.scaffold.fasta ...

But my assemblies are chromosome length scaffolds (~25Mb each in flies). Could this be the problem, should I be specifying these other genomes using something like:

dm6 genomes/dm6.fa A1 genomes/a1.scaffold.fasta *A2 genomes/a2.scaffold.fasta

The role of the "*" is not really defined anywhere...

tdlong commented 2 years ago

/share/adl/tdlong/DSPR/PCfly/PC2022/flyAssemblyHub_3_wlod module load singularity/3.7.2 CAC="singularity run /dfs6/singularity-images/cactus-2.0.3.simg" $CAC hal2maf fly.hal fly.maf --refGenome dm6 --refTargets blah.bed blah.bed = chr3L 8638711 8640288

cat fly.maf ... s B3.3L 8692801 354 + 28367679 CTAAGTTAACAGCCAGTCAGGCGTGTTAATACCGAGTGATTCTAGTCATGGAATGTAATAGGTATTAAGAATGTTTTAATTAATAAAAGTGACTTTGATTGGATGAAAGAAATGTGTGTGCATTATGTGTTATAAATATAAGCAAGCTAATTTACGTA----TGTCTGAATCAGATTTTTATATTTCAAGTGTACTTATACGACGTGGAACTATTGTG TAATTTACAGGCATTTGCACCTGAAATTTACGCTTAAAGTGCAATATATTTTGAATCGTACATTTCAACATGCCGACAACGTAATTGTATCACAATAATTTGAATATGACGAGCTCTTTTGTCTGTCAGCCTGCCTTCTT s B4.3L 8686318 358 + 28088621 CTAAGTTAACAGCCAGTCAGGCGTGTTAATACCGAGTGATTCTAGTCATGGAATGTAATAGGTATTAAGAATGTTTTAATTAATAAAAGTGACTTTGATTGGATGAAAGAAATGTGTGTGCATTATGTGTTATAAATATAAGCAAGCTAATTTACGTACATATGTATGAATCAGATTTTTATATATCAAGTGTGATCATATGACGTGGAACTATTATG TAATTTAAAGGCATTTGCACCTGAAATTTACGCTTAAAGTGCAATATATTTTGAATCGTACATTTCAACATGCCGACAACGTAATTGTATCACAATAATTTGAATATGACGAGCTCTTTTGTCTGTCAGCCTGCCTTCTT s B7.3L 8705210 358 + 28555313 CTAAGTTAACAGCCAGTCAGGCGTGTTAATACCGAGTGATTCTAGTCATGGAATGTAATAGGTATTAAGAATGTTTTAATTAATAAAAGTGACTTTGATTGGATGAAAGAAATGTGTGTGCATTATGTGTTATAAATATAAGCAAGCTAATTTACGTACATATGTATGAATCAGATTTTTATATATCAAGTGTGATCATATGACGTGGAACTATTATG TAATTTAAAGGCATTTGCACCTGAAATTTACGCTTAAAGTGCAATATATTTTGAATCGTACATTTCAACATGCCGACAACGTAATTGTATCACAATAATTTGAATATGACGAGCTCTTTTGTCTGTCAGCCTGCCTTCTT ...

B6.3L is totally missing from the alignments for this region ... consistent with the visual of the SNAKE track so I conclude the problem is the maf file.

But it is totally there in the raw fasta genome that was aligned. cat genomes_3/b6.scaffold.fasta | fold -w 100 | grep TTAAAACATTAAGTAGGGCCAAAC

glennhickey commented 2 years ago

Hi. I don't think there's reason to believe that there is an error in the browser display or that your HAL file is corrupt. It seems that this expected alignment is missing. This is hard to debug from here, but one idea would be to use hal2maf referenced on B6 to see where, if anywhere, B6.3L is aligning.

tdlong commented 2 years ago

If I understand your request.

$CAC hal2maf fly.hal fly.B6.maf --refGenome B6 --refTargets blah2.bed This appears to be an interval in B6 with no sequence in the browser blah2.bed = 3L 8404100 8594000

It looks perfectly good, here is the output: http://wfitch.bio.uci.edu/~tdlong/fly.B6.maf.gz

Screen Shot 2022-04-26 at 12 07 25 PM

Again, these are near reference sequencing within a species, and these missing hunks are endemic

http://genome.ucsc.edu/s/tdlong%40uci.edu/troubleshoot_Ap26

I am trying rerunning progress cactus (not using cactus-prepare) to see if the problem goes away in the closer to default usage.

glennhickey commented 2 years ago

If you want to run another experiment, might I suggest a pairwise cactus alignment with B6 and dm6?

Also, I see your hal file is based on a star tree

(dm6:1,A1:1,A2:1,A3:1,A4:1,A5:1,A6:1,A7:1,AB8:1,B1:1,B2:1,B3:1,B4:1,B6:1,B7:1,OreR:1)Anc0;

which I'd expect to work reasonably well (sometimes alignments can be lost across several ancestral nodes), but we've done most of our testing and tuning on binary guide trees, so I don't know if that could be an issue.

tdlong commented 2 years ago

Sorry that it takes a couple of days to respond. I am doing experiments to try and figure out and reproduce the problem.

I can now reproduce the problem with a somewhat compact example. There is some weird transitive problem, as the alignments work fine if I only give PC a small region or two genomes.

This is just one example of the problem, I see many more. And these are genomes that have aligned perfectly in the past (with an older version of PC). It could be the version of cactus...but I have not re-run using older versions. (the computer people upgraded our cluster and the older version of cactus was incompatible).

I have extracted single chromosome arms for each assembly to reproduce the problem. Here are my observations:

  1. If I only use a small region for 16 strains (extracted via isPcr) the assembly is fine.
  2. If I only use dm6 and b6 the assembly is fine
  3. If I use all the samples and do hal2maf for a region ref=dm6 there is no b6
  4. If I use all the samples and do hal2maf for aligning region ref=b6 is does not map to dm6
  5. but in case number 4 ref=b6 maps to other strains that DO map to dm6

Here is the code to reproduce the problem

module load singularity/3.7.2
CAC="singularity run /dfs6/singularity-images/cactus-2.0.3.simg"
$CAC cactus jobStore genomes_5/fly.2L.txt fly.hal
# blah1.bed = chr2L 8680000 8710000
$CAC hal2maf fly.hal fly.dm6.maf --refGenome dm6 --refTargets blah1.bed
cat fly.dm6.maf | grep b6
# blah3.bed = 2L    8648002 8678002
$CAC hal2maf fly.hal fly.b6.maf --refGenome b6 --refTargets blah3.bed
cat fly.b6.maf | grep dm6

Here is the data/hal/log:

http://wfitch.bio.uci.edu/~tdlong/genomes_5.tar.gz http://wfitch.bio.uci.edu/~tdlong/fly.hal http://wfitch.bio.uci.edu/~tdlong/fly.log.gz

The tgz archive is 111M (16 genomes * 25Mb) and it took about 4.5 hours to run using 8 cores and peaked at 32G of memory.

tdlong commented 2 years ago

Glen:

I posted an update to the Comparative Genomics Toolkit a couple of weeks ago. I just wanted to make sure you saw it. The update should allow the problem to be reproduced with modest effort (including computational effort). If the problem reproduces I feel this should be elevated to a bug. The alignments are incorrect, and this is just once instance of many, progressive cactus has alway just worked in the past.

If it cannot be reproduced then it is my singularity image (cactus-2.0.3.simg) or something about my cluster. I think this would be a good outcome.

T.

glennhickey commented 2 years ago

@tdlong Just to confirm, have you tried aligning using a guide tree?

tdlong commented 2 years ago

No, I do not specify a tree. These are all genomes from within a species. So the “tree” is really a genealogy, and the topology of that genealogy could (and I guess will) vary as you slide through a genome.

In the past I have never specified a guide tree and progressive cactus has always just worked and generated these beautiful alignments across the entire genome. I wanted to update the alignments I did 2+ years ago because I had one additional assembled genome. I did the assemblies on our brand new cluster that has the singularity image installed … as the old cluster and progressive cactus install is gone. So I feel like something has changed in the code (or the singularity image is not portable).

With that example I gave you. You can see the problem. You could go in a in silico PCR the entire regions in each assembly and it is there and should be alignable. So I suspect this is a boneheaded bug. This is why I feel the first step is just reproducing it.

T.

On May 12, 2022, at 12:50 PM, Glenn Hickey @.***> wrote:

@tdlong https://urldefense.com/v3/__https://github.com/tdlong__;!!CzAuKJ42GuquVTTmVmPViYEvSg!P18PUNWDY4DJO0bcF_6kp6QTV_kz-XL-DrQwy0_BCqqi_EvutoX6E2swu4ZZbT9TvmvZ4RuzDYZKJpP7S6lxd50$ Just to confirm, have you tried aligning using a guide tree?

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1125363957__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!P18PUNWDY4DJO0bcF_6kp6QTV_kz-XL-DrQwy0_BCqqi_EvutoX6E2swu4ZZbT9TvmvZ4RuzDYZKJpP7xKzWkUY$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2NZXZVT5TZCINBX4QXLVJVOHJANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!P18PUNWDY4DJO0bcF_6kp6QTV_kz-XL-DrQwy0_BCqqi_EvutoX6E2swu4ZZbT9TvmvZ4RuzDYZKJpP7VGqcOzA$. You are receiving this because you were mentioned.

glennhickey commented 2 years ago

Got it. I will try to reproduce here.

glennhickey commented 2 years ago

I can reprodce. Thanks for putting together a reduced dataset. For the record, I ran (with the current master https://github.com/ComparativeGenomicsToolkit/cactus/commit/8c56cbf8c54bf0fc215012775c79f8b3368055ec)

# Preprocessor
cactus-preprocess ./jobstore/0 ./fly.2L.txt prep/fly.2L.txt --inputNames a1 a3 a5 a7 b1 b3 b6 dm6 a2 a4 a6 ab8 b2 b4 b7 ore --realTimeLogging --logInfo --retryCount 0 
## Alignment
cactus-blast ./jobstore/1 prep/fly.2L.txt prep/Anc0.cigar --root Anc0 --realTimeLogging --logInfo --retryCount 0 

Then I made a pairwise cigar with

grep 'id=12' prep/Anc0.cigar | grep 'id=14' > prep/Anc0.pair.cigar
grep 'id=12' prep/Anc0.cigar.secondary | grep 'id=14' > prep/Anc0.pair.cigar.secondary

Then aligned with

cactus-align ./jobstore/2 prep/fly.2L.txt prep/Anc0.cigar prep/default.hal --root Anc0 --realTimeLogging --logInfo
cactus-align ./jobstore/2 prep/fly.2L.txt prep/Anc0.pair.cigar prep/pair.hal --root Anc0 --realTimeLogging --logInfo

And if I look at the maf's in the region you metnion, I see the alignment in the pair verison but no the full

hal2maf prep/default.hal prep/default.maf --refGenome dm6 --targetGenomes b6 --refSequence chr2L --start 8680000 --length 30000
hal2maf prep/pair.hal prep/pair.maf --refGenome dm6 --targetGenomes b6 --refSequence chr2L --start 8680000 --length 30000

When I do things like turn off various CAF filters and/or BAR, I still don't see the alignment for either default or pair. This strongly implies that the alignment in question is not captured by lastz. In the pairwise case, the cactus base aligner (BAR) gets it, but not in the full case. Since you are sure this used to work, I think the most likely assumption is a regression in the base aligner. There are some slowish experiments I will run to try to see if this is the case.

As an aside: is this data public? I'm in the process of writing up the new "pangenome" mode for cactus which is designed for datasets like this (I will also test it in your data). The advantages are that it

Let me know if any of these areas interest you and you'd be willing to collaborate...

tdlong commented 2 years ago

Glen:

This is a very good and bad result. It is good because it means I am not crazy (at least not yet). It is bound to happen one day.

It is bad, because, this should work.

These data are public. Except for one genome, but it should be public, so you can treat the data as public, certainly there are no constraints on your using it.

Here is the paper:

https://www.nature.com/articles/s41467-019-12884-1 https://www.nature.com/articles/s41467-019-12884-1

I only shared a single chromosome with you so it is fast. And in fact it may be better to do the alignments that way and merge the HALs. I could easily share more. My goal was to share less, I would like a reproducible example that runs with the least data possible.

This would be a great dataset for testing your new ideas. Years ago we tried to force the alignments into a pre-release of "vg" ... but that seemed to crash it at the time. Structural variants in these assemblies create all sort of problems for graph-based ways of representing genomes. But I know you guys have been working hard on this.

T.

On May 13, 2022, at 11:32 AM, Glenn Hickey @.***> wrote:

I can reprodce. Thanks for putting together a reduced dataset. For the record, I ran (with the current master @.*** https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/cactus/commit/8c56cbf8c54bf0fc215012775c79f8b3368055ec__;!!CzAuKJ42GuquVTTmVmPViYEvSg!MbpR8XsWrHpw5nggiuoZXrJAGmevgntdn_OL1G8VYGHI_olISaFYRvE8i57Qe-qTFgb7MWjbtjXyPCbKnbSP8BQ$)

Preprocessor

cactus-preprocess ./jobstore/0 ./fly.2L.txt prep/fly.2L.txt --inputNames a1 a3 a5 a7 b1 b3 b6 dm6 a2 a4 a6 ab8 b2 b4 b7 ore --realTimeLogging --logInfo --retryCount 0

Alignment

cactus-blast ./jobstore/1 prep/fly.2L.txt prep/Anc0.cigar --root Anc0 --realTimeLogging --logInfo --retryCount 0 Then I made a pairwise cigar with

grep 'id=12' prep/Anc0.cigar | grep 'id=14' > prep/Anc0.pair.cigar grep 'id=12' prep/Anc0.cigar.secondary | grep 'id=14' > prep/Anc0.pair.cigar.secondary Then aligned with

cactus-align ./jobstore/2 prep/fly.2L.txt prep/Anc0.cigar prep/default.hal --root Anc0 --realTimeLogging --logInfo cactus-align ./jobstore/2 prep/fly.2L.txt prep/Anc0.pair.cigar prep/pair.hal --root Anc0 --realTimeLogging --logInfo And if I look at the maf's in the region you metnion, I see the alignment in the pair verison but no the full

hal2maf prep/default.hal prep/default.maf --refGenome dm6 --targetGenomes b6 --refSequence chr2L --start 8680000 --length 30000 hal2maf prep/pair.hal prep/pair.maf --refGenome dm6 --targetGenomes b6 --refSequence chr2L --start 8680000 --length 30000 When I do things like turn off various CAF filters and/or BAR, I still don't see the alignment for either default or pair. This strongly implies that the alignment in question is not captured by lastz. In the pairwise case, the cactus base aligner (BAR) gets it, but not in the full case. Since you are sure this used to work, I think the most likely assumption is a regression in the base aligner. There are some slowish experiments I will run to try to see if this is the case.

As an aside: is this data public? I'm in the process of writing up the new "pangenome" mode for cactus which is designed for datasets like this (I will also test it in your data). The advantages are that it

scales better with number of genomes (ie you could do star tree of hundreds of flies) can make a graph that can be used for mapping experiments can convert to VCF, which makes reasoning about variants easier. Let me know if any of these areas interest you and you'd be willing to collaborate...

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1126335918__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!MbpR8XsWrHpw5nggiuoZXrJAGmevgntdn_OL1G8VYGHI_olISaFYRvE8i57Qe-qTFgb7MWjbtjXyPCbK4qgMcn0$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2NZYQ3BGUWR3ODWMDEDVJ2N27ANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!MbpR8XsWrHpw5nggiuoZXrJAGmevgntdn_OL1G8VYGHI_olISaFYRvE8i57Qe-qTFgb7MWjbtjXyPCbKQvuyuQU$. You are receiving this because you were mentioned.

tdlong commented 2 years ago

Not sure if this helps but for sure it was working in 2018. The log file has some information that perhaps tells you the version

conf_kc_features: (atomic)(zlib) conf_kc_version: 1.2.76 (16.13) conf_kt_features: (epoll) conf_kt_version: 0.9.56 (2.19) conf_os_name: Linux

glennhickey commented 2 years ago

Thanks, are you able to share the hal from this 2018 version? That may speed up figuring where the regression happened, as I haven't had much luck so far.

The pangenome mode does get what looks like the expected alignment. Further, it seems to have (suspiciously) better coverage. When I run halStats --coverage dm6 on the hal from the latest cactus (all default):

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times
dm6, 23513712, 228181, 13173, 0, 0, 0
a1, 12236242, 24569, 0, 0, 0, 0
ab8, 11417078, 26983, 0, 0, 0, 0
b2, 11047837, 3013, 0, 0, 0, 0
a6, 10960833, 21535, 0, 0, 0, 0
a4, 10301532, 21312, 0, 0, 0, 0
a2, 11220426, 41777, 12793, 10736, 3663, 2247
a3, 11362571, 19976, 0, 0, 0, 0
a5, 11706288, 56566, 0, 0, 0, 0
b3, 9946003, 25101, 207, 0, 0, 0
b1, 11284655, 10733, 0, 0, 0, 0
a7, 11496279, 109878, 8368, 3097, 149, 0
b6, 10694472, 5961, 0, 0, 0, 0
b4, 10768995, 19234, 4849, 4845, 0, 0
b7, 9889045, 7413, 0, 0, 0, 0
ore, 10818704, 17804, 2774, 0, 0, 0

but the pangenome hal is double

Warning: --inMemory is obsolete, use --hdf5InMemory
Genome, sitesCovered1Times, sitesCovered2Times
dm6, 23513712, 0
a7, 22058687, 45108
a2, 22096938, 4595
ab8, 22181976, 2764
a6, 22036657, 4419
b3, 22194929, 10106
a1, 22375952, 5688
a3, 22178812, 1632
b1, 22091194, 0
b7, 22329198, 5690
a4, 22174204, 7991
b4, 22035136, 9700
b2, 22110620, 1163
b6, 22100039, 0
ore, 22147698, 931
a5, 22198881, 4065

with the pairwise version being lower with the regular hal

Genome, sitesCovered1Times, sitesCovered2Times
dm6, 23513712, 12236
b6, 8194493, 14598

so I think there's something even more fundamentally wrong than I'd previously thought. In any case, this does look like a good application for the pangenome pipeline, so I will build full genome graph with that while I'm trying to figure out what's going on with the default mode.

tdlong commented 2 years ago

I will look harder. Stupid me replaced the Hal with the new one I created. And I didn’t notice the problem right away until I started manually looking at alignments … I mean in the past the new Hal was always better than the old one. I was hoping I could just find a 4 year old install. Even then it was so strange I though it was a weird thing like a compile problem or something.

I agree it a good test dataset. The assemblies are quite good. And my extracting a single chromosome makes it run pretty quickly with even modest resources. Also it is nice that it is all one species.

We could try to shrink the length? I could try to find some unique sequence tag and pull out a 4Mb hunk. This could allow you to debug in near real time ?? Obviously I don’t understand what is happening in the background (except lots of lastz), so It is really difficult for me to wrap my head around.

Sent from my iPhone

On May 16, 2022, at 6:01 PM, Glenn Hickey @.***> wrote:

 Thanks, are you able to share the hal from this 2018 version? That may speed up figuring where the regression happened, as I haven't had much luck so far.

The pangenome mode does get what looks like the expected alignment. Further, it seems to have (suspiciously) better coverage. When I run halStats --coverage dm6 on the hal from the latest cactus (all default):

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times dm6, 23513712, 228181, 13173, 0, 0, 0 a1, 12236242, 24569, 0, 0, 0, 0 ab8, 11417078, 26983, 0, 0, 0, 0 b2, 11047837, 3013, 0, 0, 0, 0 a6, 10960833, 21535, 0, 0, 0, 0 a4, 10301532, 21312, 0, 0, 0, 0 a2, 11220426, 41777, 12793, 10736, 3663, 2247 a3, 11362571, 19976, 0, 0, 0, 0 a5, 11706288, 56566, 0, 0, 0, 0 b3, 9946003, 25101, 207, 0, 0, 0 b1, 11284655, 10733, 0, 0, 0, 0 a7, 11496279, 109878, 8368, 3097, 149, 0 b6, 10694472, 5961, 0, 0, 0, 0 b4, 10768995, 19234, 4849, 4845, 0, 0 b7, 9889045, 7413, 0, 0, 0, 0 ore, 10818704, 17804, 2774, 0, 0, 0 but the pangenome hal is double

Warning: --inMemory is obsolete, use --hdf5InMemory Genome, sitesCovered1Times, sitesCovered2Times dm6, 23513712, 0 a7, 22058687, 45108 a2, 22096938, 4595 ab8, 22181976, 2764 a6, 22036657, 4419 b3, 22194929, 10106 a1, 22375952, 5688 a3, 22178812, 1632 b1, 22091194, 0 b7, 22329198, 5690 a4, 22174204, 7991 b4, 22035136, 9700 b2, 22110620, 1163 b6, 22100039, 0 ore, 22147698, 931 a5, 22198881, 4065 with the pairwise version being lower with the regular hal

Genome, sitesCovered1Times, sitesCovered2Times dm6, 23513712, 12236 b6, 8194493, 14598 so I think there's something even more fundamentally wrong than I'd previously thought. In any case, this does look like a good application for the pangenome pipeline, so I will build full genome graph with that while I'm trying to figure out what's going on with the default mode.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

tdlong commented 2 years ago

Glenn:

Here is the most recent one I seem to have

https://wfitch.bio.uci.edu/~tdlong/SantaCruzTracks/DSPR2017/fly.hal

The enclosing folder is actually a full-on track hub with a "hub.txt" file.

On May 16, 2022, at 6:01 PM, Glenn Hickey @.***> wrote:

Thanks, are you able to share the hal from this 2018 version? That may speed up figuring where the regression happened, as I haven't had much luck so far.

The pangenome mode does get what looks like the expected alignment. Further, it seems to have (suspiciously) better coverage. When I run halStats --coverage dm6 on the hal from the latest cactus (all default):

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times dm6, 23513712, 228181, 13173, 0, 0, 0 a1, 12236242, 24569, 0, 0, 0, 0 ab8, 11417078, 26983, 0, 0, 0, 0 b2, 11047837, 3013, 0, 0, 0, 0 a6, 10960833, 21535, 0, 0, 0, 0 a4, 10301532, 21312, 0, 0, 0, 0 a2, 11220426, 41777, 12793, 10736, 3663, 2247 a3, 11362571, 19976, 0, 0, 0, 0 a5, 11706288, 56566, 0, 0, 0, 0 b3, 9946003, 25101, 207, 0, 0, 0 b1, 11284655, 10733, 0, 0, 0, 0 a7, 11496279, 109878, 8368, 3097, 149, 0 b6, 10694472, 5961, 0, 0, 0, 0 b4, 10768995, 19234, 4849, 4845, 0, 0 b7, 9889045, 7413, 0, 0, 0, 0 ore, 10818704, 17804, 2774, 0, 0, 0 but the pangenome hal is double

Warning: --inMemory is obsolete, use --hdf5InMemory Genome, sitesCovered1Times, sitesCovered2Times dm6, 23513712, 0 a7, 22058687, 45108 a2, 22096938, 4595 ab8, 22181976, 2764 a6, 22036657, 4419 b3, 22194929, 10106 a1, 22375952, 5688 a3, 22178812, 1632 b1, 22091194, 0 b7, 22329198, 5690 a4, 22174204, 7991 b4, 22035136, 9700 b2, 22110620, 1163 b6, 22100039, 0 ore, 22147698, 931 a5, 22198881, 4065 with the pairwise version being lower with the regular hal

Genome, sitesCovered1Times, sitesCovered2Times dm6, 23513712, 12236 b6, 8194493, 14598 so I think there's something even more fundamentally wrong than I'd previously thought. In any case, this does look like a good application for the pangenome pipeline, so I will build full genome graph with that while I'm trying to figure out what's going on with the default mode.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1128288994__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!LXUmZuPWTHRFoAZI0hqIEjkO6nAbB9FOaOgACGPncnULN0LYZxrAqA4rpgjWGPM7tq4mXRe8vg86LVGMx5oxO08$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2N73YGU3S6SMNJLHO6TVKLVWRANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!LXUmZuPWTHRFoAZI0hqIEjkO6nAbB9FOaOgACGPncnULN0LYZxrAqA4rpgjWGPM7tq4mXRe8vg86LVGMF6hRXUs$. You are receiving this because you were mentioned.

glennhickey commented 2 years ago

Thanks. That helps a bit. I was hoping it would include the cactus commit (via halStats --metadata), but I guess Cactus only started doing that more recently. One thing I noticed is that the masking is different -ie here are way more lower-case characters in B6 in that old hal than the equivalent new one. Extracting the sequences from the old hal, and realigning with hte latest cactus gives better (but not perfect) coverage in the problem region. Do you have any idea why this might be the case -- ideally all input should be softmasked with repeatmasker.

One encouraging note, the latest development branch of cactus seems to work fine (for this case). It uses a completely different way of filtering the lastz alignments (which seems, contrary to my prior expectations, to be where the problem is arising). I don't know when it will make it into a release (hopefully soon), but I'll make sure to test it on this data again at that point (the current version has lots of known issues and seems to be dropping alignments on other species like a6). I think time is better spent getting this branch working well than continuing to find the regression point from 5 years ago.

Anyway, in the meantime I'm going to make and write up a pangenome alignment of the 12 assemblies as part of the cactus pangenome paper. This should be much cleaner, and I will share it with you when it's ready.

tdlong commented 2 years ago

On May 17, 2022, at 2:14 PM, Glenn Hickey @.***> wrote:

Thanks. That helps a bit. I was hoping it would include the cactus commit (via halStats --metadata), but I guess Cactus only started doing that more recently. One thing I noticed is that the masking is different -ie here are way more lower-case characters in B6 in that old hal than the equivalent new one. Extracting the sequences from the old hal, and realigning with hte latest cactus gives better (but not perfect) coverage in the problem region. Do you have any idea why this might be the case -- ideally all input should be softmasked with repeatmasker.

At some point I concluded this did not matter. Maybe that changed. Clearly I can fix that.

One encouraging note, the latest development branch https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/cactus/tree/paf_chains__;!!CzAuKJ42GuquVTTmVmPViYEvSg!IhwbgkN8CjOvbdSa05ylve5V0Imr5qyFrtIt_xSOw6qxoGusw_aNkFoIUZBtl4ShE51dHMyjY0vMzUG2lPCnmLc$ of cactus seems to work fine (for this case). It uses a completely different way of filtering the lastz alignments (which seems, contrary to my prior expectations, to be where the problem is arising). I don't know when it will make it into a release (hopefully soon), but I'll make sure to test it on this data again at that point (the current version has lots of known issues and seems to be dropping alignments on other species like a6). I think time is better spent getting this branch working well than continuing to find the regression point from 5 years ago.

For sure! Especially if you are seeing the problem elsewhere (and you have to be). Just in this case of within species it jumps off the page a little more obviously.

Anyway, in the meantime I'm going to make and write up a pangenome alignment of the 12 assemblies as part of the cactus pangenome paper. This should be much cleaner, and I will share it with you when it's ready.

I will need you to explain this to me. My understanding is that the new pan genome approach gives me everything I had before (and more). Clearly from the point of view of the track-hubs I host (for the community interested in variation within flies), it is pretty important to have a Hal file so they can browse variation if they are gene focused.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1129324080__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!IhwbgkN8CjOvbdSa05ylve5V0Imr5qyFrtIt_xSOw6qxoGusw_aNkFoIUZBtl4ShE51dHMyjY0vMzUG2iepKWmc$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2N5PV2KEU7JLRFFWLULVKQD47ANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!IhwbgkN8CjOvbdSa05ylve5V0Imr5qyFrtIt_xSOw6qxoGusw_aNkFoIUZBtl4ShE51dHMyjY0vMzUG2_sk0_BQ$. You are receiving this because you were mentioned.

glennhickey commented 2 years ago

Here is a (very preliminary) hal from the new approach: http://public.gi.ucsc.edu/~hickey/debug/fly-pg.hal Annoyingly, it adds ".0" to all the genome names except dm6, but this is fairly easily fixed (ie with halRenameGenomes). Likewise there is a virtual "MINIGRAPH" genome in there that can be safely removed (halRemoveGenome).

The main difference with regular cactus is that it keeps a reference genome (dm6 here) uncollapsed. This makes variants easier to reason about as each site on the reference corresponds to a unique location in the alignment/graph. This comes at the cost of paralogous alignments (ie segdeupes) not being directly represented. This comes out in the coverage stats, where B6 has about 1mb less coverage with the pangenome mode:

pangenome coverage:

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times, sitesCovered7Times, sitesCovered8Times, sitesCovered9Times, sitesCovered10Times, sitesCovered11Times, sitesCovered12Times, sitesCovered13Times, sitesCovered14Times, sitesCovered15Times, sitesCovered16Times, sitesCovered17Times, sitesCovered18Times, sitesCovered19Times, sitesCovered20Times, sitesCovered21Times, sitesCovered22Times, sitesCovered23Times, sitesCovered24Times, sitesCovered25Times, sitesCovered26Times, sitesCovered27Times, sitesCovered28Times, sitesCovered29Times, sitesCovered30Times, sitesCovered31Times, sitesCovered32Times, sitesCovered33Times, sitesCovered34Times, sitesCovered35Times, sitesCovered36Times, sitesCovered37Times, sitesCovered38Times, sitesCovered39Times, sitesCovered40Times, sitesCovered41Times, sitesCovered42Times, sitesCovered43Times, sitesCovered44Times, sitesCovered45Times, sitesCovered46Times, sitesCovered47Times, sitesCovered48Times, sitesCovered49Times, sitesCovered50Times
dm6, 143221210, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
B4.0, 124228214, 393823, 92418, 65856, 43950, 32022, 24741, 17001, 15238, 12889, 10342, 6986, 4090, 858, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
A2.0, 123720389, 4295095, 222853, 51246, 33387, 18682, 11938, 9475, 8427, 6818, 5633, 5265, 1975, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
B6.0, 124630181, 1317027, 119448, 68273, 48901, 37279, 28326, 16813, 7413, 809, 381, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
A6.0, 123358333, 556219, 53676, 23696, 14462, 7877, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
AB8.0, 125467578, 526404, 128809, 70272, 44368, 30049, 21089, 18594, 18208, 17572, 15857, 10968, 10802, 10488, 10053, 9388, 8869, 8337, 7966, 7726, 6714, 6680, 1493, 1148, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
B3.0, 124177497, 334475, 102589, 35099, 13960, 8168, 3892, 1056, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
_MINIGRAPH_, 142533906, 303545, 96080, 46878, 25252, 14429, 7199, 2270, 257, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
A1.0, 124845251, 745861, 116697, 49850, 32775, 27948, 21313, 19757, 18183, 16841, 14906, 7842, 5804, 1149, 1142, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
A5.0, 124859260, 546295, 101864, 68153, 26237, 21361, 16795, 13163, 7777, 2221, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1079, 984, 984, 984, 980, 963, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
B2.0, 124546786, 378546, 62159, 29507, 19365, 11570, 9088, 7347, 4877, 2555, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
OreR.0, 123882646, 1455901, 110977, 41297, 26536, 18860, 16837, 10593, 9540, 7626, 6886, 2753, 1149, 1148, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
A4.0, 125902970, 728215, 102442, 38775, 31947, 23603, 20789, 19429, 19257, 18182, 16511, 16230, 15223, 14173, 12430, 12253, 11649, 11344, 11344, 11282, 11005, 11005, 11004, 10989, 10975, 10821, 10643, 10514, 10414, 9865, 9830, 9744, 9632, 9546, 9516, 9172, 7919, 7746, 7099, 6242, 5462, 5319, 5244, 4994, 4804, 4799, 4793, 3782, 1145, 1139
A7.0, 123988573, 8054250, 192534, 35374, 22160, 16626, 6701, 2023, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
A3.0, 123724180, 297536, 73561, 29458, 13545, 11026, 10495, 10015, 8596, 5319, 643, 402, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
B1.0, 123550399, 985626, 572795, 27539, 18301, 12006, 10063, 9216, 7914, 5636, 4454, 3090, 1934, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

same stats on your old hal:

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times, sitesCovered7Times, sitesCovered8Times, sitesCovered9Times, sitesCovered10Times, sitesCovered11Times, sitesCovered12Times, sitesCovered13Times, sitesCovered14Times, sitesCovered15Times, sitesCovered16Times, sitesCovered17Times, sitesCovered18Times, sitesCovered19Times, sitesCovered20Times, sitesCovered21Times, sitesCovered22Times, sitesCovered23Times, sitesCovered24Times, sitesCovered25Times, sitesCovered26Times, sitesCovered27Times, sitesCovered28Times, sitesCovered29Times, sitesCovered30Times, sitesCovered31Times, sitesCovered32Times, sitesCovered33Times, sitesCovered34Times
dm6, 143726002, 11905239, 5693833, 3535963, 2421423, 1880938, 1531096, 1189748, 898476, 706578, 465678, 323140, 248956, 212010, 180328, 169498, 158682, 152902, 136954, 120823, 107063, 70901, 57701, 51468, 48156, 25281, 18755, 16919, 15939, 12053, 6773, 2557, 2013, 0
A1, 126420711, 9043144, 4131729, 2645113, 1862741, 1493233, 1112344, 916228, 785172, 649309, 504080, 366763, 283341, 221869, 129939, 97933, 86690, 64552, 34683, 26975, 10638, 8736, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
A7, 125731185, 16021029, 4811886, 2919858, 2083845, 1598219, 1298722, 903386, 709684, 560750, 433185, 338887, 253943, 185213, 127096, 87730, 68563, 51784, 49396, 45572, 41791, 35582, 35507, 35342, 34572, 30934, 15524, 4474, 0, 0, 0, 0, 0, 0
B1, 124326748, 8483814, 3935722, 2537502, 1770882, 1348895, 1013293, 830211, 675049, 497174, 385168, 256642, 147304, 113992, 46774, 26481, 18366, 14245, 9590, 7975, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
OreR, 125699354, 9182860, 4105406, 2511447, 1740269, 1308313, 993799, 850781, 697100, 599472, 502379, 404611, 328433, 304765, 244465, 206394, 147217, 105877, 93195, 71862, 64058, 32566, 26585, 25695, 20481, 20476, 20434, 19757, 18441, 11694, 0, 0, 0, 0
A2, 125387584, 12444468, 4892956, 3031279, 2038738, 1429729, 1109064, 902962, 679405, 557315, 473947, 380650, 289019, 221140, 167273, 148579, 132500, 113512, 90821, 55700, 11880, 6118, 5920, 3018, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
B6, 125756734, 9221871, 4165900, 2715288, 1889174, 1468449, 1080350, 864650, 724456, 606872, 493260, 389082, 303702, 234250, 173725, 135678, 108057, 84845, 58777, 44911, 10308, 5634, 5634, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
AB8, 126447727, 8746717, 4073812, 2684098, 1941119, 1430699, 1052031, 856799, 682567, 489184, 406084, 352751, 269525, 240814, 188740, 120482, 112936, 107249, 103235, 84456, 81851, 81642, 79544, 75524, 60172, 51141, 40128, 22816, 4211, 0, 0, 0, 0, 0
A4, 126466900, 8982564, 4270849, 2736124, 1975073, 1473376, 969119, 761904, 628855, 523556, 410914, 310780, 238245, 199478, 175075, 123061, 103053, 66165, 63273, 55512, 53535, 49824, 41335, 30505, 22944, 9944, 3192, 0, 0, 0, 0, 0, 0, 0
A6, 124900416, 8560316, 4051235, 2706189, 1900797, 1390177, 950909, 757923, 561043, 371172, 235931, 166204, 108446, 75777, 67229, 47172, 35636, 31397, 29660, 27015, 20206, 19542, 6520, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
A3, 125167865, 8339572, 4034065, 2649409, 1919031, 1413224, 1002376, 798838, 612567, 527591, 390060, 275994, 209222, 141288, 101025, 74166, 52806, 27405, 10443, 6197, 6197, 6197, 6197, 6197, 6180, 6180, 5668, 5651, 5597, 0, 0, 0, 0, 0
A5, 125794629, 8670206, 4232586, 2800216, 1985280, 1527604, 1129919, 859920, 714749, 582414, 467555, 383480, 243164, 178693, 122226, 72472, 42151, 38191, 38179, 38179, 38035, 37990, 37902, 37288, 37182, 31524, 31432, 29628, 18211, 9338, 4198, 0, 0, 0
B2, 125523254, 8459565, 4089858, 2580985, 1840208, 1365776, 1038939, 773301, 641058, 497014, 381615, 278496, 170462, 139164, 94728, 91842, 85835, 42835, 38147, 38140, 37572, 37454, 31793, 31694, 31663, 31598, 31558, 31478, 31456, 31425, 31296, 29086, 21862, 4020
B4, 125749162, 8546919, 4075977, 2578909, 1886715, 1405123, 1073114, 856086, 709474, 532102, 367462, 243548, 144899, 99952, 60282, 36153, 33081, 25068, 25016, 23963, 6197, 6197, 6197, 6197, 6197, 6197, 6197, 5651, 5580, 3240, 0, 0, 0, 0
B3, 125836100, 8419947, 4111286, 2674598, 2003047, 1525212, 1146585, 904135, 728243, 611712, 435560, 359883, 214218, 165624, 117299, 98200, 92992, 89328, 60234, 58993, 54165, 42028, 28750, 7830, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

I haven't checked your hal, but this new one is split by reference chromosome (this is now an option in the pangenome pipeline). So no contig from any other genome is allowed to align to more than one dm6 contig.

tdlong commented 2 years ago

Glenn:

Great. I will try and look at this ASAP. I of course will want to look at these manually.

I somehow managed to break my trackhubs, by upgrading my web server to SSL late last week. And I further managed to break my hal2assemblyhub install, when the server people migrated all my files from a file server on the cluster I owned to a pay-per-TB condo model. Updating my server and storage were both good things, and I should be able to resolve these little hiccups over the next few days.

These commands seem useful. Is there a list of hal commands somewhere? The git seems to only describe a subset. Is there a way for example to merge two hal files (if, say for example, I decided to create hal files chromosome by chromosome).

T.

On May 18, 2022, at 2:08 PM, Glenn Hickey @.***> wrote:

Here is a (very preliminary) hal from the new approach: http://public.gi.ucsc.edu/~hickey/debug/fly-pg.hal https://urldefense.com/v3/__http://public.gi.ucsc.edu/*hickey/debug/fly-pg.hal__;fg!!CzAuKJ42GuquVTTmVmPViYEvSg!JOVQxg3ODnAf7Jxe4SkhXWpJn2XMPECDMH3bqMZBFR8ABEW-QRtQoRCB_9HeB1fOB5xtW05y8clRa3lku3ub12s$ Annoyingly, it adds ".0" to all the genome names except dm6, but this is fairly easily fixed (ie with halRenameGenomes). Likewise there is a virtual "MINIGRAPH" genome in there that can be safely removed (halRemoveGenome).

The main difference with regular cactus is that it keeps a reference genome (dm6 here) uncollapsed. This makes variants easier to reason about as each site on the reference corresponds to a unique location in the alignment/graph. This comes at the cost of paralogous alignments (ie segdeupes) not being directly represented. This comes out in the coverage stats, where B6 has about 1mb less coverage with the pangenome mode:

pangenome coverage:

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times, sitesCovered7Times, sitesCovered8Times, sitesCovered9Times, sitesCovered10Times, sitesCovered11Times, sitesCovered12Times, sitesCovered13Times, sitesCovered14Times, sitesCovered15Times, sitesCovered16Times, sitesCovered17Times, sitesCovered18Times, sitesCovered19Times, sitesCovered20Times, sitesCovered21Times, sitesCovered22Times, sitesCovered23Times, sitesCovered24Times, sitesCovered25Times, sitesCovered26Times, sitesCovered27Times, sitesCovered28Times, sitesCovered29Times, sitesCovered30Times, sitesCovered31Times, sitesCovered32Times, sitesCovered33Times, sitesCovered34Times, sitesCovered35Times, sitesCovered36Times, sitesCovered37Times, sitesCovered38Times, sitesCovered39Times, sitesCovered40Times, sitesCovered41Times, sitesCovered42Times, sitesCovered43Times, sitesCovered44Times, sitesCovered45Times, sitesCovered46Times, sitesCovered47Times, sitesCovered48Times, sitesCovered49Times, sitesCovered50Times dm6, 143221210, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B4.0, 124228214, 393823, 92418, 65856, 43950, 32022, 24741, 17001, 15238, 12889, 10342, 6986, 4090, 858, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A2.0, 123720389, 4295095, 222853, 51246, 33387, 18682, 11938, 9475, 8427, 6818, 5633, 5265, 1975, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B6.0, 124630181, 1317027, 119448, 68273, 48901, 37279, 28326, 16813, 7413, 809, 381, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A6.0, 123358333, 556219, 53676, 23696, 14462, 7877, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 AB8.0, 125467578, 526404, 128809, 70272, 44368, 30049, 21089, 18594, 18208, 17572, 15857, 10968, 10802, 10488, 10053, 9388, 8869, 8337, 7966, 7726, 6714, 6680, 1493, 1148, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B3.0, 124177497, 334475, 102589, 35099, 13960, 8168, 3892, 1056, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 MINIGRAPH, 142533906, 303545, 96080, 46878, 25252, 14429, 7199, 2270, 257, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A1.0, 124845251, 745861, 116697, 49850, 32775, 27948, 21313, 19757, 18183, 16841, 14906, 7842, 5804, 1149, 1142, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A5.0, 124859260, 546295, 101864, 68153, 26237, 21361, 16795, 13163, 7777, 2221, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1079, 984, 984, 984, 980, 963, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B2.0, 124546786, 378546, 62159, 29507, 19365, 11570, 9088, 7347, 4877, 2555, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 OreR.0, 123882646, 1455901, 110977, 41297, 26536, 18860, 16837, 10593, 9540, 7626, 6886, 2753, 1149, 1148, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A4.0, 125902970, 728215, 102442, 38775, 31947, 23603, 20789, 19429, 19257, 18182, 16511, 16230, 15223, 14173, 12430, 12253, 11649, 11344, 11344, 11282, 11005, 11005, 11004, 10989, 10975, 10821, 10643, 10514, 10414, 9865, 9830, 9744, 9632, 9546, 9516, 9172, 7919, 7746, 7099, 6242, 5462, 5319, 5244, 4994, 4804, 4799, 4793, 3782, 1145, 1139 A7.0, 123988573, 8054250, 192534, 35374, 22160, 16626, 6701, 2023, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A3.0, 123724180, 297536, 73561, 29458, 13545, 11026, 10495, 10015, 8596, 5319, 643, 402, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B1.0, 123550399, 985626, 572795, 27539, 18301, 12006, 10063, 9216, 7914, 5636, 4454, 3090, 1934, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 same stats on your old hal:

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times, sitesCovered7Times, sitesCovered8Times, sitesCovered9Times, sitesCovered10Times, sitesCovered11Times, sitesCovered12Times, sitesCovered13Times, sitesCovered14Times, sitesCovered15Times, sitesCovered16Times, sitesCovered17Times, sitesCovered18Times, sitesCovered19Times, sitesCovered20Times, sitesCovered21Times, sitesCovered22Times, sitesCovered23Times, sitesCovered24Times, sitesCovered25Times, sitesCovered26Times, sitesCovered27Times, sitesCovered28Times, sitesCovered29Times, sitesCovered30Times, sitesCovered31Times, sitesCovered32Times, sitesCovered33Times, sitesCovered34Times dm6, 143726002, 11905239, 5693833, 3535963, 2421423, 1880938, 1531096, 1189748, 898476, 706578, 465678, 323140, 248956, 212010, 180328, 169498, 158682, 152902, 136954, 120823, 107063, 70901, 57701, 51468, 48156, 25281, 18755, 16919, 15939, 12053, 6773, 2557, 2013, 0 A1, 126420711, 9043144, 4131729, 2645113, 1862741, 1493233, 1112344, 916228, 785172, 649309, 504080, 366763, 283341, 221869, 129939, 97933, 86690, 64552, 34683, 26975, 10638, 8736, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A7, 125731185, 16021029, 4811886, 2919858, 2083845, 1598219, 1298722, 903386, 709684, 560750, 433185, 338887, 253943, 185213, 127096, 87730, 68563, 51784, 49396, 45572, 41791, 35582, 35507, 35342, 34572, 30934, 15524, 4474, 0, 0, 0, 0, 0, 0 B1, 124326748, 8483814, 3935722, 2537502, 1770882, 1348895, 1013293, 830211, 675049, 497174, 385168, 256642, 147304, 113992, 46774, 26481, 18366, 14245, 9590, 7975, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 OreR, 125699354, 9182860, 4105406, 2511447, 1740269, 1308313, 993799, 850781, 697100, 599472, 502379, 404611, 328433, 304765, 244465, 206394, 147217, 105877, 93195, 71862, 64058, 32566, 26585, 25695, 20481, 20476, 20434, 19757, 18441, 11694, 0, 0, 0, 0 A2, 125387584, 12444468, 4892956, 3031279, 2038738, 1429729, 1109064, 902962, 679405, 557315, 473947, 380650, 289019, 221140, 167273, 148579, 132500, 113512, 90821, 55700, 11880, 6118, 5920, 3018, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B6, 125756734, 9221871, 4165900, 2715288, 1889174, 1468449, 1080350, 864650, 724456, 606872, 493260, 389082, 303702, 234250, 173725, 135678, 108057, 84845, 58777, 44911, 10308, 5634, 5634, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 AB8, 126447727, 8746717, 4073812, 2684098, 1941119, 1430699, 1052031, 856799, 682567, 489184, 406084, 352751, 269525, 240814, 188740, 120482, 112936, 107249, 103235, 84456, 81851, 81642, 79544, 75524, 60172, 51141, 40128, 22816, 4211, 0, 0, 0, 0, 0 A4, 126466900, 8982564, 4270849, 2736124, 1975073, 1473376, 969119, 761904, 628855, 523556, 410914, 310780, 238245, 199478, 175075, 123061, 103053, 66165, 63273, 55512, 53535, 49824, 41335, 30505, 22944, 9944, 3192, 0, 0, 0, 0, 0, 0, 0 A6, 124900416, 8560316, 4051235, 2706189, 1900797, 1390177, 950909, 757923, 561043, 371172, 235931, 166204, 108446, 75777, 67229, 47172, 35636, 31397, 29660, 27015, 20206, 19542, 6520, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A3, 125167865, 8339572, 4034065, 2649409, 1919031, 1413224, 1002376, 798838, 612567, 527591, 390060, 275994, 209222, 141288, 101025, 74166, 52806, 27405, 10443, 6197, 6197, 6197, 6197, 6197, 6180, 6180, 5668, 5651, 5597, 0, 0, 0, 0, 0 A5, 125794629, 8670206, 4232586, 2800216, 1985280, 1527604, 1129919, 859920, 714749, 582414, 467555, 383480, 243164, 178693, 122226, 72472, 42151, 38191, 38179, 38179, 38035, 37990, 37902, 37288, 37182, 31524, 31432, 29628, 18211, 9338, 4198, 0, 0, 0 B2, 125523254, 8459565, 4089858, 2580985, 1840208, 1365776, 1038939, 773301, 641058, 497014, 381615, 278496, 170462, 139164, 94728, 91842, 85835, 42835, 38147, 38140, 37572, 37454, 31793, 31694, 31663, 31598, 31558, 31478, 31456, 31425, 31296, 29086, 21862, 4020 B4, 125749162, 8546919, 4075977, 2578909, 1886715, 1405123, 1073114, 856086, 709474, 532102, 367462, 243548, 144899, 99952, 60282, 36153, 33081, 25068, 25016, 23963, 6197, 6197, 6197, 6197, 6197, 6197, 6197, 5651, 5580, 3240, 0, 0, 0, 0 B3, 125836100, 8419947, 4111286, 2674598, 2003047, 1525212, 1146585, 904135, 728243, 611712, 435560, 359883, 214218, 165624, 117299, 98200, 92992, 89328, 60234, 58993, 54165, 42028, 28750, 7830, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 I haven't checked your hal, but this new one is split by reference chromosome (this is now an option in the pangenome pipeline). So no contig from any other genome is allowed to align to more than one dm6 contig.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1130550870__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!JOVQxg3ODnAf7Jxe4SkhXWpJn2XMPECDMH3bqMZBFR8ABEW-QRtQoRCB_9HeB1fOB5xtW05y8clRa3lk2XAyptA$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2N3Y54VJIGXYHLJXHVTVKVL6XANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!JOVQxg3ODnAf7Jxe4SkhXWpJn2XMPECDMH3bqMZBFR8ABEW-QRtQoRCB_9HeB1fOB5xtW05y8clRa3lkHTvrNPU$. You are receiving this because you were mentioned.

diekhans commented 2 years ago

The browser group just asked me about the problem you reported.

Something is broken in your HTTPS configuration. If you use the curl command to try to download from HAL it fails. HAL uses libcurl to access it.

tdlong @.***> writes:

Glenn:

Great. I will try and look at this ASAP. I of course will want to look at these manually.

I somehow managed to break my trackhubs, by upgrading my web server to SSL late last week. And I further managed to break my hal2assemblyhub install, when the server people migrated all my files from a file server on the cluster I owned to a pay-per-TB condo model. Updating my server and storage were both good things, and I should be able to resolve these little hiccups over the next few days.

These commands seem useful. Is there a list of hal commands somewhere? The git seems to only describe a subset. Is there a way for example to merge two hal files (if, say for example, I decided to create hal files chromosome by chromosome).

T.

tdlong commented 2 years ago

Yes, I figured out how to get the error message using curl from the command line ( it was documented !!). It is now clear what the problem is and how to fix it!

Sent from my iPhone

On May 19, 2022, at 12:15 PM, Mark Diekhans @.***> wrote:



The browser group just asked me about the problem you reported.

Something is broken in your HTTPS configuration. If you use the curl command to try to download from HAL it fails. HAL uses libcurl to access it.

tdlong @.***> writes:

Glenn:

Great. I will try and look at this ASAP. I of course will want to look at these manually.

I somehow managed to break my trackhubs, by upgrading my web server to SSL late last week. And I further managed to break my hal2assemblyhub install, when the server people migrated all my files from a file server on the cluster I owned to a pay-per-TB condo model. Updating my server and storage were both good things, and I should be able to resolve these little hiccups over the next few days.

These commands seem useful. Is there a list of hal commands somewhere? The git seems to only describe a subset. Is there a way for example to merge two hal files (if, say for example, I decided to create hal files chromosome by chromosome).

T. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

glennhickey commented 2 years ago

@tdlong Your best bet to list all the tools is to install cactus, look in cactus/bin/hal*. There you can find halMergeChroms which can combine chromosome hals. hal2assemblyHub.py should also work in the cactus install provided you've set PYTHONPATH as described in the install instructions.

tdlong commented 2 years ago

Glenn:

I fixed my technical problems (they were totally independent of this, but a perfect storm none-the-less).

#########################################

I ran hal2assemblyhub and looked at a big zoom out (not shown). And some hard regions.

Here is a session for your new alignments.

http://genome.ucsc.edu/s/tdlong%40uci.edu/fly%2Dpgdm6 @.***/fly-pgdm6>

Here is the corresponding region with the old hal file

http://genome.ucsc.edu/s/tdlong%40uci.edu/old_fly @.***/old_fly>

I seems to look more or less the same.

On May 18, 2022, at 2:08 PM, Glenn Hickey @.***> wrote:

Here is a (very preliminary) hal from the new approach: http://public.gi.ucsc.edu/~hickey/debug/fly-pg.hal https://urldefense.com/v3/__http://public.gi.ucsc.edu/*hickey/debug/fly-pg.hal__;fg!!CzAuKJ42GuquVTTmVmPViYEvSg!JOVQxg3ODnAf7Jxe4SkhXWpJn2XMPECDMH3bqMZBFR8ABEW-QRtQoRCB_9HeB1fOB5xtW05y8clRa3lku3ub12s$ Annoyingly, it adds ".0" to all the genome names except dm6, but this is fairly easily fixed (ie with halRenameGenomes). Likewise there is a virtual "MINIGRAPH" genome in there that can be safely removed (halRemoveGenome).

The main difference with regular cactus is that it keeps a reference genome (dm6 here) uncollapsed. This makes variants easier to reason about as each site on the reference corresponds to a unique location in the alignment/graph. This comes at the cost of paralogous alignments (ie segdeupes) not being directly represented. This comes out in the coverage stats, where B6 has about 1mb less coverage with the pangenome mode:

pangenome coverage:

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times, sitesCovered7Times, sitesCovered8Times, sitesCovered9Times, sitesCovered10Times, sitesCovered11Times, sitesCovered12Times, sitesCovered13Times, sitesCovered14Times, sitesCovered15Times, sitesCovered16Times, sitesCovered17Times, sitesCovered18Times, sitesCovered19Times, sitesCovered20Times, sitesCovered21Times, sitesCovered22Times, sitesCovered23Times, sitesCovered24Times, sitesCovered25Times, sitesCovered26Times, sitesCovered27Times, sitesCovered28Times, sitesCovered29Times, sitesCovered30Times, sitesCovered31Times, sitesCovered32Times, sitesCovered33Times, sitesCovered34Times, sitesCovered35Times, sitesCovered36Times, sitesCovered37Times, sitesCovered38Times, sitesCovered39Times, sitesCovered40Times, sitesCovered41Times, sitesCovered42Times, sitesCovered43Times, sitesCovered44Times, sitesCovered45Times, sitesCovered46Times, sitesCovered47Times, sitesCovered48Times, sitesCovered49Times, sitesCovered50Times dm6, 143221210, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B4.0, 124228214, 393823, 92418, 65856, 43950, 32022, 24741, 17001, 15238, 12889, 10342, 6986, 4090, 858, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A2.0, 123720389, 4295095, 222853, 51246, 33387, 18682, 11938, 9475, 8427, 6818, 5633, 5265, 1975, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B6.0, 124630181, 1317027, 119448, 68273, 48901, 37279, 28326, 16813, 7413, 809, 381, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A6.0, 123358333, 556219, 53676, 23696, 14462, 7877, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 AB8.0, 125467578, 526404, 128809, 70272, 44368, 30049, 21089, 18594, 18208, 17572, 15857, 10968, 10802, 10488, 10053, 9388, 8869, 8337, 7966, 7726, 6714, 6680, 1493, 1148, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B3.0, 124177497, 334475, 102589, 35099, 13960, 8168, 3892, 1056, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 MINIGRAPH, 142533906, 303545, 96080, 46878, 25252, 14429, 7199, 2270, 257, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A1.0, 124845251, 745861, 116697, 49850, 32775, 27948, 21313, 19757, 18183, 16841, 14906, 7842, 5804, 1149, 1142, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A5.0, 124859260, 546295, 101864, 68153, 26237, 21361, 16795, 13163, 7777, 2221, 1080, 1080, 1080, 1080, 1080, 1080, 1080, 1079, 984, 984, 984, 980, 963, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B2.0, 124546786, 378546, 62159, 29507, 19365, 11570, 9088, 7347, 4877, 2555, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 OreR.0, 123882646, 1455901, 110977, 41297, 26536, 18860, 16837, 10593, 9540, 7626, 6886, 2753, 1149, 1148, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A4.0, 125902970, 728215, 102442, 38775, 31947, 23603, 20789, 19429, 19257, 18182, 16511, 16230, 15223, 14173, 12430, 12253, 11649, 11344, 11344, 11282, 11005, 11005, 11004, 10989, 10975, 10821, 10643, 10514, 10414, 9865, 9830, 9744, 9632, 9546, 9516, 9172, 7919, 7746, 7099, 6242, 5462, 5319, 5244, 4994, 4804, 4799, 4793, 3782, 1145, 1139 A7.0, 123988573, 8054250, 192534, 35374, 22160, 16626, 6701, 2023, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A3.0, 123724180, 297536, 73561, 29458, 13545, 11026, 10495, 10015, 8596, 5319, 643, 402, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B1.0, 123550399, 985626, 572795, 27539, 18301, 12006, 10063, 9216, 7914, 5636, 4454, 3090, 1934, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 same stats on your old hal:

Genome, sitesCovered1Times, sitesCovered2Times, sitesCovered3Times, sitesCovered4Times, sitesCovered5Times, sitesCovered6Times, sitesCovered7Times, sitesCovered8Times, sitesCovered9Times, sitesCovered10Times, sitesCovered11Times, sitesCovered12Times, sitesCovered13Times, sitesCovered14Times, sitesCovered15Times, sitesCovered16Times, sitesCovered17Times, sitesCovered18Times, sitesCovered19Times, sitesCovered20Times, sitesCovered21Times, sitesCovered22Times, sitesCovered23Times, sitesCovered24Times, sitesCovered25Times, sitesCovered26Times, sitesCovered27Times, sitesCovered28Times, sitesCovered29Times, sitesCovered30Times, sitesCovered31Times, sitesCovered32Times, sitesCovered33Times, sitesCovered34Times dm6, 143726002, 11905239, 5693833, 3535963, 2421423, 1880938, 1531096, 1189748, 898476, 706578, 465678, 323140, 248956, 212010, 180328, 169498, 158682, 152902, 136954, 120823, 107063, 70901, 57701, 51468, 48156, 25281, 18755, 16919, 15939, 12053, 6773, 2557, 2013, 0 A1, 126420711, 9043144, 4131729, 2645113, 1862741, 1493233, 1112344, 916228, 785172, 649309, 504080, 366763, 283341, 221869, 129939, 97933, 86690, 64552, 34683, 26975, 10638, 8736, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A7, 125731185, 16021029, 4811886, 2919858, 2083845, 1598219, 1298722, 903386, 709684, 560750, 433185, 338887, 253943, 185213, 127096, 87730, 68563, 51784, 49396, 45572, 41791, 35582, 35507, 35342, 34572, 30934, 15524, 4474, 0, 0, 0, 0, 0, 0 B1, 124326748, 8483814, 3935722, 2537502, 1770882, 1348895, 1013293, 830211, 675049, 497174, 385168, 256642, 147304, 113992, 46774, 26481, 18366, 14245, 9590, 7975, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 OreR, 125699354, 9182860, 4105406, 2511447, 1740269, 1308313, 993799, 850781, 697100, 599472, 502379, 404611, 328433, 304765, 244465, 206394, 147217, 105877, 93195, 71862, 64058, 32566, 26585, 25695, 20481, 20476, 20434, 19757, 18441, 11694, 0, 0, 0, 0 A2, 125387584, 12444468, 4892956, 3031279, 2038738, 1429729, 1109064, 902962, 679405, 557315, 473947, 380650, 289019, 221140, 167273, 148579, 132500, 113512, 90821, 55700, 11880, 6118, 5920, 3018, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 B6, 125756734, 9221871, 4165900, 2715288, 1889174, 1468449, 1080350, 864650, 724456, 606872, 493260, 389082, 303702, 234250, 173725, 135678, 108057, 84845, 58777, 44911, 10308, 5634, 5634, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 AB8, 126447727, 8746717, 4073812, 2684098, 1941119, 1430699, 1052031, 856799, 682567, 489184, 406084, 352751, 269525, 240814, 188740, 120482, 112936, 107249, 103235, 84456, 81851, 81642, 79544, 75524, 60172, 51141, 40128, 22816, 4211, 0, 0, 0, 0, 0 A4, 126466900, 8982564, 4270849, 2736124, 1975073, 1473376, 969119, 761904, 628855, 523556, 410914, 310780, 238245, 199478, 175075, 123061, 103053, 66165, 63273, 55512, 53535, 49824, 41335, 30505, 22944, 9944, 3192, 0, 0, 0, 0, 0, 0, 0 A6, 124900416, 8560316, 4051235, 2706189, 1900797, 1390177, 950909, 757923, 561043, 371172, 235931, 166204, 108446, 75777, 67229, 47172, 35636, 31397, 29660, 27015, 20206, 19542, 6520, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 A3, 125167865, 8339572, 4034065, 2649409, 1919031, 1413224, 1002376, 798838, 612567, 527591, 390060, 275994, 209222, 141288, 101025, 74166, 52806, 27405, 10443, 6197, 6197, 6197, 6197, 6197, 6180, 6180, 5668, 5651, 5597, 0, 0, 0, 0, 0 A5, 125794629, 8670206, 4232586, 2800216, 1985280, 1527604, 1129919, 859920, 714749, 582414, 467555, 383480, 243164, 178693, 122226, 72472, 42151, 38191, 38179, 38179, 38035, 37990, 37902, 37288, 37182, 31524, 31432, 29628, 18211, 9338, 4198, 0, 0, 0 B2, 125523254, 8459565, 4089858, 2580985, 1840208, 1365776, 1038939, 773301, 641058, 497014, 381615, 278496, 170462, 139164, 94728, 91842, 85835, 42835, 38147, 38140, 37572, 37454, 31793, 31694, 31663, 31598, 31558, 31478, 31456, 31425, 31296, 29086, 21862, 4020 B4, 125749162, 8546919, 4075977, 2578909, 1886715, 1405123, 1073114, 856086, 709474, 532102, 367462, 243548, 144899, 99952, 60282, 36153, 33081, 25068, 25016, 23963, 6197, 6197, 6197, 6197, 6197, 6197, 6197, 5651, 5580, 3240, 0, 0, 0, 0 B3, 125836100, 8419947, 4111286, 2674598, 2003047, 1525212, 1146585, 904135, 728243, 611712, 435560, 359883, 214218, 165624, 117299, 98200, 92992, 89328, 60234, 58993, 54165, 42028, 28750, 7830, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 I haven't checked your hal, but this new one is split by reference chromosome (this is now an option in the pangenome pipeline). So no contig from any other genome is allowed to align to more than one dm6 contig.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1130550870__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!JOVQxg3ODnAf7Jxe4SkhXWpJn2XMPECDMH3bqMZBFR8ABEW-QRtQoRCB_9HeB1fOB5xtW05y8clRa3lk2XAyptA$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2N3Y54VJIGXYHLJXHVTVKVL6XANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!JOVQxg3ODnAf7Jxe4SkhXWpJn2XMPECDMH3bqMZBFR8ABEW-QRtQoRCB_9HeB1fOB5xtW05y8clRa3lkHTvrNPU$. You are receiving this because you were mentioned.

glennhickey commented 2 years ago

Thanks for checking! And it's also a relief to see how much cleaner the new alignment looks, at least on the region your sessions are showing. It really looks like the old version is aligning the wrong copy of a segdupe or something like that, so I (and I of course can't say for sure), I'd definitely be more confident in the new one, both for the A4 gap size, and the conflicts when you zoom in.

I'm still tuning one last parameter, and will make a new cactus release (and post a corresponding hal here) once that's done -- should be this week.

Do you have any suggestsions of public short-read datasets for melagonaster samples that are not in the present dataset?

tdlong commented 2 years ago

Glenn:

A few things.

  1. This is great. It is important it is working for within species alignments. I am excited for a new release.

  2. I agree there are some better aspects of the newer representation.

  3. Oddly B7 seems to be missing from the hal you gave me the link for. It was in the tar archive?? I don't know how it was dropped?

http://wfitch.bio.uci.edu/~tdlong/genomes_5.tar.gz http://wfitch.bio.uci.edu/~tdlong/genomes_5.tar.gz

module load hdf5/1.10.5/gcc.4.8.5 h5ls fly-pg.hal

Doesn't show B7

  1. Remember, tI just extracted a single chromosome arm (=chr2L). Drosophila have 5 such arms roughly equal in size that account for something like 98% of their genome (in terms of gene content) = X, 2L, 2R, 3L, 3R. So we ran progressive cactus ~20% of the genome. In my hands this resulted in a greatly accelerated progressive cactus run, so I felt it more appropriate to share than the entire genome. (This would be easy to change!).

  2. There are lots of short read datasets in the public domain. The fly community is really good about this. This is a widely used collection of 200 strains that Trudy Mackay sequenced some years back call the "DGRP". These are ~200 sequenced inbred lines used for GWAS-type dissection of complex traits. (http://dgrp2.gnets.ncsu.edu/). It is not the only such collection, but is the most well-known and studied.

In what form would you want these data? Raw fastq.gz? I can work on getting you this. It is all in the public domain, but most people don't want raw data (just the "SNP table").

  1. But, if you did get raw fastq.gz, those reads would cover the entire genome (not just chr2L). This may be perfectly fine, but may be problematic if you want to make statements about %mappable reads. Although I guess for a paper you would need to compare to just dm6 (the reference) for chromosome 2L.

On May 23, 2022, at 6:42 AM, Glenn Hickey @.***> wrote:

Thanks for checking! And it's also a relief to see how much cleaner the new alignment looks, at least on the region your sessions are showing. It really looks like the old version is aligning the wrong copy of a segdupe or something like that, so I (and I of course can't say for sure), I'd definitely be more confident in the new one, both for the A4 gap size, and the conflicts when you zoom in.

I'm still tuning one last parameter, and will make a new cactus release (and post a corresponding hal here) once that's done -- should be this week.

Do you have any suggestsions of public short-read datasets for melagonaster samples that are not in the present dataset?

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1134696420__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!IsRU8OGfFdcp0x-iIc8LBvTv_XoDlNtxvfaksCWNea6vlqh0nIcelkJqtvemSQdSRsvVufgq8FPzpL5m-nFH_EM$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2NZBVYFQBXOHHFLNET3VLODLVANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!IsRU8OGfFdcp0x-iIc8LBvTv_XoDlNtxvfaksCWNea6vlqh0nIcelkJqtvemSQdSRsvVufgq8FPzpL5mJMHuWi8$. You are receiving this because you were mentioned.

glennhickey commented 2 years ago

I extracted the input sequences from https://wfitch.bio.uci.edu/~tdlong/SantaCruzTracks/DSPR2017/fly.hal using hal2fasta in order to make http://public.gi.ucsc.edu/~hickey/debug/fly-pg.hal. So that pangenome HAL contains all chromosomes, but not sample B7 (which wasn't in your original hal). I should have mentioned all that before!

And to clarify, moving forward I'm interested in full genome results. That new alignment I shared only took a couple hours on 32 cores to produce, end-to-end.

So this is also a good time to ask. Do you have links to the full assemblies, ideally repeatmasked (required only to compare with the standard cactus) with nice chromosome names, that I can use for creating a "publishable" alignment?

For the short reads, the DGRP dataset looks great. And raw fastq.gz would indeed be what I'd be looking for. The idea would be to map each sample to the pangenome graph in order to genotype structural variation.

tdlong commented 2 years ago

On May 23, 2022, at 11:15 AM, Glenn Hickey @.***> wrote:

I extracted the input sequences from https://wfitch.bio.uci.edu/~tdlong/SantaCruzTracks/DSPR2017/fly.hal https://wfitch.bio.uci.edu/~tdlong/SantaCruzTracks/DSPR2017/fly.hal using hal2fasta in order to make http://public.gi.ucsc.edu/~hickey/debug/fly-pg.hal https://urldefense.com/v3/__http://public.gi.ucsc.edu/*hickey/debug/fly-pg.hal__;fg!!CzAuKJ42GuquVTTmVmPViYEvSg!KHjRnsYtLahIF4KI_cfBBhTgJYzgcQAr8ixwoRx6kMSPeY3ZxYHYoMlZoHXH7wd5flM5uXziXkQqzIdXj42VYPE$. So that pangenome HAL contains all chromosomes, but not sample B7 (which wasn't in your original hal). I should have mentioned all that before!

I figured something like this. And to clarify, moving forward I'm interested in full genome results. That new alignment I shared only took a couple hours on 32 cores to produce, end-to-end.

So this is also a good time to ask. Do you have links to the full assemblies, ideally repeatmasked (required only to compare with the standard cactus) with nice chromosome names, that I can use for creating a "publishable" alignment?

The key citation for these assemblies is: https://pubmed.ncbi.nlm.nih.gov/31653862/

link to genomes = http://wfitch.bio.uci.edu/~tdlong/fly_genomes.tar.gz

dm6 is the standard reference genome (i.e, used by SCGB). iso1 is a resequenced version of the same strain used for the reference assembly (an assembly control in a way). dm6 and iso1 should be really really similar.

There are two files in the “fly_genomes” directory = fly.txt and softmask.fly.txt

They list the raw or soft masked genome files respectively. I left the other output from repeat masker in the directory, but you can likely ignore.

These sequenced strains are highly inbred, we think greater than 95% of the genome is homozygous in all strains and much greater in most.

For the short reads, the DGRP dataset looks great. And raw fastq.gz would indeed be what I'd be looking for. The idea would be to map each sample to the pangenome graph in order to genotype structural variation.

The key citation is this paper: https://genome.cshlp.org/content/24/7/1193.long https://genome.cshlp.org/content/24/7/1193.long

It includes a rather complex list of ~200 accessions as Supplementary Data table 1. I filtered the complete list to include "only" 100 lines. These are the lines that have >15X genome coverage, are represented as a single SRA accession, and only Illumina read technology. This was a Baylor genome project and back in the day they tended to throw multiple technologies at a problem. I assume you don't want the hassle of multiple accessions mapping to a single sample nor non-illumina data.

Even with this list the Illumina sequencing is a mix of read-lengths (albeit longer reads). I would rather work with a clean if abridged dataset. But let me know.

This spreadsheet will allow you to easily script downloading the raw data from SRA.

These strains were inbred before sequencing, but not as much as the strains used for de novo sequencing above. There is considerable variation in how well inbred they are. I guess 80% are more than 80% inbred. So for sure there are large Mb-sized hunks that are not homozygous in many samples. Several of the strains are have large inversions (that have been genotyped via PCR or cytology). Turns out that is a “thing” in flies. I could share those genotypes .. or you can see how you do...

###########

I will be interested in how you do. The idea of annotating SVs from short reads is appealing, but hard. Especially in flies many "structural variants" are TE insertions ... and the presence of a TE at any given location tends to be rare in a population — in many case these events are unique in samples of 50+ flies. So most TE insertion events in the 100 short read genomes above are unlikely to be present in the 15 sequenced genomes. We observed in our paper describing the de novo assembled genomes that this is more generally the case -- the majority of all structural variants have low population frequencies (we argued that they are often slightly deleterious so natural selection doesn't let them increase in frequency a whole lot) and are only in a single sequenced genome.

This is almost so true that if a population geneticists sees a TE inserted at the same location in several flies they immediately think one of 3 things.

  1. You are looking at a centromeric or telomeric heterochromatin regions (or Y), where deleterious things can reach high frequencies.
  2. The samples are in fact related to one another, and you are looking at a hunk of the genome inherited identical by descent (say the flies are 1st cousins).
  3. The TE is under natural selection

There are a handful of well known examples of TEs are high frequencies, in all cases I know of they are in insecticide resistance genes in flies!! (I guess farmers were spraying orchards in the past!).

T.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1134989858__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!KHjRnsYtLahIF4KI_cfBBhTgJYzgcQAr8ixwoRx6kMSPeY3ZxYHYoMlZoHXH7wd5flM5uXziXkQqzIdXhjzTs2k$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2N2ZZPUMOXN3IFWIURDVLPDN5ANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!KHjRnsYtLahIF4KI_cfBBhTgJYzgcQAr8ixwoRx6kMSPeY3ZxYHYoMlZoHXH7wd5flM5uXziXkQqzIdXCxNDTxQ$. You are receiving this because you were mentioned.

glennhickey commented 2 years ago

First off, thanks so much for the tarball of assemblies. This looks exactly like what I need.

For the short reads: you refer to a spreadsheet of links, but I don't see it. If you included it as an email attachment then github probably ate it. Would it be possible for your to email me directly?

Thank you for the explanation on fly SVs (which I understand a bit more after reading your paper). I will temper my expectations from short-read genotyping accordingly, but will run a few samples ASAP to see if there's any signal at all.

tdlong commented 2 years ago

Send me your email, I do not think I have it

On May 24, 2022, at 1:15 PM, Glenn Hickey @.***> wrote:

First off, thanks so much for the tarball of assemblies. This looks exactly like what I need.

For the short reads: you refer to a spreadsheet of links, but I don't see it. If you included it as an email attachment then github probably ate it. Would it be possible for your to email me directly?

Thank you for the explanation on fly SVs (which I understand a bit more after reading your paper). I will temper my expectations from short-read genotyping accordingly, but will run a few samples ASAP to see if there's any signal at all.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1136389372__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!PrXgsb2WHGE9kYcuKUPlGjSLmZNcXz1ZaBehVqAAiTYlPB6TpY5L5uTpfMnhY7qxrZAHLZi5dpDlZADdqj13Jwk$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2N2PDXEBEIJLVXJM4DDVLU2GZANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!PrXgsb2WHGE9kYcuKUPlGjSLmZNcXz1ZaBehVqAAiTYlPB6TpY5L5uTpfMnhY7qxrZAHLZi5dpDlZADd_XB3I5o$. You are receiving this because you were mentioned.

glennhickey commented 2 years ago

Sure, my github username, glennhickey, at gmail.com thanks

On Tue, May 24, 2022 at 5:06 PM tdlong @.***> wrote:

Send me your email, I do not think I have it

On May 24, 2022, at 1:15 PM, Glenn Hickey @.***> wrote:

First off, thanks so much for the tarball of assemblies. This looks exactly like what I need.

For the short reads: you refer to a spreadsheet of links, but I don't see it. If you included it as an email attachment then github probably ate it. Would it be possible for your to email me directly?

Thank you for the explanation on fly SVs (which I understand a bit more after reading your paper). I will temper my expectations from short-read genotyping accordingly, but will run a few samples ASAP to see if there's any signal at all.

— Reply to this email directly, view it on GitHub < https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1136389372__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!PrXgsb2WHGE9kYcuKUPlGjSLmZNcXz1ZaBehVqAAiTYlPB6TpY5L5uTpfMnhY7qxrZAHLZi5dpDlZADdqj13Jwk$>, or unsubscribe < https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2N2PDXEBEIJLVXJM4DDVLU2GZANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!PrXgsb2WHGE9kYcuKUPlGjSLmZNcXz1ZaBehVqAAiTYlPB6TpY5L5uTpfMnhY7qxrZAHLZi5dpDlZADd_XB3I5o$ . You are receiving this because you were mentioned.

— Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/hal/issues/242#issuecomment-1136431837, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG373RSGKDYN2BCCZ555GTVLVADPANCNFSM5TRJLXRA . You are receiving this because you commented.Message ID: @.***>

tdlong commented 1 year ago

Glen:

And to be totally honest we have tried various way to coerce structural variants into a VCF. For some (like TEs) I think this is effective. For other (like tandem dups) it can be more difficult. This is what I love about progressive cactus. If you are a biologist interested in gene X, those SNAKE tracks are really really good at telling you what stuff may be lurking in the data. The problem is much harder (and more pervasive) than 99% of people appreciate.

On May 13, 2022, at 11:49 AM, Tony @.***> wrote:

Glen:

This is a very good and bad result. It is good because it means I am not crazy (at least not yet). It is bound to happen one day.

It is bad, because, this should work.

These data are public. Except for one genome, but it should be public, so you can treat the data as public, certainly there are no constraints on your using it.

Here is the paper:

https://www.nature.com/articles/s41467-019-12884-1 https://www.nature.com/articles/s41467-019-12884-1

I only shared a single chromosome with you so it is fast. And in fact it may be better to do the alignments that way and merge the HALs. I could easily share more. My goal was to share less, I would like a reproducible example that runs with the least data possible.

This would be a great dataset for testing your new ideas. Years ago we tried to force the alignments into a pre-release of "vg" ... but that seemed to crash it at the time. Structural variants in these assemblies create all sort of problems for graph-based ways of representing genomes. But I know you guys have been working hard on this.

T.

On May 13, 2022, at 11:32 AM, Glenn Hickey @. @.>> wrote:

I can reprodce. Thanks for putting together a reduced dataset. For the record, I ran (with the current master @.*** https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/cactus/commit/8c56cbf8c54bf0fc215012775c79f8b3368055ec__;!!CzAuKJ42GuquVTTmVmPViYEvSg!MbpR8XsWrHpw5nggiuoZXrJAGmevgntdn_OL1G8VYGHI_olISaFYRvE8i57Qe-qTFgb7MWjbtjXyPCbKnbSP8BQ$)

Preprocessor

cactus-preprocess ./jobstore/0 ./fly.2L.txt prep/fly.2L.txt --inputNames a1 a3 a5 a7 b1 b3 b6 dm6 a2 a4 a6 ab8 b2 b4 b7 ore --realTimeLogging --logInfo --retryCount 0

Alignment

cactus-blast ./jobstore/1 prep/fly.2L.txt prep/Anc0.cigar --root Anc0 --realTimeLogging --logInfo --retryCount 0 Then I made a pairwise cigar with

grep 'id=12' prep/Anc0.cigar | grep 'id=14' > prep/Anc0.pair.cigar grep 'id=12' prep/Anc0.cigar.secondary | grep 'id=14' > prep/Anc0.pair.cigar.secondary Then aligned with

cactus-align ./jobstore/2 prep/fly.2L.txt prep/Anc0.cigar prep/default.hal --root Anc0 --realTimeLogging --logInfo cactus-align ./jobstore/2 prep/fly.2L.txt prep/Anc0.pair.cigar prep/pair.hal --root Anc0 --realTimeLogging --logInfo And if I look at the maf's in the region you metnion, I see the alignment in the pair verison but no the full

hal2maf prep/default.hal prep/default.maf --refGenome dm6 --targetGenomes b6 --refSequence chr2L --start 8680000 --length 30000 hal2maf prep/pair.hal prep/pair.maf --refGenome dm6 --targetGenomes b6 --refSequence chr2L --start 8680000 --length 30000 When I do things like turn off various CAF filters and/or BAR, I still don't see the alignment for either default or pair. This strongly implies that the alignment in question is not captured by lastz. In the pairwise case, the cactus base aligner (BAR) gets it, but not in the full case. Since you are sure this used to work, I think the most likely assumption is a regression in the base aligner. There are some slowish experiments I will run to try to see if this is the case.

As an aside: is this data public? I'm in the process of writing up the new "pangenome" mode for cactus which is designed for datasets like this (I will also test it in your data). The advantages are that it

scales better with number of genomes (ie you could do star tree of hundreds of flies) can make a graph that can be used for mapping experiments can convert to VCF, which makes reasoning about variants easier. Let me know if any of these areas interest you and you'd be willing to collaborate...

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/ComparativeGenomicsToolkit/hal/issues/242*issuecomment-1126335918__;Iw!!CzAuKJ42GuquVTTmVmPViYEvSg!MbpR8XsWrHpw5nggiuoZXrJAGmevgntdn_OL1G8VYGHI_olISaFYRvE8i57Qe-qTFgb7MWjbtjXyPCbK4qgMcn0$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AEYI2NZYQ3BGUWR3ODWMDEDVJ2N27ANCNFSM5TRJLXRA__;!!CzAuKJ42GuquVTTmVmPViYEvSg!MbpR8XsWrHpw5nggiuoZXrJAGmevgntdn_OL1G8VYGHI_olISaFYRvE8i57Qe-qTFgb7MWjbtjXyPCbKQvuyuQU$. You are receiving this because you were mentioned.