ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Add to subtree #123

Open RenzoTale88 opened 4 years ago

RenzoTale88 commented 4 years ago

Good morning, I'd like to perform a multiple genome alignment among 9 different genomes. For some reason, I'm getting the same error as #114, which seems to appear when aligning many genomes (an issue I can confirm, since aligning 5 genomes gave me no problems). As a workaround, I'd like to split the problem into multiple subproblems that can later be combined into a single large hal file. I start from a tree like the following (see the attached tree image, cactusTreeFull). I was thinking of aligning genomes 1 to 5 to generate a first hal file, then generating the alignment for genomes 5 to 9, and once both are ready, combining the results into a single final hal file. Before proceeding with the analyses, how do you recommend I proceed?

  1. Should I create group-specific trees and then provide a combined general tree at the end, or create a single large tree and extract the branches and distances from that? (See the example seqFile sketch below.)
  2. How do you suggest merging the data? I know of halAddToBranch, but I'm not sure about its usage.
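
For concreteness, this is roughly what I imagine the seqFile for the first group (genomes 1 to 5) would look like; the branch lengths and FASTA paths are just placeholders:

(genome_1:0.10,(genome_2:0.08,(genome_3:0.06,(genome_4:0.05,genome_5:0.05):0.02):0.02):0.02);
genome_1 /path/to/genome_1.fa
genome_2 /path/to/genome_2.fa
genome_3 /path/to/genome_3.fa
genome_4 /path/to/genome_4.fa
genome_5 /path/to/genome_5.fa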

Thank you in advance for your help,

Andrea

diekhans commented 4 years ago

Hi Andrea,

Would you be able to give us access to your HAL files? I haven't been able to create a case that reproduces this problem.

RenzoTale88 commented 4 years ago

Hi, sorry, I probably explained the problem poorly. I'm trying to generate a hal file with the nine genomes in the phylogenetic tree shown above. Since Cactus fails every time I try to run it with the whole set of 9 assemblies, I'm trying to split everything into smaller subproblems that can be run concurrently. Right now, I'm running cactus on two sets:

  1. genomes 1 to 4 (subproblem1.hal)
  2. genomes 4 to 9 (subproblem2.hal)

My question is: once I have the two separate hal files (subproblem1.hal and subproblem2.hal), how should I merge these datasets? Also, I'm running each subproblem with a tree computed using only the genomes involved in that run. Should I use the whole tree instead?

Thank you again, Andrea

diekhans commented 4 years ago

Sorry, my confusion. I think this will work:

halAppendSubtree --merge subproblem1.hal subproblem2.hal genome_4 genome_4

Please do back up subproblem1.hal first, as it is modified in place.
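
For example, something along these lines (assuming the HAL tools, including halStats, are on your PATH; the file and genome names are the ones from this thread):

cp subproblem1.hal subproblem1.hal.bak        # keep an untouched copy, since halAppendSubtree modifies its first argument in place
halAppendSubtree --merge subproblem1.hal subproblem2.hal genome_4 genome_4
halStats subproblem1.hal                      # sanity check: the merged tree should now contain all nine genomes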

What problems did you have running cactus on all assemblies?

RenzoTale88 commented 4 years ago

Thank you for your answer, this is exactly what I needed! One more question: after merging, do the ancestral genomes need to be recomputed, or will that be done automatically?

Regarding the error, I'm getting the same one shown in #114 and haven't found a solution yet.

To give you a bit more detail, I've installed Cactus through anaconda (see https://anaconda.org/bioconda/cactus). I've tried to update to the latest cactus build multiple times, but never succeeded: installing it manually and locally always led to several problems with toil and/or kyototycoon that I never managed to fully solve. I'm working on a UGE cluster environment, and conda was the only way I found to install it locally with all the dependencies working.
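
For reference, the conda install was essentially the following (the environment name is my own, and the channel order may differ on your system):

conda create -n cactus_env -c conda-forge -c bioconda cactus
conda activate cactus_env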

Thank you again for your earlier help!

Secretloong commented 4 years ago

Hi Andrea & Mark,

I'm not sure that halAppendSubtree --merge subproblem1.hal subproblem2.hal genome_4 genome_4 would work; if it does, please let me know. Thank you very much.

In my experience, you should merge at the ancestor of genome_4 and the root of subproblem2.hal. But there is no ancestor of genome_4 in subproblem1.hal, so you would have to merge at the common ancestor of genome_4 and genome_1 instead, and you would need a hal tree that includes that ancestor of genome_4 and genome_1 from subproblem1.hal. Assuming you still want to merge between genome_4 and subproblem2, I think the following would work:

  1. genomes 1 to 5 (subproblem1.hal)
  2. genomes 4 to 9, with the ancestor of 4 and 5 (named Anc004, obtained from subproblem1.hal by hal2fasta; see the sketch after this list) as the root (subproblem2.hal)
  3. cp subproblem1.hal subproblem.merge.hal
  4. halAppendSubtree subproblem.merge.hal subproblem2.hal Anc004 Anc004 --merge
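
Concretely, the ancestor-extraction step in point 2 is something like this (Anc004 is whatever name Cactus gave that ancestor in subproblem1.hal, so check your own file first):

halStats subproblem1.hal                        # list the genomes/ancestors to find the right ancestor name
hal2fasta subproblem1.hal Anc004 > Anc004.fa    # export the inferred ancestral sequence
# Anc004.fa is then used as the root genome of the genomes 4-9 run that produces subproblem2.hal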

BTW, are there any mammals among your species? So far, my impression is that the human assembly, or some other mammals, can get the ktserver stuck in Cactus (in another project, a friend of mine filtered out human and Cactus then worked well). I am also trying to split the whole tree into several smaller trees to get Cactus to finish. But there is still one hidden issue: the NumBottomSegments or NumTopSegments around the merge node shrinks considerably (by about 20%). Even though segment counts are not the same thing as aligned sequence, and the developers said the split approach "had accuracy nearly identical to the full alignment" in their preprint, I still think the best option is a full alignment of all species.
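
The comparison I am describing is roughly the following; halStats prints NumTopSegments/NumBottomSegments per genome, and the file names here are just placeholders for a full run and a split-and-merged run:

halStats full_alignment.hal       # note NumTopSegments/NumBottomSegments around the node where the trees are joined
halStats merged_subtrees.hal      # in my runs the same columns come out roughly 20% smaller near the merge node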

RenzoTale88 commented 4 years ago

Hi @Secretloong, thank you for your answer. I'm currently waiting for the two subproblems to finish; once they're done, I'll try the different approaches and see how they work.

Yes, the alignment is among several mammalian species (though none are human).

diekhans commented 4 years ago

We have successfully built an alignment with ~200 mammals. It was a pain. I suspect the ktserver problems may be memory related.

It is a high priority to replace ktserver.

Secretloong commented 4 years ago

@diekhans thanks for your hard and valuable work; Cactus is an awesome aligner. I am looking forward to the new version.

BTW, could you give us more details about the 200-mammal alignment? For example:

  1. Did you split the 200 mammals into several subgroups? If so, how many species per subgroup?
  2. What kind of machine did you use for the 200 mammals? How many CPUs and how much memory?
  3. If you ran into the ktserver problems, how did you solve them? Even manual workarounds would be helpful.
  4. Could users align a subtree job from the ktserver data generated by the full alignment?

Thank you very much!

diekhans commented 4 years ago

@diekhans thanks for your hard and valuable work; Cactus is an awesome aligner. I am looking forward to the new version.

Most of the development of cactus, and the 200-mammal alignment, was done by @joelarmstrong. He now has a real job, and I am taking up the slack part-time until we can get someone fully devoted to it.

BTW, could you give us more details about the 200-mammal alignment? For example:

  1. Did you split the 200 mammals into several subgroups? If so, how many species per subgroup?

It was done as a single alignment, although it really stretched the whole system. The 363-way bird alignment done for the B10K project was actually easier because the genomes are simpler.

  2. What kind of machine did you use for the 200 mammals? How many CPUs and how much memory?

It was run on AWS and took 1.7 million core-hours. I don't know the exact configuration of the instances used; most were small-memory instances, with a few larger-memory ones.

  3. If you ran into the ktserver problems, how did you solve them? Even manual workarounds would be helpful.

That is something Joel fought with, so I can't really tell you the details. Let us know if you hit problems and maybe we can suggest something.

  4. Could users align a subtree job from the ktserver data generated by the full alignment?

In theory, all the data is there; however, there are no commands to do this. It would be useful to have, as it would defer the cost of creating the HAL files until the very end.

RenzoTale88 commented 4 years ago

So, I've tried to split the dataset into two separate subproblems as described above. The first of the two subproblems failed with the same problem as before:

Got message from job at time 01-16-2020 08:51:49: At end of avg phase, got stats {
  "flowerName": 0,
  "totalBases": 8584079220,
  "totalEnds": 499144,
  "totalCaps": 1503990,
  "maxEndDegree": 46,
  "maxAdjacencyLength": 1521908,
  "totalBlocks": 168660,
  "totalGroups": 116928,
  "totalEdges": 688392,
  "totalFreeEnds": 136288,
  "totalAttachedEnds": 25536,
  "totalChains": 94508,
  "totalLinkGroups": 94508
}
flower name: 0 total bases: 8584079220 total-ends: 499144 total-caps: 1503990 max-end-degree: 46 max-adjacency-length: 1521908 total-blocks: 168660 total-groups: 116928 total-edges: 344196 total-free-ends: 136288 total-attached-ends: 25536 total-chains: 94508 total-link g

Job ended successfully: 'CactusBarRecursion' R/s/jobIZnoe4
Job ended successfully: 'KtServerService' P/X/jobrFjwj0
Issued job 'StartPrimaryDB' Z/G/jobec0ip1 with job batch system ID: 60790 and cores: 1, disk: 2.0 G, and memory: 3.3 G
Got message from job at time 01-16-2020 08:52:16: Job used more disk than requested. Consider modifying the user script to avoid the chance of failure due to incorrectly requested resources. Job Z/G/jobec0ip1/g/tmpvXi5zg.tmp used 1059.84% (21.2 GB [22759866368B] used, 2.0
Job ended successfully: 'StartPrimaryDB' Z/G/jobec0ip1
Issued job 'CactusReferenceCheckpoint' Q/S/jobaW2dnu with job batch system ID: 60791 and cores: 1, disk: 2.0 G, and memory: 3.3 G
Job ended successfully: 'CactusReferenceCheckpoint' Q/S/jobaW2dnu
Issued job 'StartPrimaryDB' 5/4/job6Df2xk with job batch system ID: 60792 and cores: 1, disk: 2.0 G, and memory: 3.3 G
Job ended successfully: 'StartPrimaryDB' 5/4/job6Df2xk
Issued job 'KtServerService' 9/B/job2KCiy1 with job batch system ID: 60793 and cores: 0, disk: 2.0 G, and memory: 42.5 G
Job ended successfully: 'KtServerService' 9/B/job2KCiy1
The job seems to have left a log file, indicating failure: 'KtServerService' 9/B/job2KCiy1
9/B/job2KCiy1    INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
9/B/job2KCiy1    INFO:toil:Running Toil version 3.14.0-b91dbf9bf6116879952f0a70f9a2fbbcae7e51b6.
9/B/job2KCiy1    WARNING:toil.resource:'JTRES_10e8c8e3ddc478d32aa0fe73e676fc55' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
9/B/job2KCiy1    WARNING:toil.resource:'JTRES_10e8c8e3ddc478d32aa0fe73e676fc55' may exist, but is not yet referenced by the worker (KeyError from os.environ[]).
9/B/job2KCiy1    INFO:cactus.shared.common:Running the command ['netstat', '-tuplen']
9/B/job2KCiy1    (No info could be read for "-p": geteuid()=2064601 but you should be root.)
9/B/job2KCiy1    INFO:cactus.shared.common:Running the command ['ktserver', '-port', '13327', '-ls', '-tout', '200000', '-th', '64', '-bgs', u'PATH/TO/CACTUS_SUBPROBLEM1/TMP/toil-ec71f0b0-45d7-4563-a9fd-86853f
9/B/job2KCiy1    CRITICAL:toil.lib.bioio:Error starting ktserver.
9/B/job2KCiy1    CRITICAL:toil.lib.bioio:Error starting ktserver.
9/B/job2KCiy1    INFO:cactus.shared.common:Running the command ['ktremotemgr', 'remove', '-port', '13327', '-host', '192.41.105.236', 'TERMINATE']
9/B/job2KCiy1    WARNING:toil.fileStore:LOG-TO-MASTER: Job used more disk than requested. Consider modifying the user script to avoid the chance of failure due to incorrectly requested resources. Job 5/4/job6Df2xk/g/tmppMeF2n.tmp used 370.72% (7.4 GB [7961182208B] used, 2
9/B/job2KCiy1    Traceback (most recent call last):
9/B/job2KCiy1      File "PATH/TO/myanaconda/cactus_env/lib/python2.7/site-packages/toil/worker.py", line 309, in workerScript
9/B/job2KCiy1        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
9/B/job2KCiy1      File "PATH/TO/myanaconda/cactus_env/lib/python2.7/site-packages/toil/job.py", line 1328, in _runner
9/B/job2KCiy1        returnValues = self._run(jobGraph, fileStore)
9/B/job2KCiy1      File "PATH/TO/myanaconda/cactus_env/lib/python2.7/site-packages/toil/job.py", line 1671, in _run
9/B/job2KCiy1        returnValues = self.run(fileStore)
9/B/job2KCiy1      File "PATH/TO/myanaconda/cactus_env/lib/python2.7/site-packages/toil/job.py", line 1621, in run
9/B/job2KCiy1        startCredentials = service.start(self)
9/B/job2KCiy1      File "PATH/TO/myanaconda/cactus_env/lib/python2.7/site-packages/cactus/pipeline/ktserverToil.py", line 33, in start
9/B/job2KCiy1        snapshotExportID=snapshotExportID)
9/B/job2KCiy1      File "PATH/TO/myanaconda/cactus_env/lib/python2.7/site-packages/cactus/pipeline/ktserverControl.py", line 62, in runKtserver
9/B/job2KCiy1        raise RuntimeError("Unable to launch ktserver in time. Log: %s" % log)
9/B/job2KCiy1    RuntimeError: Unable to launch ktserver in time. Log: 2020-01-16T08:52:33.698108Z: [SYSTEM]: ================ [START]: pid=161337
9/B/job2KCiy1    2020-01-16T08:52:33.698452Z: [SYSTEM]: opening a database: path=:#opts=ls#bnum=30m#msiz=50g#ktopts=p
9/B/job2KCiy1    2020-01-16T08:52:33.710297Z: [SYSTEM]: applying a snapshot file: db=0 ts=1579164561921000000 count=43959460 size=23312894938
9/B/job2KCiy1    2020-01-16T08:53:30.550681Z: [ERROR]: [DB]: :: 9: system error: too short region
9/B/job2KCiy1    2020-01-16T08:53:30.550889Z: [ERROR]: could not apply a snapshot: system error: too short region
9/B/job2KCiy1    2020-01-16T08:53:30.568695Z: [SYSTEM]: starting the server: expr=:13327
9/B/job2KCiy1    2020-01-16T08:53:30.579499Z: [SYSTEM]: server socket opened: expr=:13327 timeout=200000.0
9/B/job2KCiy1    2020-01-16T08:53:30.579549Z: [SYSTEM]: listening server socket started: fd=4
9/B/job2KCiy1    
9/B/job2KCiy1    ERROR:toil.worker:Exiting the worker because of a failed job on host node2j01.ecdf.ed.ac.uk
9/B/job2KCiy1    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'KtServerService' 9/B/job2KCiy1 with ID 9/B/job2KCiy1 to 5

Not sure what's causing it. The job is running on a single node with 20 cores. It never reaches the maximum RAM requested, and there is plenty of space available for it to write to disk (at least a couple of TB free). Subproblem 2 is still at the lastz masking stage; hopefully it will work fine. Part of this subproblem (genomes 5 to 9) was already run in a previous test of the software and finished without issues, so hopefully the addition of a single genome won't bother it. Not sure how to proceed.