ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Runtime error related to cleanTree #1010

Closed jkreinz closed 1 year ago

jkreinz commented 1 year ago

Hi there,

I'm using minigraph-cactus to produce a pangenome from 6 haplotype-level phased assemblies. I ran the command cactus-pangenome ./js samplenames.txt --outDir atub_pg/ --outName atub_pg --reference Atub_ref_193.1 --vcf --giraffe --gfa --gbz, and it ran for several hours before throwing the error below. From other issue reports, this error seems to be related to how the tree is specified in the seqfile; however, following the minigraph-cactus documentation, I never specified a tree in my seqfile, so I'm not sure what I should change to resolve this.
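For readability, the same invocation split across lines (options and paths exactly as above):

cactus-pangenome ./js samplenames.txt --outDir atub_pg/ --outName atub_pg \
    --reference Atub_ref_193.1 --vcf --giraffe --gfa --gbz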

Thanks, Julia

Traceback (most recent call last):
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/toil/worker.py", line 403, in workerScript
    job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/toil/job.py", line 2743, in _runner
    returnValues = self._run(jobGraph=None, fileStore=fileStore)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/toil/job.py", line 2660, in _run
    return self.run(fileStore)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/toil/job.py", line 2888, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/cactus/refmap/cactus_pangenome.py", line 252, in make_batch_align_jobs_wrapper
    align_jobs = make_batch_align_jobs(options, job.fileStore, job.fileStore, config_wrapper)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/cactus/setup/cactus_align.py", line 226, in make_batch_align_jobs
    chrom_align_job = make_align_job(chrom_options, toil, config_wrapper, chrom)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/cactus/setup/cactus_align.py", line 244, in make_align_job
    mc_tree, input_seq_map, og_candidates = parse_seqfile(options.seqFile, config_wrapper,
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/cactus/progressive/progressive_decomposition.py", line 41, in parse_seqfile
    seq_file = SeqFile(seqfile_path, root_name = root_name)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/cactus/progressive/seqFile.py", line 68, in __init__
    self.parseFile(path, root_name)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/cactus/progressive/seqFile.py", line 106, in parseFile
    self.cleanTree(root_name)
  File "/DATA/home/jmlazaro/Software/cactus/cactus_env/lib/python3.8/site-packages/cactus/progressive/seqFile.py", line 199, in cleanTree
    raise RuntimeError("At least two valid leaf genomes required in"
RuntimeError: At least two valid leaf genomes required in input tree
[2023-05-09T22:09:09-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host
glennhickey commented 1 year ago

Hi. This is a strange one. You can find out which contig is causing this by looking in atub_pg/chrom-subproblems/seqfiles and finding the file with only one line in it.
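For example, something along these lines (assuming the directory layout above) should list any single-line seqfiles:

wc -l atub_pg/chrom-subproblems/seqfiles/* | awk '$1 == 1 {print $2}'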

And I think the work-around would be to use --refContigs chr1 chr2 etc., listing the chromosome names of your reference sample (assuming the problem contig found above isn't a real chromosome).

I would love to be able to reproduce this in order to be able to fix it, but I haven't had any luck with contrived examples so far. Thanks for the feedback.

jkreinz commented 1 year ago

Thanks Glenn, I've started a rerun with the --refContigs option listing just the chromosome-level scaffolds (which excludes the two scaffolds that had only one line in their seqfiles). I'll let you know how it goes.

If you're interested in troubleshooting, I've temporarily uploaded the sequences I'm working with to a google drive folder here.

My seq file is as follows:

# Diploid samples
Atub_ref_193.1  /DATA/home/juliak/amaranthus/final_genomes/Atub_193_hap1.fasta
Atub_193.2      /DATA/home/juliak/amaranthus/final_genomes/Atub_193_hap2.fasta
Atub_Nune5.1      /DATA/home/juliak/amaranthus/final_genomes/Atub_Nune5_hap1.fasta
Atub_Nune5.2      /DATA/home/juliak/amaranthus/final_genomes/Atub_Nune5_hap2.fasta
Atub_Nat.1      /DATA/home/juliak/amaranthus/final_genomes/Atub_Nat_hap1.fasta
Atub_Nat.2      /DATA/home/juliak/amaranthus/final_genomes/Atub_Nat_hap2.fasta
glennhickey commented 1 year ago

Thanks for sharing the data! Running it now...

glennhickey commented 1 year ago

I can reproduce this. The problem scaffolds are Scaffold_1608 and Scaffold_1806. I will fix this in the next release, which should be soon. But you can work around it by adding

--refContigs $(for i in $(seq 16); do printf "Scaffold_$i "; done) --otherContig chrOther

This invocation will make a separate job for each of the large scaffolds (1-16 by my count) and throw all the rest of the small ones into a single job. The output graph will still contain a component for each contig, and it would be identical to the output of running without these flags, had that run not crashed.
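If it helps to sanity-check, you can preview what that command substitution expands to before running it:

echo --refContigs $(for i in $(seq 16); do printf "Scaffold_$i "; done) --otherContig chrOther

which should print --refContigs followed by Scaffold_1 through Scaffold_16 and then --otherContig chrOther.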

This is discussed a bit in the documentation

Important: The reference genome assembly must be chromosome-scale. If your reference assembly also consists of many small fragments (ex. GRCh38), then you must use the --refContigs option to specify the chromosomes, e.g. for GRCh38: --refContigs $(for i in $(seq 22); do printf "chr$i "; done ; echo "chrX chrY chrM"). If you want to include the remaining reference contig fragments in your graph, add the --otherContig chrOther option.

But I honestly never expected not doing so to lead to this weird error. Thanks for pointing it out and sharing the data. I'm rerunning with the above options to make sure it works.

If you are rerunning, I also recommend adding --indexCores 16 --mapCores 8 or something to that effect to speed things up if you have the CPUs -- the default values aren't that great right now.
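Putting that together with the original invocation from the first comment, the rerun would look something like this (the core counts are just examples; tune them to the CPUs you have available):

cactus-pangenome ./js samplenames.txt --outDir atub_pg/ --outName atub_pg --reference Atub_ref_193.1 \
    --vcf --giraffe --gfa --gbz \
    --refContigs $(for i in $(seq 16); do printf "Scaffold_$i "; done) --otherContig chrOther \
    --indexCores 16 --mapCores 8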

glennhickey commented 1 year ago

This is fixed in the latest release.