ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

cactus-maf2bigmaf script fails with 'NoneType' error #1332

Closed crinfante closed 5 months ago

crinfante commented 6 months ago

I'm running cactus-maf2bigmaf as follows:

cactus-maf2bigmaf \
  --refGenome mm39 \
  --chromSizes mm39.chrom.sizes \
  "${SLURM_JOBID}/jobstore" \
  group14.mm39.maf.gz \
  group14.mm39.bb

And the job fails at the maf2bigmaf_summary step:

[2024-04-01T11:48:16-0600] [Thread-1  ] [E] [toil.batchSystems.singleMachine] Got exit code 1 (indicating failure) from job _toil_worker maf2bigmaf_summary file:/storage/biology/projects/lab/wga/group/10666808/jobstore kind-maf2bigmaf_summary/instance-whl0wp7k.
[2024-04-01T11:48:16-0600] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'maf2bigmaf_summary' kind-maf2bigmaf_summary/instance-whl0wp7k v1
Exit reason: None
[2024-04-01T11:48:16-0600] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'maf2bigmaf_summary' kind-maf2bigmaf_summary/instance-whl0wp7k v2
[2024-04-01T11:48:16-0600] [MainThread] [W] [toil.leader] Log from job "kind-maf2bigmaf_summary/instance-whl0wp7k" follows:
=========>
    [2024-04-01T11:48:14-0600] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
    [2024-04-01T11:48:14-0600] [MainThread] [I] [toil] Running Toil version 6.0.0-0e2a07a20818e593bfdfde3cc51ca4ad809fde96 on host math-alderaan-c27.
    [2024-04-01T11:48:14-0600] [MainThread] [I] [toil.worker] Working on job 'maf2bigmaf_summary' kind-maf2bigmaf_summary/instance-whl0wp7k v1
    [2024-04-01T11:48:14-0600] [MainThread] [I] [toil.worker] Loaded body Job('maf2bigmaf_summary' kind-maf2bigmaf_summary/instance-whl0wp7k v1) from description 'maf2bigmaf_summary' kind-maf2bigmaf_summary/instance-whl0wp7k v1
    [2024-04-01T11:48:14-0600] [MainThread] [I] [toil-rt] Reading MAF file from job store to /tmp/1837266a2aad5d009892452dcb0f6f60/f932/aacf/tmphn6rxxg0/group14.mm39.maf.gz
    [2024-04-01T11:48:15-0600] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
    [2024-04-01T11:48:15-0600] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-2f35ee974b8747d6b0173782bbc196fe/group14.mm39.maf.gz' to path '/tmp/1837266a2aad5d009892452dcb0f6f60/f932/aacf/tmphn6rxxg0/group14.mm39.maf.gz'
    [2024-04-01T11:48:15-0600] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-644b30a8d6174d7cab1cc25832137824/mm39.chrom.sizes' to path '/tmp/1837266a2aad5d009892452dcb0f6f60/f932/aacf/tmphn6rxxg0/mm39.chrom_sizes'
    Traceback (most recent call last):
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/toil/worker.py", line 407, in workerScript
        job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/toil/job.py", line 2829, in _runner
        returnValues = self._run(jobGraph=None, fileStore=fileStore)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/toil/job.py", line 2746, in _run
        return self.run(fileStore)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/toil/job.py", line 2974, in run
        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/cactus/maf/cactus_maf2bigmaf.py", line 255, in maf2bigmaf_summary
        sed_scripts = get_sed_rename_scripts(work_dir, genomes_list, out_bed=True)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/cactus/maf/cactus_hal2maf.py", line 407, in get_sed_rename_scripts
        genome_set = set(genome_list)
    TypeError: 'NoneType' object is not iterable
    [2024-04-01T11:48:15-0600] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host math-alderaan-c27

I don't know how to interpret the TypeError: 'NoneType' object is not iterable. Any help would be appreciated. I'd rather not have to resort to converting the MAF stepwise by following the old UCSC Genome Browser FAQ. Thanks!

glennhickey commented 6 months ago

Can you try using --halFile <hal file> instead of --chromSizes and tell me if it works?

I think the issue is that it's finding a genome name with a . in it (which screws up bigmaf summary) and tries to work around it -- but that logic only works with a HAL input. If this is what's going on, there needs to be a better error message.
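
To illustrate the failure mode: the traceback boils down to calling set() on None, which is what happens when no genome list is available. Below is a minimal sketch of the kind of guard that would produce a clearer message. The helpers `names_with_dots` and `require_genome_list` are hypothetical names made up for this example; they are not part of Cactus, and the error text is only a suggestion based on the guess above.

```python
def names_with_dots(genome_names):
    """Return the genome names that contain a '.' character."""
    return [name for name in genome_names if "." in name]


def require_genome_list(genome_list):
    """Return the genomes as a set, failing readably when the list is missing.

    Hypothetical guard (not the actual Cactus code): without it,
    set(None) raises "TypeError: 'NoneType' object is not iterable".
    """
    if genome_list is None:
        raise ValueError(
            "genome list is missing; if a genome name contains '.', "
            "try supplying the HAL file with --halFile"
        )
    return set(genome_list)


if __name__ == "__main__":
    # Made-up genome names, for illustration only.
    print(names_with_dots(["mm39", "GCF_000001635.27", "rn7"]))
    try:
        require_genome_list(None)
    except ValueError as err:
        print(err)
```

The point of the guard is only to surface the missing input early instead of failing deep inside the rename step.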

crinfante commented 6 months ago

Now it fails at the maf conversion step with exited 255: stderr=reference sequence has to be on positive strand on line 954211. So is it a problem with the original HAL file format?

The command was:

cactus-maf2bigmaf \
  --refGenome mm39 \
  --halFile group14.hal \
  "${SLURM_JOBID}/jobstore" \
  group14.mm39.maf.gz \
  group14.mm39.bb

And the log:

[2024-04-02T09:54:29-0600] [MainThread] [I] [toil.statsAndLogging] Cactus Command: /home/biology/lab/.miniforge3/envs/cactus_align/bin/cactus-maf2bigmaf --refGenome mm39 --halFile group14-way.hal 10667863/jobstore group14-way.mm39.maf.gz group14-way.mm39.bb
[2024-04-02T09:54:29-0600] [MainThread] [I] [toil.statsAndLogging] Cactus Commit: 7286b49b264896f43cc64aa405b39f914d43f75b
[2024-04-02T09:54:29-0600] [MainThread] [I] [toil.statsAndLogging] Importing group14-way.mm39.maf.gz
[2024-04-02T09:54:35-0600] [MainThread] [I] [toil.statsAndLogging] Importing group14-way.hal
[2024-04-02T09:54:35-0600] [MainThread] [I] [toil] Running Toil version 6.0.0-0e2a07a20818e593bfdfde3cc51ca4ad809fde96 on host math-alderaan-c12.
[2024-04-02T09:54:35-0600] [MainThread] [I] [toil.realtimeLogger] Starting real-time logging.
[2024-04-02T09:54:35-0600] [MainThread] [I] [toil.leader] Issued job 'maf2bigmaf_workflow' kind-maf2bigmaf_workflow/instance-_742whsu v1 with job batch system ID: 1 and disk: 2.0 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
[2024-04-02T09:54:36-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:54:36.358197: Running the command: "mafToBigMaf"
[2024-04-02T09:54:36-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:54:36.533961: Running the command: "bedToBigBed"
[2024-04-02T09:54:36-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:54:36.758923: Running the command: "hgLoadMafSummary"
[2024-04-02T09:54:37-0600] [MainThread] [I] [toil.leader] 0 jobs are running, 0 jobs are issued and waiting to run
[2024-04-02T09:54:37-0600] [MainThread] [I] [toil.leader] Issued job 'maf2bigmaf_chrom_sizes' kind-maf2bigmaf_chrom_sizes/instance-cq3omh_8 v1 with job batch system ID: 2 and disk: 16.7 Gi, memory: 2.0 Gi, cores: 1, accelerators: [], preemptible: False
[2024-04-02T09:54:37-0600] [MainThread] [I] [toil-rt] Reading HAL file from job store to /tmp/69baea9fc96d57f4ba92bc8ba4d22355/abf5/a62f/tmp8a517twi/group14-way.hal
[2024-04-02T09:54:56-0600] [MainThread] [I] [toil-rt] Computing chromosome sizes
[2024-04-02T09:54:56-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:54:56.040814: Running the command: "halStats /tmp/69baea9fc96d57f4ba92bc8ba4d22355/abf5/a62f/tmp8a517twi/group14-way.hal --chromSizes mm39"
[2024-04-02T09:54:56-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:54:56.411228: Successfully ran: "halStats /tmp/69baea9fc96d57f4ba92bc8ba4d22355/abf5/a62f/tmp8a517twi/group14-way.hal --chromSizes mm39" in 0.2696 seconds
[2024-04-02T09:54:56-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:54:56.411830: Running the command: "halStats --genomes /tmp/69baea9fc96d57f4ba92bc8ba4d22355/abf5/a62f/tmp8a517twi/group14-way.hal"
[2024-04-02T09:54:56-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:54:56.455361: Successfully ran: "halStats --genomes /tmp/69baea9fc96d57f4ba92bc8ba4d22355/abf5/a62f/tmp8a517twi/group14-way.hal" in 0.0152 seconds
[2024-04-02T09:54:56-0600] [MainThread] [I] [toil.leader] Issued job 'maf2bigmaf' kind-maf2bigmaf/instance-o4cr7qph v1 with job batch system ID: 3 and disk: 4.7 Gi, memory: 4.0 Gi, cores: 1, accelerators: [], preemptible: False
[2024-04-02T09:54:56-0600] [MainThread] [I] [toil.leader] Issued job 'maf2bigmaf_summary' kind-maf2bigmaf_summary/instance-ejt_h2f_ v1 with job batch system ID: 4 and disk: 1.4 Gi, memory: 4.0 Gi, cores: 1, accelerators: [], preemptible: False
[2024-04-02T09:54:56-0600] [Thread-4  ] [W] [toil.statsAndLogging] Got message from job at time 04-02-2024 09:54:56: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job 'maf2bigmaf_chrom_sizes' kind-maf2bigmaf_chrom_sizes/instance-cq3omh_8 v1 used 100.00% disk (16.7 GiB [17966690304B] used, 16.7 GiB [17966684624B] requested).
[2024-04-02T09:54:57-0600] [MainThread] [I] [toil-rt] Reading MAF file from job store to /tmp/69baea9fc96d57f4ba92bc8ba4d22355/65ef/b252/tmpb02u1rq5/group14-way.mm39.maf.gz
[2024-04-02T09:54:57-0600] [MainThread] [I] [toil-rt] Reading MAF file from job store to /tmp/69baea9fc96d57f4ba92bc8ba4d22355/f918/9ddf/tmp3yglfowq/group14-way.mm39.maf.gz
[2024-04-02T09:54:58-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:54:58.178682: Running the command: "bash -c set -eo pipefail && gzip -dc /tmp/69baea9fc96d57f4ba92bc8ba4d22355/65ef/b252/tmpb02u1rq5/group14-way.mm39.maf.gz | mafDuplicateFilter -km - | hgLoadMafSummary -minSeqSize=1 -test mm39 bigMafSummary stdin"
[2024-04-02T09:55:12-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:55:12.655398: Running the command: "bash -c set -eo pipefail && gzip -dc /tmp/69baea9fc96d57f4ba92bc8ba4d22355/f918/9ddf/tmp3yglfowq/group14-way.mm39.maf.gz | mafDuplicateFilter -km - | mafToBigMaf mm39 stdin stdout | sort -k1,1 -k2,2n"
[2024-04-02T09:55:26-0600] [Thread-1  ] [E] [toil.batchSystems.singleMachine] Got exit code 1 (indicating failure) from job _toil_worker maf2bigmaf file:/data002/scratch/lab/wga/group/10667863/jobstore kind-maf2bigmaf/instance-o4cr7qph.
[2024-04-02T09:55:26-0600] [MainThread] [W] [toil.leader] Job failed with exit value 1: 'maf2bigmaf' kind-maf2bigmaf/instance-o4cr7qph v1
Exit reason: None
[2024-04-02T09:55:26-0600] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'maf2bigmaf' kind-maf2bigmaf/instance-o4cr7qph v2
[2024-04-02T09:55:26-0600] [MainThread] [W] [toil.leader] Log from job "kind-maf2bigmaf/instance-o4cr7qph" follows:
=========>
    [2024-04-02T09:54:57-0600] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
    [2024-04-02T09:54:57-0600] [MainThread] [I] [toil] Running Toil version 6.0.0-0e2a07a20818e593bfdfde3cc51ca4ad809fde96 on host math-alderaan-c12.
    [2024-04-02T09:54:57-0600] [MainThread] [I] [toil.worker] Working on job 'maf2bigmaf' kind-maf2bigmaf/instance-o4cr7qph v1
    [2024-04-02T09:54:57-0600] [MainThread] [I] [toil.worker] Loaded body Job('maf2bigmaf' kind-maf2bigmaf/instance-o4cr7qph v1) from description 'maf2bigmaf' kind-maf2bigmaf/instance-o4cr7qph v1
    [2024-04-02T09:54:57-0600] [MainThread] [I] [toil-rt] Reading MAF file from job store to /tmp/69baea9fc96d57f4ba92bc8ba4d22355/f918/9ddf/tmp3yglfowq/group14-way.mm39.maf.gz
    [2024-04-02T09:55:12-0600] [MainThread] [I] [toil-rt] 2024-04-02 09:55:12.655398: Running the command: "bash -c set -eo pipefail && gzip -dc /tmp/69baea9fc96d57f4ba92bc8ba4d22355/f918/9ddf/tmp3yglfowq/group14-way.mm39.maf.gz | mafDuplicateFilter -km - | mafToBigMaf mm39 stdin stdout | sort -k1,1 -k2,2n"
    [2024-04-02T09:55:25-0600] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
    [2024-04-02T09:55:25-0600] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-8cec0a3771684393aeb73a9ceea2a54e/group14-way.mm39.maf.gz' to path '/tmp/69baea9fc96d57f4ba92bc8ba4d22355/f918/9ddf/tmp3yglfowq/group14-way.mm39.maf.gz'
    [2024-04-02T09:55:25-0600] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-maf2bigmaf_chrom_sizes/instance-cq3omh_8/file-cb3f6e93e1ee4a4e917b9094ffb39964/mm39.chrom_sizes' to path '/tmp/69baea9fc96d57f4ba92bc8ba4d22355/f918/9ddf/tmp3yglfowq/mm39.chrom_sizes'
    Traceback (most recent call last):
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/toil/worker.py", line 407, in workerScript
        job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/toil/job.py", line 2829, in _runner
        returnValues = self._run(jobGraph=None, fileStore=fileStore)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/toil/job.py", line 2746, in _run
        return self.run(fileStore)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/toil/job.py", line 2974, in run
        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/cactus/maf/cactus_maf2bigmaf.py", line 228, in maf2bigmaf
        cactus_call(parameters=bigmaf_cmd, outfile=bigmaf_bed_path)
      File "/home/biology/lab/.miniforge3/envs/cactus_align/lib/python3.8/site-packages/cactus/shared/common.py", line 906, in cactus_call
        raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
    RuntimeError: Command ['bash', '-c', 'set -eo pipefail && gzip -dc /tmp/69baea9fc96d57f4ba92bc8ba4d22355/f918/9ddf/tmp3yglfowq/group14-way.mm39.maf.gz | mafDuplicateFilter -km - | mafToBigMaf mm39 stdin stdout | sort -k1,1 -k2,2n'] exited 255: stderr=reference sequence has to be on positive strand on line 954211

    [2024-04-02T09:55:26-0600] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host math-alderaan-c12
<=========

glennhickey commented 5 months ago

Yes, this is a separate issue (#1320) that I'm trying to figure out now...
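
While that is being investigated, the affected blocks can be located with a short script. This is a rough sketch, assuming that the reference row is the first "s" line of each alignment block and that the file follows the standard MAF column order (s, src, start, size, strand, srcSize, text). The function names are made up for this example. Note that the line number in the error refers to the stream after mafDuplicateFilter, so it may not match the line number in the original file.

```python
import gzip
import sys


def open_maf(path):
    """Open a MAF file as text, transparently handling .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def negative_strand_reference_rows(lines):
    """Yield (line_number, src) for blocks whose first 's' row is on '-'.

    Assumption: the reference is the first 's' row after each 'a' line.
    """
    expecting_first_row = False
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            # A new alignment block starts here.
            expecting_first_row = True
        elif fields[0] == "s" and expecting_first_row:
            expecting_first_row = False
            # The strand is the fifth whitespace-separated column.
            if len(fields) >= 5 and fields[4] == "-":
                yield line_number, fields[1]


if __name__ == "__main__" and len(sys.argv) == 2:
    with open_maf(sys.argv[1]) as handle:
        for line_number, src in negative_strand_reference_rows(handle):
            print(f"line {line_number}: reference row {src} is on '-'")
```

Saved under any file name, it takes the MAF path as its only argument, for example group14.mm39.maf.gz.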