ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Job uses more disk than requested with singularity #1261

Closed syresbr closed 9 months ago

syresbr commented 9 months ago

I'm trying to run the 12/2023 South Africa Refgraph Hackathon tutorial on a cluster with my own data (just a fraction of the full dataset, to check that I'm doing things right). It only generates the GFA file, then it has a problem with disk, saying it needs more. Reviewing the logs, this doesn't happen in the main thread, only in Thread-4. It shows this error:

    [Thread-4 (statsAndLoggingAggregator)] [W] [toil.statsAndLogging] Got message from job at time 01-08-2024 14:08:24: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job files/for-job/kind-odgi_squeeze/instance-6ksil339/cleanup/file-cee5c12f46104224972c79efec5f5a07/stream used 103.49% disk (1.3 GiB [1406898176B] used, 1.3 GiB [1359448456B] requested).

My code is this:

#!/bin/bash -l

#$ -q all.q
#$ -cwd
#$ -pe smp 22
#$ -t 1

module load singularity-ce/3.11.2

singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.13 ca$

Thank you for your patience.

glennhickey commented 9 months ago

That message is a warning [W], not an error, and should not cause a failure in most environments (I can see how it's confusing). If your job is failing, please share the whole log and maybe we can see where the real issue is.
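To put the warning in perspective, here is a minimal sketch of the arithmetic behind it. The function names `pct_used` and `overage_mib` are made up for illustration; the byte counts are the ones quoted in the warning above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: not part of Toil or Cactus.

# Percentage of the requested disk that the job actually used.
# Arguments: bytes used, bytes requested.
pct_used() {
    awk -v used="$1" -v req="$2" 'BEGIN { printf "%.2f\n", 100 * used / req }'
}

# How far over the request the job went, in MiB (1 MiB = 1048576 bytes).
overage_mib() {
    awk -v used="$1" -v req="$2" 'BEGIN { printf "%.1f\n", (used - req) / 1048576 }'
}

# Byte counts copied from the warning message.
pct_used 1406898176 1359448456
overage_mib 1406898176 1359448456
```

The job went over its request by a few tens of MiB on a 1.3 GiB request, which is why Toil only logs it rather than stopping the run.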

syresbr commented 9 months ago

Here is the log: teste.log

glennhickey commented 9 months ago

This is getting through most of the pipeline, then failing when creating the VCF.

    RuntimeError: Command /usr/bin/time -f "CACTUS-LOGGED-MEMORY-IN-KB: %M" bash -c 'set -eo pipefail && vg deconstruct /home/gustavo.carvalho/pangenome/teste/f31aa1dfa6265f7da22db6c159ea7606/7d7c/dc82/tmpvjtsnh0y/clip.teste.gbz -P LA-PurpleR -C -a -r /home/gustavo.carvalho/pangenome/teste/f31aa1dfa6265f7da22db6c159ea7606/7d7c/dc82/tmpvjtsnh0y/clip.teste.snarls -t 47 -O | bgzip --threads 47' exited 1: stderr=Error [vg deconstruct]: No specified reference path or prefix found in graph

In particular, it's complaining that LA-PurpleR isn't in the graph. I'm not sure why that would be. There's a warning about a different genome:

WARNING: Sample R570.5 has mash distance 0.0531179 from the reference. A value this high likely means your data is too diverse to construct a useful pangenome graph from.

but not about LA-PurpleR. Another, unrelated issue I see is that you are naming the other genomes R570.1-5. This will specify a single 5-ploid genome (which should be okay -- hopefully that's what you want).

I'd suggest rerunning (you can use cactus-graphmap-join to pick up the end of the process) without the --vcfReference option. From what I can tell, it should all work. From there you can hopefully use the output to see what happened to LA-PurpleR. If you can share the input fastas with me, I can also try to reproduce here.

syresbr commented 9 months ago

This is the input file: Genomesteste.txt. I will try to run the rest and will keep in touch. Many thanks.

glennhickey commented 9 months ago

Thanks for sharing the data, I can reproduce. The problem is that nothing is aligning, and this is tripping up the VCF export since only the first reference is in the graph.

You can check this by looking at the intermediate output as follows:

for graph in chrom-alignments/*.vg ; do vg paths -Mx "$graph" | grep -v MINIGRAPH ; done
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039579.1  REFERENCE   Np-XR   0   CM039579.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039583.1  REFERENCE   Np-XR   0   CM039583.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039587.1  REFERENCE   Np-XR   0   CM039587.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039591.1  REFERENCE   Np-XR   0   CM039591.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039595.1  REFERENCE   Np-XR   0   CM039595.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039599.1  REFERENCE   Np-XR   0   CM039599.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039603.1  REFERENCE   Np-XR   0   CM039603.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039607.1  REFERENCE   Np-XR   0   CM039607.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039611.1  REFERENCE   Np-XR   0   CM039611.1  NO_PHASE_BLOCK  NO_SUBRANGE
#NAME   SENSE   SAMPLE  HAPLOTYPE   LOCUS   PHASE_BLOCK SUBRANGE
Np-XR#0#CM039615.1  REFERENCE   Np-XR   0   CM039615.1  NO_PHASE_BLOCK  NO_SUBRANGE
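
The same check can be condensed into a list of sample names, which makes it obvious when only the reference made it into the graphs. This is a rough sketch: `list_samples` is a hypothetical helper name, and it assumes the seven-column metadata layout shown in the output above, with the sample name in the third column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: not part of vg or Cactus.

# Read `vg paths -M` metadata on stdin and print each distinct sample name once.
# Header lines (starting with '#') and any line mentioning MINIGRAPH are skipped,
# mirroring the `grep -v MINIGRAPH` filter used in the loop above.
list_samples() {
    awk '!/^#/ && !/MINIGRAPH/ && NF >= 3 { print $3 }' | sort -u
}

# Intended use against the per-chromosome graphs (commented out here because
# it needs vg and the chrom-alignments directory from the run):
# for graph in chrom-alignments/*.vg ; do vg paths -Mx "$graph" | list_samples ; done | sort | uniq -c
```

For the output above this would list only Np-XR, once per chromosome graph.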

and also by inspecting the chromosome splitting log, e.g.:

cat chrom-subproblems/minigraph.split.log  | grep -i purp
Query contig is ambiguous: id=LA-PurpleR|CM036160.1  len=123138761 cov=0.0210869 (vs 0.25) uf=59.2591 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036184.1  len=101319113 cov=0.000205035 (vs 0.25) uf=1.2751 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036168.1  len=104873904 cov=0.0376559 (vs 0.25) uf=184.366 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036192.1  len=89873210 cov=0.017344 (vs 0.25) uf=28.598 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036176.1  len=114710991 cov=0.0207017 (vs 0.25) uf=306.691 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036200.1  len=89652498 cov=0.0136949 (vs 0.25) uf=38.1394 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036208.1  len=66467257 cov=0.000533962 (vs 0.25) uf=1.40888 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036216.1  len=88900448 cov=0.00520676 (vs 0.25) uf=12.578 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036152.1  len=149504399 cov=0.0256767 (vs 0.25) uf=162.09 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036224.1  len=76408053 cov=0.00741138 (vs 0.25) uf=24.6052 (vs 2)
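
To see at a glance how far each contig falls below the coverage threshold, the log lines can be reduced to sample, contig and coverage. This is only a sketch that assumes the `id=SAMPLE|CONTIG` and `cov=` fields appear exactly as in the lines above; `ambiguous_contigs` is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: not part of Cactus.

# Read a chromosome splitting log on stdin and print "sample contig coverage"
# for every contig reported as ambiguous.
ambiguous_contigs() {
    awk '/Query contig is ambiguous/ {
        id = ""; cov = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^id=/)  id  = substr($i, 4)   # drop the "id=" prefix
            if ($i ~ /^cov=/) cov = substr($i, 5)   # drop the "cov=" prefix
        }
        split(id, parts, "|")                       # id is SAMPLE|CONTIG
        print parts[1], parts[2], cov
    }'
}

# Intended use (commented out because it needs the log from the run):
# ambiguous_contigs < chrom-subproblems/minigraph.split.log | sort -k3,3g
```

Every LA-PurpleR contig listed above has a coverage below 0.04, against the 0.25 threshold printed in the log.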

You can keep all these in the graph with --permissiveContigFilter 0. This will let the pipeline run through, but the result will be extremely sparse.

I'm not too sure what's going on. The minigraph graph contains some signal:

gfatools stat teste.sv.gfa.gz
Number of segments: 181100
Number of links: 252086
Number of arcs: 504172
Max rank: 6
Total segment length: 798237885
Average segment length: 4407.719
Sum of rank-0 segment lengths: 685746232
Max degree: 5
Average degree: 1.392
[M::main] Version: 0.4-r214-dirty
[M::main] CMD: gfatools stat teste.sv.gfa.gz
[M::main] Real time: 7.456 sec; CPU: 7.455 sec
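
One way to read these numbers, as a back-of-the-envelope sketch: in minigraph's rGFA output, rank-0 segments are the ones that come from the reference, so the remainder of the total segment length is sequence added by the other genomes. The helper name `nonref_pct` is invented for this example; the two lengths are taken from the stats above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: not part of gfatools or minigraph.

# Percentage of the total segment length that is not on rank 0.
# Arguments: total segment length, sum of rank-0 segment lengths.
nonref_pct() {
    awk -v total="$1" -v rank0="$2" 'BEGIN { printf "%.2f\n", 100 * (total - rank0) / total }'
}

# Values copied from the gfatools stat output.
nonref_pct 798237885 685746232
```

That works out to roughly 14% of the graph's sequence sitting outside the reference backbone, which fits the graph containing some structural variation even though the base-level alignment is nearly empty.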

I will try to double-check there isn't a bug, but I suspect this data may just be too repetitive to get a base-level pangenome using mc.

syresbr commented 9 months ago

Thanks, I will talk with my supervisor and see what we can do. You are awesome.