Closed: syresbr closed this issue 10 months ago.
That message is a warning [W], not an error, and should not cause a failure in most environments (though I can see how it's confusing). If your job is failing, please share the whole log and maybe we can see what the real issue is.
This is getting through most of the pipeline, then failing when creating the VCF.
RuntimeError: Command /usr/bin/time -f "CACTUS-LOGGED-MEMORY-IN-KB: %M" bash -c 'set -eo pipefail && vg deconstruct /home/gustavo.carvalho/pangenome/teste/f31aa1dfa6265f7da22db6c159ea7606/7d7c/dc82/tmpvjtsnh0y/clip.teste.gbz -P LA-PurpleR -C -a -r /home/gustavo.carvalho/pangenome/teste/f31aa1dfa6265f7da22db6c159ea7606/7d7c/dc82/tmpvjtsnh0y/clip.teste.snarls -t 47 -O | bgzip --threads 47' exited 1: stderr=Error [vg deconstruct]: No specified reference path or prefix found in graph
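One quick way to see which samples actually made it into the GBZ is to list its path names and split off the sample field. This is a sketch, not from the original thread; it uses the clip.teste.gbz file from the error message and assumes vg's SAMPLE#HAPLOTYPE#CONTIG path naming:

```shell
# List the distinct sample names among the graph's paths.
# GBZ path names follow SAMPLE#HAPLOTYPE#CONTIG, so the first '#'-separated
# field is the sample; the name passed to -P must appear in this list.
vg paths -L -x clip.teste.gbz | cut -d'#' -f1 | sort -u
```

If LA-PurpleR is missing from that list, vg deconstruct has nothing to use for -P LA-PurpleR, which is exactly the error above.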
In particular, it's complaining that LA-PurpleR isn't in the graph. I'm not sure why that would be. There's a warning about a different genome:

WARNING: Sample R570.5 has mash distance 0.0531179 from the reference. A value this high likely means your data is too diverse to construct a useful pangenome graph from.

but not about LA-PurpleR. Another, unrelated issue I see is that you are naming the other genomes R570.1-5. This will specify a single 5-ploid genome (which should be okay -- hopefully that's what you want).
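For reference, here is a seqFile sketch of what that naming implies (the fasta file names are hypothetical; the name.N convention for grouping haplotypes under one sample follows the minigraph-cactus documentation):

```text
Np-XR       Np-XR.fa
LA-PurpleR  LA-PurpleR.fa
R570.1      R570_hap1.fa
R570.2      R570_hap2.fa
R570.3      R570_hap3.fa
R570.4      R570_hap4.fa
R570.5      R570_hap5.fa
```

The .1 through .5 suffixes group five haplotypes under a single sample R570; five independent samples would instead need five distinct base names.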
I'd suggest rerunning without the --vcfReference option (you can use cactus-graphmap-join to pick up from the end of the process). From what I can tell, it should all work, and you can hopefully use the output to see what happened to LA-PurpleR. If you can share the input fastas with me, I can also try to reproduce here.
This is the input file: Genomesteste.txt. I will try to run the rest. I'll keep in touch. Many thanks.
Thanks for sharing the data, I can reproduce. The problem is that nothing is aligning, and this is tripping up the VCF export since only the first reference is in the graph.
You can check this by looking at the intermediate output as follows:
for graph in chrom-alignments/*.vg ; do vg paths -Mx $graph | grep -v MINIGRAPH ; done
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039579.1 REFERENCE Np-XR 0 CM039579.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039583.1 REFERENCE Np-XR 0 CM039583.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039587.1 REFERENCE Np-XR 0 CM039587.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039591.1 REFERENCE Np-XR 0 CM039591.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039595.1 REFERENCE Np-XR 0 CM039595.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039599.1 REFERENCE Np-XR 0 CM039599.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039603.1 REFERENCE Np-XR 0 CM039603.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039607.1 REFERENCE Np-XR 0 CM039607.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039611.1 REFERENCE Np-XR 0 CM039611.1 NO_PHASE_BLOCK NO_SUBRANGE
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
Np-XR#0#CM039615.1 REFERENCE Np-XR 0 CM039615.1 NO_PHASE_BLOCK NO_SUBRANGE
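To make "nothing is aligning" concrete, you can count how many non-reference, non-MINIGRAPH paths each chromosome graph contains. A sketch, assuming the same chrom-alignments layout as above and Np-XR as the first reference:

```shell
# Count paths that are neither MINIGRAPH rGFA paths nor Np-XR reference paths.
# A count of 0 for every graph means no other assembly contigs aligned in.
# (grep -c exits 1 when the count is 0, hence the || true.)
for graph in chrom-alignments/*.vg ; do
    n=$(vg paths -L -x "$graph" | grep -vc -e '^Np-XR#' -e 'MINIGRAPH' || true)
    echo "$graph: $n non-reference paths"
done
```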
and also by inspecting the chromosome splitting log, e.g.:
cat chrom-subproblems/minigraph.split.log | grep -i purp
Query contig is ambiguous: id=LA-PurpleR|CM036160.1 len=123138761 cov=0.0210869 (vs 0.25) uf=59.2591 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036184.1 len=101319113 cov=0.000205035 (vs 0.25) uf=1.2751 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036168.1 len=104873904 cov=0.0376559 (vs 0.25) uf=184.366 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036192.1 len=89873210 cov=0.017344 (vs 0.25) uf=28.598 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036176.1 len=114710991 cov=0.0207017 (vs 0.25) uf=306.691 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036200.1 len=89652498 cov=0.0136949 (vs 0.25) uf=38.1394 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036208.1 len=66467257 cov=0.000533962 (vs 0.25) uf=1.40888 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036216.1 len=88900448 cov=0.00520676 (vs 0.25) uf=12.578 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036152.1 len=149504399 cov=0.0256767 (vs 0.25) uf=162.09 (vs 2)
Query contig is ambiguous: id=LA-PurpleR|CM036224.1 len=76408053 cov=0.00741138 (vs 0.25) uf=24.6052 (vs 2)
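A quick way to check whether LA-PurpleR is the only assembly being filtered out, or whether others are affected too, is to tally the ambiguous-contig lines per sample. A sketch; the sed pattern assumes the id=SAMPLE|CONTIG format shown in the log lines above:

```shell
# Count 'ambiguous' contigs per sample in the split log, most-affected first.
grep 'Query contig is ambiguous' chrom-subproblems/minigraph.split.log \
    | sed 's/.*id=//; s/|.*//' | sort | uniq -c | sort -rn
```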
You can keep all these contigs in the graph with --permissiveContigFilter 0. This will let the pipeline run through, but the result will be extremely sparse.
I'm not too sure what's going on. The minigraph graph contains some signal:
gfatools stat teste.sv.gfa.gz
Number of segments: 181100
Number of links: 252086
Number of arcs: 504172
Max rank: 6
Total segment length: 798237885
Average segment length: 4407.719
Sum of rank-0 segment lengths: 685746232
Max degree: 5
Average degree: 1.392
[M::main] Version: 0.4-r214-dirty
[M::main] CMD: gfatools stat teste.sv.gfa.gz
[M::main] Real time: 7.456 sec; CPU: 7.455 sec
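One way to quantify that "some signal": compare the rank-0 (reference backbone) segment length against the total segment length reported above. A sketch using those two numbers:

```shell
# Of 798,237,885 bp of total segment length, 685,746,232 bp is rank-0
# (the reference backbone); the remainder was contributed by other assemblies.
awk 'BEGIN { total = 798237885; rank0 = 685746232;
             printf "%.1f%% rank-0, %.1f%% from other assemblies\n",
                    100 * rank0 / total, 100 * (total - rank0) / total }'
```

So roughly 14% of the graph sequence comes from the non-reference assemblies, which is consistent with the graph carrying some signal even though almost nothing survives the contig filter.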
I will try to double-check that there isn't a bug, but I suspect this data may just be too repetitive to build a base-level pangenome from using mc.
Thanks, I will talk with my supervisor and see what we can do. You are awesome.
I'm trying to run the 12/2023 South Africa Refgraph Hackathon tutorial with my data (just a fraction of all the data I will use, to check that I'm doing things right) on a cluster. It only generates the GFA file and then has a problem with disk space, saying it needs more. Reviewing the logs, this doesn't happen in the main thread, only in thread 4. It shows this error:
[Thread-4 (statsAndLoggingAggregator)] [W] [toil.statsAndLogging] Got message from job at time 01-08-2024 14:08:24: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job files/for-job/kind-odgi_squeeze/instance-6ksil339/cleanup/file-cee5c12f46104224972c79efec5f5a07/stream used 103.49% disk (1.3 GiB [1406898176B] used, 1.3 GiB [1359448456B] requested).
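That warning reports only a small overshoot, which you can verify from the byte counts embedded in the message itself (a sketch):

```shell
# 1406898176 B used vs 1359448456 B requested, per the Toil warning above.
awk 'BEGIN { used = 1406898176; req = 1359448456;
             printf "%.2f%% of requested disk used (%.1f MiB over)\n",
                    100 * used / req, (used - req) / 1048576 }'
```

A roughly 45 MiB overshoot in a warning (not an error) usually isn't fatal on its own; if the run actually fails, raising the default per-job disk (e.g. Toil's --defaultDisk option) is a reasonable first step.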
My code is this:
Thank you for your patience.