Snarl with many nodes - Githubissues

IsaacDiaz026 commented 1 year ago

Hello,

I generated a pangenome graph using 16 haplotypes, and would like to use the default graph (rather than allele freq. filtered graph) for read mapping with giraffe.

However, when I check the number of nodes in the 5 largest snarls using vg stats -b graph.dist | sort -nr -k 3 | head -n 5 I get the following results.

The largest five snarls have 120408, 2452, 987, 893, 869 nodes. Do you have any advice on how to investigate the first snarl to understand why it has so many nodes?

Any help would be appreciated. Thanks!

glennhickey commented 1 year ago

You can check where in the genome your snarl is by trying to look up the endpoints with vg find -n <NODE_ID -c 1 | vg view -. This should give you the reference paths it intersects (increase -c if it doesn't). Once you know the name of the reference path, you can find the position of the node on it with vg find -P. You might also be able to check the (raw) VCF output, and find the giant site that corresponds to your snarl (the variant ID will be the snarls endpoints from the table). You can also try vg chunk -p ref-path:start-end -S <snarls> -O gfa > chunk.gfa then sending that gfa into Bandage-NG to visualize your snarl.

glennhickey commented 1 year ago

It's start ID | end ID | nodes | depth.

On Tue, Aug 15, 2023 at 4:10 PM Isaac Diaz @.***> wrote:

I am a bit confused on how to get the node id? I was able to find the snarl with many nodes using this command vg stats -b graph.dist | sort -nr -k 3 | head -n 1

which results in

28844828 | 280417 | 120408 | 1

I understand that the 3rd column is the number of nodes, but I do not know what the other columns are.

— Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/cactus/issues/1128#issuecomment-1679539323, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG373UPAV2NS36EURDNIATXVPJT7ANCNFSM6AAAAAA3L4GGYU . You are receiving this because you commented.Message ID: @.***>

IsaacDiaz026 commented 1 year ago

Is it possible for a snarl to span multiple reference chromosomes? When I run: vg stats -b graph.dist | sort -nr -k 3 | head -n 5 The largest snarl has many nodes and a depth of 1

start_id | end_id | # NODES | depth 28844828 | 280417 | 120408 | 1

I used vg find to locate the positions on the reference for these start and end ids

vg find -n 28844828 -x $GRAPH -c 3 > largest_snarl_start.vg vg find -n 280417 -x $GRAPH -c 3 > largest_snarl_end.vg

Using vg view I examine the reference path for the start and ends of the snarl.For the start I see: W PWN 0 Scaffold_4A 0 24 >28844823>28844825>28844826>28844828>28844829>28844830>28844831

For the end I see

W PWN 0 Scaffold_8A 0 44 >280412>280413>280414>280415>280417>280418>280419>280420>280956

They are on two different chromosomes of the reference. My reference is chromosome level. When constructing the pan genome graph I did not use cactus-graph map-split. Could this be the issue?

glennhickey commented 1 year ago

I was going to say this was impossible then I read your last line -- yeah, you are guaranteed one connected component per reference contig, but only when going through cactus-graphmap-split or cactus-pangenome. If you're not using that, all bets are off and it is free to collapse together different reference contigs if they have enough similarity. This could certainly jack up your snalr size.

IsaacDiaz026 commented 1 year ago

Is there a way to remove this one snarl (to avoid running graph map-split). I am running into an error when trying to run graph map split. It seems to fail at the rgfa-split step, here is the error message.

RuntimeError: Command ['rgfa-split', '-p', '/bigdata/seymourlab/idiaz026/Citrus_non-coding_evolution/2023-03-15_pangenome_assembly/workdir/0bac023a9c715b469924ec8485b88a90/17d0/bcc7/tmpalw_b9j2/mg.paf', '-b', '/bigdata/seymourlab/idiaz026/Citrus_non-coding_evolution/2023-03-15_pangenome_assembly/workdir/0bac023a9c715b469924ec8485b88a90/17d0/bcc7/tmpalw_b9j2/split_', '-Q', '2', '-a', '_AMBIGUOUS_', '-L', '/bigdata/seymourlab/idiaz026/Citrus_non-coding_evolution/2023-03-15_pangenome_assembly/workdir/0bac023a9c715b469924ec8485b88a90/17d0/bcc7/tmpalw_b9j2/split.log', '-n', '0.75', '-n', '0.5', '-n', '0.25', '-T', '100000', '-T', '1000000', '-g', '/bigdata/seymourlab/idiaz026/Citrus_non-coding_evolution/2023-03-15_pangenome_assembly/workdir/0bac023a9c715b469924ec8485b88a90/17d0/bcc7/tmpalw_b9j2/mg.gfa', '-G', '-r', 'id=PWN|', '-A', '5'] signaled SIGABRT: stderr=terminate called after throwing an instance of 'std::invalid_argument'

I checked all the arguments I gave in my initial command and it looks fine to me. This is the command I used

` cactus-graphmap-split jobstore test_seqfile.txt first_run.gfa first_run.paf --workDir /bigdata/seymourlab/idiaz026/Citrus_non-coding_evolution/2023-03-15_pangenome_assembly/workdir --outDir /bigdata/seymourlab/idiaz026/Citrus_non-coding_evolution/2023-03-15_pangenome_assembly/graphmap_split --reference PWN --binariesMode local

ComparativeGenomicsToolkit / cactus

Snarl with many nodes #1128