bcgsc / ntSynt

Detecting multi-genome synteny using minimizer graph mapping
GNU General Public License v3.0
63 stars 1 forks source link

Computing synteny blocks between four genomes - missing links in ribbon plot #36

Closed mariamadrid19 closed 4 months ago

mariamadrid19 commented 4 months ago

Hello! I am trying to compare 4 assemblies. Everything is working great except when I try to plot the links between them, the figure I get is missing the links in the middle. I don't understand why this is happening and would appreciate the feedback. This is the plot it's generating:

synteny_all_haps

And this is the R script I used. I had to load dplyr because the "arrange" function wasn't working and remove the "no_x_axis" function because R couldn't find it.

#!/usr/bin/env Rscript
library(argparse)
library(gggenomes)
library(gtools)
library(scales)
library(dplyr)

# Example script for generating ntSynt synteny ribbon plots using gggenomes

# Parse the input arguments
parser <- ArgumentParser(description = "Plot the ntSynt synteny blocks using gggenomes")
parser$add_argument("-s", "--sequences", help = "Input sequence lengths TSV", required = TRUE)
parser$add_argument("-l", "--links", help = "Synteny block links", required = TRUE)
parser$add_argument("--scale", help = "Length of scale bar in bases (default 1 Gbp)", default = 1e9,
                    required = FALSE, type = "double")
parser$add_argument("-p", "--prefix",
                    help = "Output prefix for PNG image (default synteny_gggenomes_plot)", required = FALSE,
                    default = "synteny_gggenomes_plot")

args <- parser$parse_args()

# Read in and prepare sequences
sequences <- read.csv(args$sequences, sep = "\t", header = TRUE)

# https://stackoverflow.com/questions/32378108/using-gtoolsmixedsort-or-alternatives-with-dplyrarrange
mixedrank <- function(x) order(gtools::mixedorder(x))
sequences <- sequences %>%
  arrange(mixedrank(seq_id))

# Read in and prepare synteny links
links_ntsynt <- read.csv(args$links,
                         sep = "\t", header = TRUE)
links_ntsynt$seq_id <- factor(links_ntsynt$seq_id,
                              levels = mixedsort(unique(links_ntsynt$seq_id)))
links_ntsynt <- links_ntsynt[mixedorder(links_ntsynt$seq_id), ]
links_ntsynt$seq_id2 <- as.character(links_ntsynt$seq_id2)
links_ntsynt$colour_block <- as.factor(links_ntsynt$colour_block)

# Prepare scale bar data frame
scale_bar <- tibble(x = c(0), xend = c(args$scale),
                    y = c(0), yend = c(0))

# Infer best units for scale bar
label <- paste(args$scale, "bp", sep = " ")
if (args$scale %% 1e9 == 0) {
  label <- paste(args$scale / 1e9, "Gbp", sep = " ")
} else if (args$scale %% 1e6 == 0) {
  label <- paste(args$scale / 1e6, "Mbp", sep = " ")
} else if (args$scale %% 1e3 == 0) {
  label <- paste(args$scale / 1e3, "kbp", sep = " ")
}

# Make the ribbon plot - these layers can be fully customized as needed!
make_plot <- function(links, sequences, add_scale_bar = FALSE) {

  p <-  gggenomes(seqs = sequences, links = links)
  plot <- p + theme_gggenomes_clean(base_size = 15) +
    geom_link(aes(fill = colour_block), offset = 0, alpha = 0.5) +
    geom_seq(size = 2, colour = "grey") + # draw contig/chromosome lines
    geom_bin_label(aes(label = bin_id), size = 6, hjust = 0.9) + # label each bin
    #geom_seq_label(aes(label = seq_id), vjust = 1.1, size = 4) + # Can add seq labels if desired
    theme(axis.text.x = element_text(size = 25),
          legend.position = "bottom") +
    scale_fill_manual(values = hue_pal()(length(unique(links$seq_id))),
                      breaks = unique(links$seq_id)) +
    scale_colour_manual(values = c("red")) +
    guides(fill = guide_legend(title = "", ncol = 10),
           colour = guide_legend(title = ""))

  if (add_scale_bar) {
    plot <- plot + geom_segment(data = scale_bar, aes(x = x, xend = xend, y = y, yend = yend),
                                linewidth = 1.5) +
      geom_text(data = scale_bar, aes(x = x + (xend / 2), y = y - 0.3, label = label))
  }

  return(plot)

}

synteny_plot <- make_plot(links_ntsynt, sequences, add_scale_bar = TRUE)

# Save the ribbon plot
ggsave(paste(args$prefix, ".png", sep = ""), synteny_plot,
       units = "cm", widt = 50, height = 20, bg = "white")

cat(paste("Plot saved:", paste(args$prefix, ".png", sep = ""), "\n", sep = " "))
lcoombe commented 4 months ago

Hi @mariamadrid19,

Thanks for reaching out! I have seen that before - I thought I had fixed the issue, but looks like there could still be some considerations I'm missing for some inputs..

A couple follow-up questions to us troubleshoot:

Thank you for your interest in ntSynt! Lauren

mariamadrid19 commented 4 months ago

Hi Lauren, Thanks for quick response!

  1. Yes, I'm using the most recent commit
  2. For sure, these are the first lines of the links.tsv file

block_id seq_id bin_id start end seq_id2 bin_id2 start2 end2 strand block_ori colour_block 0 scaffold_1 dh1_50_scaffolds.fasta 127771 10241253 scaffold_1_RagTag sorted_dh2.fasta 353864 10513305 + + scaffold_1 0 scaffold_1_RagTag sorted_dh2.fasta 353864 10513305 scaffold_1_RagTag sorted_nh1.fasta 391161 10456794 + + scaffold_1 0 scaffold_1_RagTag sorted_nh1.fasta 391161 10456794 scaffold_1_RagTag sorted_nh2.fasta 481757 10635563 + + scaffold_1 1 scaffold_1 dh1_50_scaffolds.fasta 12669193 14497695 scaffold_1_RagTag sorted_dh2.fasta 10556355 12368144 + + scaffold_1 1 scaffold_1_RagTag sorted_dh2.fasta 10556355 12368144 scaffold_1_RagTag sorted_nh1.fasta 12943737 14741589 + + scaffold_1 1 scaffold_1_RagTag sorted_nh1.fasta 12943737 14741589 scaffold_1_RagTag sorted_nh2.fasta 13047019 14852858 + + scaffold_1 2 scaffold_1 dh1_50_scaffolds.fasta 16898655 17501766 scaffold_1_RagTag sorted_dh2.fasta 12408555 12988155 + + scaffold_1 2 scaffold_1_RagTag sorted_dh2.fasta 12408555 12988155 scaffold_1_RagTag sorted_nh1.fasta 17190019 17784012 + + scaffold_1 2 scaffold_1_RagTag sorted_nh1.fasta 17190019 17784012 scaffold_1_RagTag sorted_nh2.fasta 15820487 16417467 + + scaffold_1 3 scaffold_1 dh1_50_scaffolds.fasta 17523763 20283963 scaffold_1_RagTag sorted_dh2.fasta 13004164 15889460 + + scaffold_1 3 scaffold_1_RagTag sorted_dh2.fasta 13004164 15889460 scaffold_1_RagTag sorted_nh1.fasta 17797545 20663198 + + scaffold_1 3 scaffold_1_RagTag sorted_nh1.fasta 17797545 20663198 scaffold_1_RagTag sorted_nh2.fasta 17978199 20924186 + + scaffold_1 4 scaffold_1 dh1_50_scaffolds.fasta 20352919 20773473 scaffold_1_RagTag sorted_dh2.fasta 15922344 16347458 - - scaffold_1 4 scaffold_1_RagTag sorted_dh2.fasta 15922344 16347458 scaffold_1_RagTag sorted_nh1.fasta 20690755 21114805 + - scaffold_1 4 scaffold_1_RagTag sorted_nh1.fasta 20690755 21114805 scaffold_1_RagTag sorted_nh2.fasta 20975701 21402802 + - scaffold_1 5 scaffold_1 dh1_50_scaffolds.fasta 20807163 22392208 scaffold_1_RagTag sorted_dh2.fasta 16415892 18070887 + + scaffold_1 5 scaffold_1_RagTag sorted_dh2.fasta 16415892 18070887 scaffold_1_RagTag sorted_nh1.fasta 21128295 22747929 + + scaffold_1 5 scaffold_1_RagTag sorted_nh1.fasta 21128295 22747929 scaffold_1_RagTag sorted_nh2.fasta 21459278 23158762 + + scaffold_1 6 scaffold_10 dh1_50_scaffolds.fasta 1420700 4717528 scaffold_10_RagTag sorted_dh2.fasta 3018 3310683 + + scaffold_10 6 scaffold_10_RagTag sorted_dh2.fasta 3018 3310683 scaffold_10_RagTag sorted_nh1.fasta 3177135 6523089 + + scaffold_10 6 scaffold_10_RagTag sorted_nh1.fasta 3177135 6523089 scaffold_10_RagTag sorted_nh2.fasta 3261020 6637663 + + scaffold_10 7 scaffold_10 dh1_50_scaffolds.fasta 5087916 6120902 scaffold_10_RagTag sorted_dh2.fasta 3327575 4338644 + + scaffold_10 7 scaffold_10_RagTag sorted_dh2.fasta 3327575 4338644 scaffold_10_RagTag sorted_nh1.fasta 6877692 7861286 + + scaffold_10 7 scaffold_10_RagTag sorted_nh1.fasta 6877692 7861286 scaffold_10_RagTag sorted_nh2.fasta 7009219 8013787 + + scaffold_10 8 scaffold_10 dh1_50_scaffolds.fasta 6382001 6754520 scaffold_10_RagTag sorted_dh2.fasta 4578058 4942654 + + scaffold_10 8 scaffold_10_RagTag sorted_dh2.fasta 4578058 4942654 scaffold_10_RagTag sorted_nh1.fasta 7861773 8276295 + + scaffold_10 8 scaffold_10_RagTag sorted_nh1.fasta 7861773 8276295 scaffold_10_RagTag sorted_nh2.fasta 8273514 8667302 + + scaffold_10 9 scaffold_10 dh1_50_scaffolds.fasta 7091432 9468155 scaffold_10_RagTag sorted_dh2.fasta 4952344 7311704 + + scaffold_10 9 scaffold_10_RagTag sorted_dh2.fasta 4952344 7311704 scaffold_10_RagTag sorted_nh1.fasta 8696863 11108309 + + scaffold_10 9 scaffold_10_RagTag sorted_nh1.fasta 8696863 11108309 scaffold_10_RagTag sorted_nh2.fasta 9092319 11466626 + + scaffold_10 10 scaffold_10 dh1_50_scaffolds.fasta 9567861 10314791 scaffold_10_RagTag sorted_dh2.fasta 7538746 8303917 + + scaffold_10 10 scaffold_10_RagTag sorted_dh2.fasta 7538746 8303917 scaffold_10_RagTag sorted_nh1.fasta 11203238 11904562 + + scaffold_10 10 scaffold_10_RagTag sorted_nh1.fasta 11203238 11904562 scaffold_10_RagTag sorted_nh2.fasta 11640620 12409494 + + scaffold_10 11 scaffold_10 dh1_50_scaffolds.fasta 10388709 10759026 scaffold_14_RagTag sorted_dh2.fasta 1921821 2294353 + + scaffold_10 11 scaffold_14_RagTag sorted_dh2.fasta 1921821 2294353 scaffold_10_RagTag sorted_nh1.fasta 11919620 12267090 + + scaffold_10 11 scaffold_10_RagTag sorted_nh1.fasta 11919620 12267090 scaffold_10_RagTag sorted_nh2.fasta 12468715 12830402 + + scaffold_10 12 scaffold_11 dh1_50_scaffolds.fasta 1626 936317 scaffold_11_RagTag sorted_dh2.fasta 9082717 10035245 + + scaffold_11 12 scaffold_11_RagTag sorted_dh2.fasta 9082717 10035245 scaffold_11_RagTag sorted_nh1.fasta 9149473 10077785 + + scaffold_11 12 scaffold_11_RagTag sorted_nh1.fasta 9149473 10077785 scaffold_23_RagTag sorted_nh2.fasta 8668325 9609365 + + scaffold_11 13 scaffold_11 dh1_50_scaffolds.fasta 1117585 3227950 scaffold_11_RagTag sorted_dh2.fasta 10227841 12337977 + + scaffold_11 13 scaffold_11_RagTag sorted_dh2.fasta 10227841 12337977 scaffold_11_RagTag sorted_nh1.fasta 10242511 12337086 + + scaffold_11 13 scaffold_11_RagTag sorted_nh1.fasta 10242511 12337086 scaffold_23_RagTag sorted_nh2.fasta 9642751 11758037 + + scaffold_11 14 scaffold_11 dh1_50_scaffolds.fasta 3365684 3610293 scaffold_11_RagTag sorted_dh2.fasta 12579399 12825768 + + scaffold_11 14 scaffold_11_RagTag sorted_dh2.fasta 12579399 12825768 scaffold_11_RagTag sorted_nh1.fasta 12419983 12659858 + + scaffold_11 14 scaffold_11_RagTag sorted_nh1.fasta 12419983 12659858 scaffold_11_RagTag sorted_nh2.fasta 146629 421083 + + scaffold_11 15 scaffold_11 dh1_50_scaffolds.fasta 3827112 3978375 scaffold_11_RagTag sorted_dh2.fasta 12911321 13062925 + + scaffold_11 15 scaffold_11_RagTag sorted_dh2.fasta 12911321 13062925 scaffold_11_RagTag sorted_nh1.fasta 12943806 13157704 + + scaffold_11 15 scaffold_11_RagTag sorted_nh1.fasta 12943806 13157704 scaffold_11_RagTag sorted_nh2.fasta 573870 725715 + + scaffold_11 16 scaffold_11 dh1_50_scaffolds.fasta 3978848 4046659 scaffold_11_RagTag sorted_dh2.fasta 13063398 13131145 + - scaffold_11 16 scaffold_11_RagTag sorted_dh2.fasta 13063398 13131145 scaffold_11_RagTag sorted_nh1.fasta 14136737 14207142 - - scaffold_11 16 scaffold_11_RagTag sorted_nh1.fasta 14136737 14207142 scaffold_11_RagTag sorted_nh2.fasta 726188 799664 - - scaffold_11 17 scaffold_11 dh1_50_scaffolds.fasta 4082815 4263724 scaffold_11_RagTag sorted_dh2.fasta 13167332 13353628 + + scaffold_11 17 scaffold_11_RagTag sorted_dh2.fasta 13167332 13353628 scaffold_11_RagTag sorted_nh1.fasta 14244453 14434749 + + scaffold_11 17 scaffold_11_RagTag sorted_nh1.fasta 14244453 14434749 scaffold_11_RagTag sorted_nh2.fasta 835842 1027534 + + scaffold_11 18 scaffold_11 dh1_50_scaffolds.fasta 4958836 5018132 scaffold_11_RagTag sorted_dh2.fasta 14045546 14104805 + - scaffold_11 18 scaffold_11_RagTag sorted_dh2.fasta 14045546 14104805 scaffold_11_RagTag sorted_nh1.fasta 15344045 15415395 - - scaffold_11 18 scaffold_11_RagTag sorted_nh1.fasta 15344045 15415395 scaffold_11_RagTag sorted_nh2.fasta 1920062 1991302 + - scaffold_11 19 scaffold_11 dh1_50_scaffolds.fasta 5262159 5754322 scaffold_11_RagTag sorted_dh2.fasta 14344236 14836685 + + scaffold_11 19 scaffold_11_RagTag sorted_dh2.fasta 14344236 14836685 scaffold_11_RagTag sorted_nh1.fasta 14439090 14944729 + + scaffold_11 19 scaffold_11_RagTag sorted_nh1.fasta 14439090 14944729 scaffold_11_RagTag sorted_nh2.fasta 1031941 1525539 + + scaffold_11 20 scaffold_11 dh1_50_scaffolds.fasta 5755719 5818193 scaffold_11_RagTag sorted_dh2.fasta 14838082 14894928 + + scaffold_11 20 scaffold_11_RagTag sorted_dh2.fasta 14838082 14894928 scaffold_11_RagTag sorted_nh1.fasta 13513757 13577354 + + scaffold_11 20 scaffold_11_RagTag sorted_nh1.fasta 13513757 13577354 scaffold_11_RagTag sorted_nh2.fasta 1526952 1586104 + + scaffold_11 21 scaffold_11 dh1_50_scaffolds.fasta 5828277 7982186 scaffold_11_RagTag sorted_dh2.fasta 14905011 17070329 + + scaffold_11 21 scaffold_11_RagTag sorted_dh2.fasta 14905011 17070329 scaffold_11_RagTag sorted_nh1.fasta 15004490 17251407 + + scaffold_11 21 scaffold_11_RagTag sorted_nh1.fasta 15004490 17251407 scaffold_11_RagTag sorted_nh2.fasta 1594698 3827707 + + scaffold_11 22 scaffold_11 dh1_50_scaffolds.fasta 8878988 8893702 scaffold_11_RagTag sorted_dh2.fasta 17849789 17864461 - - scaffold_11 22 scaffold_11_RagTag sorted_dh2.fasta 17849789 17864461 scaffold_19_RagTag sorted_nh1.fasta 9059938 9074616 + - scaffold_11 22 scaffold_19_RagTag sorted_nh1.fasta 9059938 9074616 scaffold_34 sorted_nh2.fasta 216694 237293 + - scaffold_11 23 scaffold_11 dh1_50_scaffolds.fasta 8939931 8970529 scaffold_11_RagTag sorted_dh2.fasta 17772928 17803541 - - scaffold_11

mariamadrid19 commented 4 months ago

I tried it with only 3, instead of 4, assemblies, and that one worked perfectly fine. I can't figure out what the problem is. synteny_all_haps Could it be that the problem is with my assembly (nh2 isn't mapping correctly?)?

lcoombe commented 4 months ago

Thanks for that!

Ok, yes it looks like a sorting issue - So for the first synteny block/link from your file I see this:

0 scaffold_1 dh1_50_scaffolds.fasta 127771 10241253 scaffold_1_RagTag sorted_dh2.fasta 353864 10513305 + + scaffold_1
0 scaffold_1_RagTag sorted_dh2.fasta 353864 10513305 scaffold_1_RagTag sorted_nh1.fasta 391161 10456794 + + scaffold_1
0 scaffold_1_RagTag sorted_nh1.fasta 391161 10456794 scaffold_1_RagTag sorted_nh2.fasta 481757 10635563 + + scaffold_1

So, it is formatted in such a way that the assemblies are expected to be ordered: dh1_50_scaffolds.fasta, sorted_dh2.fasta, sorted_nh1.fasta, sorted_nh2.fasta But, it looks like the gtools::mixedorder that I'm using to sort the input assemblies is finding that slightly different order (ie sorted_nh2.fasta before sorted_nh1.fasta)?

One thing that could be an easy thing to try as a workaround (looks like the genomes are fairly small, so ntSynt should have run fairly quickly?) - could you try renaming one of sorted_nh1.fasta or sorted_nh2.fasta to something a bit more different alphabetically? Like sorted_xh2.fa or something?

I think that your assemblies and the ntSynt outputs are fine - I think it's probably just something wrong in the parsing of the links within the R script.

mariamadrid19 commented 4 months ago

Hey, it was in fact just an issue order. I re-ran it with a new name and now it looks fine. Thanks a lot! synteny_all_assemblies

lcoombe commented 4 months ago

Excellent! I'm glad that solved it! I'll see if there's anything easy on my end that I could tweak to fix that case, otherwise I'll make a note of that potential issue on the README 👍