JustinChu / JupiterPlot

A Circos-based tool to visualize genome assembly consistency or synteny between assemblies.
GNU General Public License v3.0
68 stars 11 forks source link

No links despite large regions of similarity visible in dot plot from MUMMER alignment #38

Closed mptrsen closed 12 months ago

mptrsen commented 2 years ago

There seems to be an incongruence between JupiterPlot and a "regular" dot plot from a MUMMER (nucmer) alignment of two genomes of a chlorophyte algae species (21 Mbp). JupiterPlot shows no connections, whereas the dot plot shows large regions of similarity pretty much throughout the entire genome.

JupiterPlot

MUMMER alignment

Please find the genomes and results here: https://uni-bonn.sciebo.de/s/LrubQxY319Lf8nn

I ran both JupiterPlot and MUMMER with default parameters:

cd jupiterplot
jupiter name=Micpus ref=../data/Micpus1545_MpusillaCCMP1545_228_v3.0.fasta fa=../data/MicpusNOUM17_MicpuN3v2_AssemblyScaffolds.fasta
cd mummer_dotplot
nucmer --coords ../data/Micpus1545_MpusillaCCMP1545_228_v3.0.fasta ../data/MicpusNOUM17_MicpuN3v2_AssemblyScaffolds.fasta

To visualise the MUMMER alignment, I used Dot (https://dot.sandbox.bio/). It requires input reformatted by the DotPrep.py script included in the mummer_dotplot directory. It was run as:

python DotPrep.py --delta out.delta --out Micpus

This produces the files Micpus.coords and Micpus.coords.idx that are also in the uploaded results, and can be used in the Dot viewer to reproduce the dot plot above.

I tried setting minBundleSize=1 and maxGap=9999999999 with the effect that it renders three small connections:

JupiterPlot, minBundleSize=1 maxGap=9999999999

Do I need to set other parameters? Or how can I make JupiterPlot's output show the obvious similarity between the two genomes that is visible in the dot plot?

Sven-Winter commented 2 years ago

Hi Malte,

I have similar issues with JupiterPlot, yet on a slightly different scale. I want to compare two highly conserved, closely related species, and 18 chromosomes match perfectly just the smallest one shows no links at all. I also tried mummer after seeing your plot, and it fits perfectly. I run JupiterPlot again, just comparing the one chromosome of the two species, and it also shows perfect synteny. So it would be really nice to know why JupiterPlot sometimes does not link clearly homologous regions.

JustinChu commented 2 years ago

I think the key would be to look at the minimap2 alignments. To fix this issue I think I would need to see the resulting bam file created in the pipeline. There could also be an element of filtering involved: https://github.com/JustinChu/JupiterPlot/blob/45676f3ee3b2f2134941cf6276a51e3bdaa07e0b/jupiter#L80 If the genome is highly repetitive it is possible that I have been filtering out links. If it is filtering related it seems I should probably include an option to allow repetitive links or change the default behaviour to not filter on mapping quality at all. I don't recall if there was a clear reason why I chose to filter on mapping quality in retrospect.

JustinChu commented 1 year ago

@mptrsen Can you try the latest commit with the MAPQ parameter set to 0?

Sven-Winter commented 1 year ago

Hi @JustinChu, for my dataset I have checked the bed file and all the alignments I am missing in the plot are there but are missing in the links, karyotype, and SeqOrder files. Filtering by mapping Q seems not to be the issue. I could also send you my two assemblies if that helps. The genomes are standard mammal genomes and very conserved between the two species so I would assume this is not an issue of repeats.

Bildschirmfoto 2022-11-04 um 11 44 45

VMU_Ajub_Pben_label_100

mptrsen commented 1 year ago

@mptrsen Can you try the latest commit with the MAPQ parameter set to 0?

Same result, sorry. I'll upload the alignment SAM next week.

JustinChu commented 1 year ago

@sven1990 Sending me the assemblies would be very helpful. Also, let me know what syntenic region you expect to see on the plot that is only visible when you subset the chromosome.

Sven-Winter commented 1 year ago

Hi @JustinChu,

thanks for looking into it.

You can download the assemblies from my dropbox [(https://www.dropbox.com/sh/rvp92mdi9pcsw4h/AAAi3HpFwvykg7HwwUp4pP2da?dl=0)]

I expect to see syntenic regions between the largest 19 scaffolds (chromosomes) of both assemblies. However, the plot so far does not show synteny between scaffold_19 of VMU_Ajub_asm_1.0 and scaffold NC_057357.1 from GCF_016509475.1.

As I mentioned before, there are alignments in the bed file, but in the end, they are missing in the plot.

Thanks again!

Sven

mptrsen commented 1 year ago

I have updated the uploaded data with the recent SAM. Hope that helps.

(samtools flagstats reports a 100% mapping rate)

I ran Jupiter with:

jupiter name=Micpus ref=reference.fa fa=scaffolds.fa MAPQ=0
Sven-Winter commented 1 year ago

Hi @JustinChu,

I know you are likely busy with other tasks but did you maybe have the chance to look into my two assemblies?

I very much appreciate your help.

JustinChu commented 1 year ago

@sven1990 I'll try to have a fix or explanation by Friday.

JustinChu commented 1 year ago

@mptrsen I think there might be a conflict of temporary files causing issues in your output. I get different results when I run the data with the same parameters minBundleSize=1 and maxGap=9999999999 : Micpus_1_9999999999

Having such a high max gap probably isn't very accurate to true synteny interpretations so here are results with minBundleSize=1 and maxGap=100000 (default): Micpus_1_100000

I also ran it off your sam file and got the same figure so I suspect it is an issue with the *links* files.

I'm not sure if this completely lines up with your MUMMER results so let me know if you see something strange.

The file conflict is something I think I can resolve the issue by making relevant parameters part of the temporary filenames so this happens less, though at the moment make sure you are removing and *links* files after changing parameters relevant to links.

JustinChu commented 1 year ago

@sven1990 The issue in your case was the ng parameter was filtering the scaffolds that you were interested in. I honestly should have recognized it sooner. Sorry, I took so long. Here are the results with ng=120: test

Perhaps I should change the default as assemblies are becoming far more contiguous these days. Maybe a more useful filtering parameter rather than ng would be a minimum size like the parameter I use for the reference genome.

Sven-Winter commented 1 year ago

@JustinChu thank you a lot, really appreciate your help. I did not know you could set ng >100.