JustinChu / JupiterPlot

A Circos-based tool to visualize genome assembly consistency or synteny between assemblies.
GNU General Public License v3.0
68 stars 11 forks source link

Synteny graph not representing alignment results: I appreciate your help #48

Closed BioJules closed 3 weeks ago

BioJules commented 1 month ago

HI Justin,

First of all, thanks a lot for your Pipeline. It helped me a lot for showing my preliminary results. Here I aligned my repeat-masked genome assembly to a reference genome (95% of the reference genome is covered) with minimap2 and got a really nice graph. However, I encounter a problem with my final results. I aligned my non repeat masked genome assembly to another genome. I did the same as for the preliminary results. The alignment reveals that 94% of the reference genome are covered by my scaffolds. The synteny plot that was created by the pipeline does not show this (see below). I have tried different settings but the pipeline shows only a few of my scaffolds and almost nothing changed with the different settings.

  1. jupiter name=NSCvsHLE_T4 ref=NSC_Genome_16Chr_XY_unwrapped.fasta fa=HLE_GenomeAssembly_241scaffolds_unwrapped.fa sam=HLEvsNSC_nonMaskedT4.sam t=18 labels=both maxGap=15000

  2. jupiter name=HLEvsNSC_T5 ref=NSC_Genome_16Chr_XY_unwrapped.fasta fa=HLE_GenomeAssembly_241scaffolds_unwrapped.fa sam=HLEvsNSC_nonMaskedT4.sam t=18 labels=both maxGap=150000

  3. jupiter name=HLEvsNSC_T6 ref=NSC_Genome_16Chr_XY_unwrapped.fasta fa=HLE_GenomeAssembly_241scaffolds_unwrapped.fa sam=HLEvsNSC_nonMaskedT4.sam t=18 m=50000 maxScaff=60 i=360 g=150000 gScaff=150000 labels=both maxGap=150000 minBundleSize=5000 MAPQ=0

I would appreciate it if you could help me getting a graph that represents the results. I don't understand why some of the reference chromosomes are not covered. Thanks a lot Best wishes Julia

NSCvsHLE_T4

JustinChu commented 1 month ago

The pipeline filters smaller contigs. Try setting ng=100 or even higher.

BioJules commented 4 weeks ago

Thanks a lot, now I figured out how to show the number of scaffolds. However, I encountered another issue. In the file below, the single lines (links) that are crisscrossing are irritating. These are probably repeats. How can I get rid of them so that I can see the main links better. Thank you a lot. MANvsHLE_T5

JustinChu commented 4 weeks ago

What is the MAPQ setting? Increasing the MAPQ threshold can help with that if they are indeed repeats though it may be noted that depending on how repetitive the genome is there could be collateral damage in a sense on other links as well. Generally allowing it to filter even up to 60 can be safe if there isn't a high repeat rate.

If this doesn't work there is a good chance these might not be repeats. However if they are small enough you could filter using minBundleSize though this is purely a length based threshold so it can remove real links you may be interested in.

BioJules commented 4 weeks ago

Thank you for your fast reply. I tried different parameter combinations and nothing changed in the graph. The max mapping quality is 60 and my general filter was MAPQ=50.

Here are commands 1) jupiter name=MANvsHLE_T5 ref=GenomeAssembly_unwrapped.fa sam=Test.sam t=40 labels=both maxGap=20000 maxScaff=36 MAPQ=50 ng=99 m=5000000

2) jupiter name=MANvsHLE_T5 ref=GenomeAssembly_unwrapped.fa sam=Test.sam t=40 labels=both maxGap=20000 maxScaff=36 MAPQ=50 ng=99 m=5000000 minBundleSize=3000000

3) jupiter name=MANvsHLE_T5 ref=GenomeAssembly_unwrapped.fa sam=Test.sam t=40 labels=both maxGap=20000 maxScaff=36 MAPQ=50 ng=99 m=5000000 minBundleSize=10000000

4) jupiter name=MANvsHLE_T5 ref=GenomeAssembly_unwrapped.fa sam=Test.sam t=40 labels=both maxGap=20000 maxScaff=36 MAPQ=55 ng=99 m=5000000 minBundleSize=10000000

The graphs look all the same (as the one in my last post). It seems that the parameter minBundleSize has no effect at al. I attach the log file for the last run. Jupiter_MAN_T5_1012875.scyld.localdomain.log

What am I doing wrong? Thanks again Julia

JustinChu commented 4 weeks ago

Hmm, how large do you think those repeats? Depends what the size you might need to use a really large minBundleSize. Or maybe there is a bug which you can check by setting it to a really small value and seeing if there isn't a change in the opposite direction (more links).

JustinChu commented 4 weeks ago

Do you see any change at all changing the values? I'd expect it to start removing some small elements at some point. However, for perspective if that genome is human sized I'd say the small contiguous chunks could be hundreds of megabases in size so I'd keep going larger until you start seeing entire large syntenic regions you want to keep disappear.

BioJules commented 4 weeks ago

I will run it with a low number of minBundleSize. However, as the m is set to 5mio (only use genomic reference chromosomes larger than this value) I thought that setting minBundleSize to a similar or even bigger amount would only show ribbons that are at least as wide as the smallest reference ideogram.

I will keep you updated

BioJules commented 4 weeks ago

The result with the minBundleSize=10 is identical to the ones with high minBundleSize. Is it a bug? BTW the lines 13 and 15 of my log file are commented out. Both commands are related to linkCollaps. Is this important?

JustinChu commented 4 weeks ago

Oh, linkCollapse.pl that was some code I was experimenting on a long time ago. I actually forgot I left it in there (probably should be removed). However, it shouldn't be an issue.

Sanity check with maxGap=1 and minBundleSize=1. You can also set minBundleSize to larger than your largest contig; it should then not render any links at all.

Identical plots in general are a bit concerning. I don't think your log reflects this but are you re-running the pipeline in the same location? If you are you should remove these files (if they still exist) otherwise it will use old files and render the same plot multiple times:


prefix.svg
prefix.links.final
prefix.rv.links
prefix.fw.links
prefix.karyotype
prefix.conf

The pipeline design was to allow me to prototype while not having to re-rerun alignment stage of the pipeline but it can lead to issues. I think I might try to make some changes in the future to the code that automatically detects parameter changes to the runs to specific sections of the pipeline if other runs so this doesn't happen.

JustinChu commented 4 weeks ago

This is what you should see roughly speaking, On a drosophila dataset:

test If running maxGap=1 and minBundleSize=1

test If running minBundleSize=100000000000

BioJules commented 3 weeks ago

Hi Justin,

I have no idea what happened but I just reinstalled the Jupiter Pipeline with conda and now the parameters have an effect again. The first time I have installed it from source (git clone https://github.com/JustinChu/JupiterPlot.git). So maybe there is a bug in the source file.

I appreciate your help. Have a wonderful day. Julia

JustinChu commented 3 weeks ago

Hmm if the issue was getting the same plot I suspect it had to with it accidentally using older files in the same directory but I can't say for certain and I might update the code to compensate for this in the future.

Anyway, I'm happy it worked for you.