estebanpw / chromeister

A dotplot generator for large chromosomes
GNU General Public License v3.0

Comparison of large genomes.. messy plot #12

Closed by AstrMary 3 years ago

AstrMary commented 3 years ago

Dear Esteban,

I used the CHROMEISTER tool to compare two varieties of olive (each genome is over 3 Gbp) using the command "./CHROMEISTER -query Genome1.fa -db Genome2.fa -out dotplot.mat -diffuse 1 && Rscript compute_score-nogrid.R dotplot.mat 1000". Although I got the percentage of similarity, the plot was quite messy.

Then I ran /CHROMEISTER -query Genome1.fa -db Genome2.fa -out dotplot.mat -diffuse 1 -dimension 2000 && Rscript compute_score-nogrid.R dotplot.mat 2000, but the plot was messy again.

The genomes that I am comparing have hundreds of contigs, and thus the figure looks messy. Could you provide some instructions on tuning the parameters to generate a better plot?

Thanks a lot, Mary

estebanpw commented 3 years ago

Dear Mary,

What do you mean by messy? Can you share the dotplot? Do you mean something like the multi-FASTA example in the Readme (find it here), but on a larger scale?

What score are you getting?

If there are a lot of contigs and these are unordered, then there is nothing that can be done in chromeister to make it less "messy". Also, what is your aim with the comparison?

If you could share the dotplot (or a section of it, not sure if you have privacy constraints) then I might be able to help :-)

AstrMary commented 3 years ago

Dear Esteban, thank you for your prompt response. My score is 0.996, setting z = 1. If I run the command ./CHROMEISTER -query Genome1.fa -db Genome2.fa -out dotplot.mat -diffuse 1 && Rscript compute_score.R dotplot.mat 1000, I get the plot below:

dotplot mat filtWithGrids

If I run this command: /CHROMEISTER -query Genome1.fa -db /Genome2.fa -out dotplot.mat -diffuse 1 && Rscript compute_score-nogrid.R dotplot.mat 1000, I get the following plot:

dotplot mat filt

I expected to get something like this:

dotplot mat filtsame

We are trying to choose our reference genome, so my aim is to compare these two Olea varieties in order to assess their similarity. Yes, you are right: the genomes of both varieties have a great many contigs, so this might be the problem.

Your help would be more than welcome :-) Many thanks, Maria

estebanpw commented 3 years ago

Dear Mary,

thanks for the info. I think that the problem might be this one:

There are so many contigs (around 40k in one of the genomes) that none of them is long enough on its own for chromeister to consider it an interesting signal (and it might be filtering them out). This would be aggravated if the contigs were unordered (which I assume they are, since otherwise you would probably have assembled them into larger scaffolds; is this true?)
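To put rough numbers on this, here is a small sketch of the arithmetic. It assumes that the plot matrix has `dimension` cells per axis (1000 and 2000 in the commands above) and that each cell therefore covers roughly genome length / dimension bases; that mapping is an assumption made for illustration, and the function names are hypothetical. The genome size and contig count are the rounded figures mentioned in this thread.

```python
def bases_per_cell(genome_length: int, dimension: int) -> float:
    """Approximate number of bases covered by one matrix cell along one axis."""
    return genome_length / dimension


def mean_contig_length(genome_length: int, n_contigs: int) -> float:
    """Average contig length if the genome is split into n_contigs pieces."""
    return genome_length / n_contigs


genome_length = 3_000_000_000  # "over 3 Gbp", rounded down
n_contigs = 40_000             # "around 40k" contigs

contig = mean_contig_length(genome_length, n_contigs)
for dimension in (1000, 2000):
    cell = bases_per_cell(genome_length, dimension)
    print(
        f"dimension={dimension}: ~{cell:,.0f} bp per cell, "
        f"mean contig ~{contig:,.0f} bp, "
        f"~{cell / contig:.0f} average contigs per cell"
    )
```

Under these assumptions an average contig is many times shorter than the stretch of sequence covered by a single cell, which is consistent with no individual contig producing a visible signal even when the dimension is doubled.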

In any case I have just included a new script unfiltered_plot.R that should be able to plot without filtering anything, so that you should also be able to see the matches. You can run this script similarly to what you were doing:

(remember to git pull origin first in your local repository)

./CHROMEISTER -query Genome1.fa -db Genome2.fa -out dotplot.mat -diffuse 1 && Rscript unfiltered_plot.R dotplot.mat 1000

For instance, running Homo sapiens chrX against Mus musculus chrX with this script generates:

HX-MX-unfiltered

Note that the plot is now more "blurry" as it includes single matches which are considered "noise" in the original CHROMEISTER pipeline.

And in a case that might be similar to yours, this is a comparison of two contig files (from the same species):

contigs-contigs

As you can see there are matches between the contigs, but these are scattered around according to the order in the files.

However, if it is in fact the case that contigs are small and unordered, then I doubt that you will get a straight diagonal, but rather a lot of scattered points (such as in the previous contigs example). Let me know if this helps and please post here the new results that you get with the unfiltered script.
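As a toy illustration of why unordered contigs scatter the points (a sketch with made-up contig lengths, not CHROMEISTER's actual algorithm), the following compares a set of contigs against itself, once in the same order and once shuffled. Each contig is reduced to a single point given by its start coordinate in each concatenated file; all names here are hypothetical.

```python
import random


def contig_offsets(lengths, order):
    """Start coordinate of each contig when concatenated in the given order."""
    offsets, pos = {}, 0
    for idx in order:
        offsets[idx] = pos
        pos += lengths[idx]
    return offsets


def dotplot_points(lengths, order_a, order_b):
    """One (x, y) point per contig: its start in file A and its start in file B."""
    a = contig_offsets(lengths, order_a)
    b = contig_offsets(lengths, order_b)
    return [(a[i], b[i]) for i in range(len(lengths))]


def fraction_on_diagonal(points):
    """Share of contigs that start at the same coordinate in both files."""
    return sum(1 for x, y in points if x == y) / len(points)


rng = random.Random(0)  # fixed seed so the run is repeatable
lengths = [rng.randint(10_000, 100_000) for _ in range(1000)]
ordered = list(range(len(lengths)))
shuffled = ordered[:]
rng.shuffle(shuffled)

same = dotplot_points(lengths, ordered, ordered)
mixed = dotplot_points(lengths, ordered, shuffled)
print("same order:", fraction_on_diagonal(same))
print("shuffled:  ", fraction_on_diagonal(mixed))
```

With identical ordering every point sits on the main diagonal; once one file is shuffled, the same matches still exist but almost none of them land on it.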

Bests, Esteban

AstrMary commented 3 years ago

Dear Esteban, thank you so much for your new script! You really helped me a lot.

I just ran it and got the plot below:

dotplot mat filt

Yes, it seems that the contigs are small and unordered... I will try to find a way to order them and then run the new script again.

As soon as I have the new results I will post them :-)

Thank you for your time, Maria

estebanpw commented 3 years ago

I am happy it helped!

Btw: if you have a reference genome (even if it consists of contigs or scaffolds, as long as they are in order) then you can compare the unordered one with the reference and sort the contigs according to the coordinates of the alignments they are matched to. Also, let me know if you find a better way to order them, since it's a problem we have had in the past at our lab.
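A minimal sketch of that sorting idea, assuming you already have alignments from some aligner as `(contig_id, ref_start, aligned_length)` tuples. This tuple layout, the function name, and the "use the longest alignment per contig" rule are all assumptions made for illustration, not something chromeister produces:

```python
def order_contigs(alignments):
    """Order contigs by where their longest alignment starts on the reference.

    alignments: iterable of (contig_id, ref_start, aligned_length) tuples.
    Contigs without any alignment do not appear in the result.
    """
    best = {}  # contig_id -> (ref_start, aligned_length) of its longest hit
    for contig, ref_start, length in alignments:
        if contig not in best or length > best[contig][1]:
            best[contig] = (ref_start, length)
    return sorted(best, key=lambda c: best[c][0])


# Hypothetical example data
alignments = [
    ("contig_3", 500_000, 40_000),
    ("contig_1", 10_000, 35_000),
    ("contig_3", 2_000_000, 5_000),  # shorter secondary hit, ignored
    ("contig_2", 120_000, 60_000),
]
print(order_contigs(alignments))  # ['contig_1', 'contig_2', 'contig_3']
```

Using only the longest alignment keeps the sketch short; contigs that match several distant regions, or that align in reverse orientation, would need extra handling.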

Bests, esteban

AstrMary commented 3 years ago

To be honest, this is the first time I have faced this kind of problem, so your advice is very helpful. Of course, I will post the results with a description of my analysis, and I will be glad to read your comments. Kind thanks, Maria

AstrMary commented 3 years ago

Dear Esteban, I hope you are well. Further to our previous conversation, I checked the quality of the two assemblies and used the better-quality assembly to order the other one. The tool that I used for the ordering is Mauve, and I ran it from the command line: "java -Xmx500m -cp Mauve.jar org.gel.mauve.contigs.ContigOrderer -output results_dir -ref reference.gbk -draft draft.fasta"

Then I ran your tool, "./CHROMEISTER -query ordered.fasta -db reference.fasta -out dotplot.mat -diffuse 1 && Rscript compute_score-nogrid.R dotplot.mat 1000", and got this plot:

dotplot mat filtWithNoFilter

Then I ran the following command, "./CHROMEISTER -query ordered.fasta -db reference.fasta -out dotplot.mat -diffuse 1 && Rscript unfiltered_plot.R dotplot.mat 1000", and got:

dotplot mat filtered ordered

Any comments? :-) Thank you again for your help, Maria

estebanpw commented 3 years ago

Dear Maria,

It's looking much better now! I would say that the diagonal is "there", just some contigs/scaffolds seem to be missing or not in the correct order.

CHROMEISTER won't get you much further though, as it's mostly aimed at producing the "big picture". Do you need the actual alignments?

Also:

The tool that I used for the ordering is the Mauve and I run it through command line "java -Xmx500m -cp Mauve.jar org.gel.mauve.contigs.ContigOrderer -output results_dir -ref reference.gbk -draft draft.fasta"

I didn't know Mauve could do that. Thanks for letting me know!

Bests, Esteban

AstrMary commented 3 years ago

Dear Esteban,

Your tool is exactly what I was looking for!

Thank you again for your help, Maria