claczny / VizBin

Repository of our application for human-augmented binning

how to define the color in vizbin? #33

Closed yoyohashao closed 8 years ago

yoyohashao commented 8 years ago

Hello Cedric, I wanna re-plot my clusters with new colors instead of the default blue. So I created a label file in which I left non-target sequences as 0 and used red, yellow, green, orange, black, pink, purple, cyan, and gold for the different clusters. However, in the visualisation, there were many more red dots than I expected and some colors didn't show. If I didn't provide the label information, all sequences would be blue dots. But now it seems like red is the default color. Also, the yellow-green fusion region can be divided into 2 clusters by MetaBAT (https://bitbucket.org/berkeleylab/metabat/wiki/Home).

May I know the available colors and markers in VizBin? I referred to the color set in ggplot2 but clearly some color names couldn't be used in VizBin.

Many thanks

claczny commented 8 years ago

Hi @yoyohashao ,

I like your idea of checking whether the automatically generated bins can be replicated using a complementary approach. We are doing this too and I have heard from several other peers that they follow such an approach.

Regarding the labels in the annotation file, they do not designate particular colours, nor has such a function been implemented. We wanted it to be flexible so that pretty much anything could be used as a label. This means you could use numbers (1,2,3,4,5,...), or letters (a,b,c,d,e,...) or prior taxonomic assignments (E.coli, S.aureus, G.dermatophilus, ...) as labels. The colour (and shape BTW) is then determined automatically.

If you want to pretty-plot the embedding, in particular using functionality in R, you can easily do that. All you need is the point.txt file that is created in every run. That file holds the 2D coordinates, which can then be used directly in R. The location of that file can be found in the .log file. For ease of use, I would however recommend downloading the version of VizBin from the devel branch, if you have not done so already. That version allows you to save the "workspace", i.e., the sequences, the log, and the 2D coordinates, in a ZIP file. That workspace can then be reloaded at a later point in time, and no recomputation of the 2D coordinates is required.
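As a rough illustration of how the coordinates and the labels can be combined outside of VizBin, here is a minimal Python sketch. It rests on assumptions that are not confirmed in this thread: that point.txt holds one comma-separated "x,y" pair per line, in the same order as the sequences, and that the annotation file holds one label per line after a single header line. The function names are hypothetical; adjust the parsing to whatever your files actually look like.

```python
import csv
from collections import defaultdict


def read_points(path):
    """Read 2D coordinates, ASSUMING one comma-separated "x,y" pair per line."""
    points = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or incomplete lines
            points.append((float(row[0]), float(row[1])))
    return points


def read_labels(path, has_header=True):
    """Read one label per line, ASSUMING the first line is a header."""
    with open(path) as handle:
        lines = handle.read().splitlines()
    return lines[1:] if has_header else lines


def group_by_label(points, labels):
    """Collect the coordinates of all points that share a label."""
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return dict(groups)
```

Each entry of the returned dictionary can then be passed to any plotting library as one series with its own colour and marker.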

Regarding the "fusioned" clusters, there are many reasons why MetaBAT could have separated them. Without having a look at the automatically generated bins, I can only guess. I would then guess that these are closely related organisms that thus have very similar genomic signatures and are unlikely to be separated based on these signatures alone. After all, they are similar :) Since MetaBAT uses abundance-information across samples, i.e., fold-coverage covariation, the respective organisms might be sufficiently different in their abundances and can thus be separated. Have you checked whether the automatically generated bins are complete and homogeneous, e.g., using the often used set of 107 "essential" genes from Dupont2012/Albertsen2013?
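To make the completeness check concrete, here is a minimal Python sketch of one simplified way to score a bin against a marker set. It is an assumption-laden illustration, not the procedure from the cited papers: `gene_hits` is a hypothetical list of essential-gene identifiers detected in one bin (for example from an HMM search), completeness is taken as the fraction of the 107 genes seen at least once, and the fraction of genes seen more than once is used as a crude proxy for contamination.

```python
from collections import Counter

N_ESSENTIAL = 107  # size of the "essential" gene set mentioned above


def bin_quality(gene_hits, n_essential=N_ESSENTIAL):
    """Return (completeness, redundancy) for one bin.

    gene_hits: identifiers of essential genes found in the bin, with
    repeats when a gene is found more than once (hypothetical input).
    """
    counts = Counter(gene_hits)
    # Fraction of the marker set observed at least once.
    completeness = len(counts) / n_essential
    # Fraction of the marker set observed more than once (crude proxy
    # for contamination; real tools use more careful definitions).
    redundancy = sum(1 for n in counts.values() if n > 1) / n_essential
    return completeness, redundancy
```

A bin with many duplicated marker genes is a hint that sequences from more than one organism ended up together.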

What is with all the red points? Are they not assigned to individual clusters by MetaBAT or did you simply not highlight them?

Best,

Cedric

yoyohashao commented 8 years ago

Thanks for the reply, Cedric. I changed my color names into numbers [0,non-target. 1-10: targets] and the PNG just looked like the previous one :\ There should be 11 different labels but all I saw was the same picture. The dominant red was not my intention at all. In the previous pic, I specified red color but i guess vizbin just treated it as a label, like 'Ecoli'. I don't know why it's still like that this time. I found the point.txt file in the workspace, so I can probably call R to do the plotting, but I'd rather just do it in VizBin.

The fusioned region consisted of two closely related strains, more like different species from the same genus. MetaBat is based on tetranucleotide frequency so it might be sensitive to them. It's funny that MetaBat and Vizbin were all published in later summer so there was no comparison between vizbin and metabat. Before I tried vizbin and metabat, I used CONCOCT.

The clusters I'd like to colorize all have more than 50% of those 107 essential genes and with contamination rate lower than 10%.

Fang Liu PhD Candidate Rm. 3-410 School of Life Sciences and Biotechnology Shanghai Jiaotong University Shanghai,China 200240 https://www.researchgate.net/profile/Fang_Liu37

claczny commented 8 years ago

I changed my color names into numbers [0,non-target. 1-10: targets] and the PNG just looked like the previous one :\

This is exactly the intended behavior: What the labels are called does not matter.

The dominant red was not my intention at all.

Red-coloured "x"s are the second colour-shape combination that is created by VizBin. As such, I would guess that the red "x"s represent your first target, i.e., label "1", assuming that your annotation file starts with a "0" label and the next non-"0" label in the file is "1". My earlier question meant to know whether the labels are entirely based on the output of MetaBAT, i.e., in your case, did MetaBAT return 10 clusters (1-10: targets) and a set of contigs that were not assigned to any cluster (0: non-target)?

I specified red color but i guess vizbin just treated it as a label, like 'Ecoli'.

Yes. We are still pondering this, but I think that it would be nice to allow the user to specify which colour and shape should be used for which label. Such functionality might be implemented alongside resolving Issue #22. However, for now, for pretty-plotting, I would recommend using a dedicated drawing library (ggplot in R, matplotlib in Python, ...) as it offers much more control over the colour-shape choice.

There should be 11 different labels

You might want to display the legend: Right-click on the visualization as if you wanted to export the selection. The click will open a menu and you will find a Legend entry there. This should show you the different labels that VizBin has found and used. VizBin currently cycles through five colours and a bunch of symbols. The latter allows the same colour to be used for different labels, e.g., in your screenshot, orange appears twice, but once with an _up_ward triangle shape and once with a _down_ward triangle shape.
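The cycling described above can be pictured with a small Python sketch. This is only an illustration of the idea, not VizBin's actual code: the palette, the marker list, and the way the two are combined are assumptions, chosen so that the first two combinations are a blue dot and a red "x" as seen in this thread. The point it shows is that a style depends on the order in which a label first appears, not on what the label is called.

```python
COLOURS = ["blue", "red", "green", "orange", "purple"]  # hypothetical palette of five
SHAPES = ["o", "x", "^", "v", "s", "D", "+"]  # hypothetical marker set


def assign_styles(labels):
    """Give each distinct label a (colour, shape) pair in order of first appearance."""
    styles = {}
    for label in labels:
        if label not in styles:
            index = len(styles)  # 0 for the first new label, 1 for the second, ...
            colour = COLOURS[index % len(COLOURS)]
            shape = SHAPES[index % len(SHAPES)]
            styles[label] = (colour, shape)
    return styles
```

Under these assumptions, an annotation file with labels 0, 1, 2 and one with labels red, yellow, green produce the same sequence of styles, and a label called "red" is not drawn in red unless it happens to be the second label encountered.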

It's funny that MetaBat and Vizbin were all published in later summer so there was no comparison between vizbin and metabat.

The comparison in the MetaBAT paper is focussed on automated clustering solutions, rather than solutions integrating human input for decision making. That is fine with me, as these tools (MetaBAT and VizBin) differ considerably in their underlying methodologies and are thus hard to directly compare.

MetaBat is based on tetranucleotide frequency

If you have a look at Fig. 2 at https://peerj.com/articles/1165/ you can see that MetaBAT includes probabilistic modelling of TNF and abundance distances. Fig. 1 at the same link illustrates that MetaBAT uses data from multiple samples, while VizBin, in its current form, is a single-sample-based approach and thus has considerably less data/information to leverage.

Best,

Cedric

yoyohashao commented 8 years ago

Red-coloured "x"s are the second colour-shape combination that is created by VizBin. As such, I would guess that the red "x"s represent your first target, i.e., label "1", assuming that your annotation file starts with a "0" label and the next non-"0" label in the file is "1". My earlier question meant to know whether the labels are entirely based on the output of MetaBAT, i.e., in your case, did MetaBAT return 10 clusters (1-10: targets) and a set of contigs that were not assigned to any cluster (0: non-target)?

Thanks for the explanation. I didn't use the MetaBAT clusters to plot. I used the same metagenome file but created an annotation file according to the clusters I picked from VizBin and the clusters that got divided by MetaBAT. That is to say, the original PNG had the same number and pattern, but all the dots were blue as there were no additional files. In the new plot, I wanna make some clusters pop out by providing an annotation file.

Actually the 'red' thing in the PNG you saw should be the non-target sequences (label=0), which explains why there were so many red dots. And the major blue block was supposed to be 'red' (label=1).

You might want to display the legend: Right-click on the visualization as if you wanted to export the selection. The click will open a menu and you will find a Legend entry there. This should show you the different labels that VizBin has found and used. VizBin currently cycles through five colours and a bunch of symbols. The latter allows the same colour to be used for different labels, e.g., in your screenshot, orange appears twice, but once with an upward triangle shape and once with a downward triangle shape.

Now I see there are ten different types in total, in terms of the combination of shapes and colors. The default is blue dot, whose value is 2. And '0' doesn't mean anything in vizbin legend. So if I don't wanna assign a specific label for a sequence, can I leave it blank in the annotation file? Maybe I should just use the clusters of interest to plot the figure I want, which will be noise-free.

It's funny that MetaBat and Vizbin were all published in later summer so there was no comparison between vizbin and metabat.

The comparison in the MetaBAT paper is focussed on automated clustering solutions, rather than solutions integrating human input for decision making. That is fine with me, as these tools (MetaBAT and VizBin) differ considerably in their underlying methodologies and are thus hard to directly compare.

MetaBat is based on tetranucleotide frequency

If you have a look at Fig. 2 at https://peerj.com/articles/1165/ you can see that MetaBAT includes probabilistic modelling of TNF and abundance distances. Fig. 1 at the same link illustrates that MetaBAT uses data from multiple samples, while VizBin, in its current form, is a single-sample-based approach and thus has considerably less data/information to leverage.

In terms of the completeness and contamination rate of the bins, vizbin performed better than Metabat. MetaBat can work with a single sample or multiple samples, but I guess as an automated method, it's better when using multiple samples.

Thanks for the timely response and your patience. Best support of the year!

claczny commented 8 years ago

Now I see there are ten different types in total, in terms of the combination of shapes and colors. The default is blue dot, whose value is 2. And '0' doesn't mean anything in vizbin legend.

Not quite. There is a little glitch; see also Issue #22.

annotation file according to the clusters I picked from VizBin

Now I got it :)

Maybe I should just use the clusters of interest to plot the figure I want, which will be noise-free.

That might be one way. Yet, I think you have pretty much everything you need to make a nice plot with ggplot in R. By that I mean that you have the 2D coordinates (point.txt) and an annotation file that highlights your points of interest. You could then simply use ggplot and geom_point() to get a stylish figure. And within R you are free to assign the shapes and colours based on the labels you chose.
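The same idea works with matplotlib in Python, which was mentioned earlier as an alternative to ggplot. The following is a minimal sketch under stated assumptions: `points` is a list of (x, y) tuples and `labels` a list of label strings in the same order, both already loaded from your own files, and `chosen` is a hypothetical dictionary mapping each label of interest to a (colour, marker) pair of your choice. Every other label is drawn as light grey background so the clusters of interest stand out.

```python
def split_points(points, labels, chosen):
    """Separate background points from the points of each chosen label."""
    background = []
    highlighted = {label: [] for label in chosen}
    for point, label in zip(points, labels):
        if label in highlighted:
            highlighted[label].append(point)
        else:
            background.append(point)
    return background, highlighted


def plot_embedding(points, labels, chosen, out_path="embedding.png"):
    """Draw the embedding with user-chosen colours and markers per label."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    background, highlighted = split_points(points, labels, chosen)
    fig, ax = plt.subplots(figsize=(6, 6))
    if background:
        xs, ys = zip(*background)
        # Background first, so highlighted clusters are drawn on top.
        ax.scatter(xs, ys, color="lightgrey", marker=".", s=5, label="other")
    for label, cluster in highlighted.items():
        if not cluster:
            continue  # label requested but absent from the data
        xs, ys = zip(*cluster)
        colour, marker = chosen[label]
        ax.scatter(xs, ys, color=colour, marker=marker, s=12, label=label)
    ax.legend(title="label")
    fig.savefig(out_path, dpi=300)
    plt.close(fig)
```

For example, `plot_embedding(points, labels, {"1": ("red", "o"), "2": ("gold", "^")})` would draw label "1" as red circles and label "2" as gold triangles, with everything else in grey.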

In terms of the completeness and contamination rate of the bins, vizbin performed better than Metabat.

That is good to hear :+1:

Thanks for the timely response and your patience. Best support of the year!

Happy to be of help!