deeptools / pyGenomeTracks

python module to plot beautiful and highly customizable genome browser tracks
GNU General Public License v3.0
752 stars 111 forks source link

feature request: visualize chromatin state epilogos #6

Closed crazyhottommy closed 6 years ago

crazyhottommy commented 6 years ago

Hi, Many thanks for adding Hi-C support. I also deal with chromHMM state data. can you please add visualizing epilogos?

https://epilogos.altiusinstitute.org/

https://github.com/Altius/epilogos

you can download some example files http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/epilogos/

https://github.com/kcakdemir/HiCPlotter#epilogos-plotting- has some support, but it is hard-coded for 15 states. I have some in-house data and the states number maybe different.

Thanks! Tommy

fidelram commented 6 years ago

Thanks for the suggestion. I looked at one example and the epilogs look like this:

chr10   1400    1600    id:8,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.00
chr10   1600    1800    id:9,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.00
chr10   1800    2000    id:10,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.0
chr10   2000    2200    id:11,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.0
chr10   2200    2400    id:12,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.0
chr10   2400    2600    id:13,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.0
chr10   2600    2800    id:14,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.0
chr10   2800    3000    id:15,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.0

Is there a description of the file format? How do you get the labels for this image:

image

crazyhottommy commented 6 years ago

Hi,

The histogram-like graph (is a consolidated representation of the raw chromHMM calls from multiple samples) on the top is the one I want from your tool. the lower part with labels are the raw chromHMM calls for each sample.

chr10   1400    1600    id:8,qcat:[ [-0.0079,6], [-0.0056,17], [-0.0035,13], [-0.0023,5], [-0.0013,11], [-0.0007,16], [-0.0006,12], [-0.0005,10], [-0.0005,9], [-0.0003,1], [-0.00

for each bin ( this case chromHMM is called with 200bp bin size and total 18 state number, but it can change to whatever bin size or state number when use the chromHMM command),

epilogos gives you weights of each state for that bin. e.g. for state 6, the weights is -0.0079.
for state 17, the weights is -0.0056.

I am saying weights, but it is some metrics derived from entropy I think.

one can plot a overlapping bar graph for each bin ( with height equal to the weights), and set transparency alpha = 0.5 so that we can see 15 states in a single bin.

one has to define a color for each 15 state.

That's what I understand. I also asked https://github.com/Altius/epilogos/issues/4

Thanks, Tommy

meuleman commented 6 years ago

Actually, overlapping/transparent bargraphs are usually not necessary: what is shown is a stacked bargraph. The 'qcat' format is essentially a list of "state, value" pairs, ordered by "value". This dictates the order in which states/colors should be shown in the bargraph. The "values" are the contributions of each state the the overall epilogos score.

Hope it helps. Excited to see possible integration of epilogos with deepTools!

crazyhottommy commented 6 years ago

@meuleman thanks for chime in!

fidelram commented 6 years ago

@meuleman regarding the labels for the states I assume that this needs to be given by the user or is there any standard format that (eg json) to report this information? Or maybe the qcat file can have a header?

@crazyhottommy I am quite busy at the moment but I will try to take a look. Do you have any python experience? I can guide you on how to make a epilogs track.

crazyhottommy commented 6 years ago

Thanks @fidelram. For the first question, there is a convention from the ENCODE for 15 states http://wiki.wubrowse.org/QuantitativeCategorySeries

This can be set as default. but in reality, one can have 18 or any number of state model, and need to specify 18 different colors. support for user provided color will be nice.

For the second. I do have some python experience, but quite limited :) I have taken https://github.com/crazyhottommy/MIT6.00.1x-Introduction-to-Computer-Science-and-Programming-Using-Python and https://github.com/crazyhottommy/Coursera_Bioinformatics_for_Beginners

Recently, I have been playing a lot with Snakemake. https://gitlab.com/tangming2005/snakemake_DNAseq_pipeline/tree/multiRG

I should be able to make a PR with the help of you. I just hope it will not take more time from you than you implement it...

PS, just followed you on twitter :)

Best, Tommy

fidelram commented 6 years ago

@crazyhottommy I added some basic support for epilogos.

To use them you will need the install the epilogos branch

This is the example of the output based on a tabix file with 15 categories

image

The configuration file used was:

[epiloto]
file = epilog.qcat.bgz
height = 5
title = height=5

[spacer]

[x-axis]

The colors are chosen using some non-continuous color list.

What is missing is the option to input the desired colors and the description for the classes. Any ideas?

@meuleman I didn't know how to handle negative values (actually, what do they mean?). What I assumed is that for positive values, the scores reported stack on top of each other in the given order starting from cero. For negative values I assume that the scores also stack on top of each other in the same order as positive, but they start not from zero but from the sum of negative values.

If for example the scores are (without the class id) [-2, -1, 0, 1, 2] what I do is to plot the first score from -3 to -1, the next from -1 to 0, then 0 to 0, 0 to 1 and 1 to 3. Is this the correct interpretation?

crazyhottommy commented 6 years ago

@fidelram that looks beautiful! thx! for colors, one need to specify in the config file as well using ymal or json format? see http://wiki.wubrowse.org/QuantitativeCategorySeries

something like:

categories:{
          '1':['Active TSS','#ff0000'],
          '2':['Flanking Active TSS','#ff4500'],
          '3':['Transcr at gene 5\' and 3\'','#32cd32'],
          '4':['Strong transcription','#008000'],
          '5':['Weak transcription','#006400'],
          '6':['Genic enhancers','#c2e105'],
          '7':['Enhancers','#ffff00'],
          '8':['ZNF genes & repeats','#66cdaa'],
          '9':['Heterochromatin','#8a91d0'],
          '10':['Bivalent/Poised TSS','#cd5c5c'],
          '11':['Flanking Bivalent TSS/Enh','#e9967a'],
          '12':['Bivalent Enhancer','#bdb76b'],
          '13':['Repressed PolyComb','#808080'],
          '14':['Weak Repressed PolyComb','#c0c0c0'],
          '15':['Quiescent/Low','#ffffff']
    },
}

one has to specify the hex color and also the name for each state. The name is useful when adding a color key (legend) instead of just 1 to 15.

@meuleman I do not know how the negative values should be handled either.

meuleman commented 6 years ago

Looks great!

The way you're handling the negatives right now seems to be the correct way

On Tue, Mar 27, 2018 at 9:52 AM, Ming Tang notifications@github.com wrote:

@fidelram https://github.com/fidelram that looks beautiful! thx! for colors, one need to specify in the config file as well using ymal or json format? see http://wiki.wubrowse.org/QuantitativeCategorySeries

something like:

categories:{ '1':['Active TSS','#ff0000'], '2':['Flanking Active TSS','#ff4500'], '3':['Transcr at gene 5\' and 3\'','#32cd32'], '4':['Strong transcription','#008000'], '5':['Weak transcription','#006400'], '6':['Genic enhancers','#c2e105'], '7':['Enhancers','#ffff00'], '8':['ZNF genes & repeats','#66cdaa'], '9':['Heterochromatin','#8a91d0'], '10':['Bivalent/Poised TSS','#cd5c5c'], '11':['Flanking Bivalent TSS/Enh','#e9967a'], '12':['Bivalent Enhancer','#bdb76b'], '13':['Repressed PolyComb','#808080'], '14':['Weak Repressed PolyComb','#c0c0c0'], '15':['Quiescent/Low','#ffffff'] }, }

one has to specify the hex color and also the name for each state. The name is useful when adding a color key (legend) instead of just 1 to 15.

@meuleman https://github.com/meuleman I do not know how the negative values should be handled either.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/deeptools/pyGenomeTracks/issues/6#issuecomment-376533149, or mute the thread https://github.com/notifications/unsubscribe-auth/ABDFAWs4n3_tww-QqwT0wbpSn63Ze_VLks5tikQzgaJpZM4SjMhE .

crazyhottommy commented 6 years ago

@meuleman what does negative values mean then? thx

crazyhottommy commented 6 years ago

Thx @fidelram, please pin me when it is possible to specify a customer color configuration :)

crazyhottommy commented 6 years ago

sorry to bug, but any updates on this? thanks you :)

fidelram commented 6 years ago

Hi, I am sorry but I had not been able to work on this. Still in my todo list.

On Wed, May 23, 2018 at 2:30 AM Ming Tang notifications@github.com wrote:

sorry to bug, but any updates on this? thanks you :)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/deeptools/pyGenomeTracks/issues/6#issuecomment-391182973, or mute the thread https://github.com/notifications/unsubscribe-auth/AEu_1YejV1Mg7E1siNXbe4TzVIT2W5UNks5t1K2MgaJpZM4SjMhE .

crazyhottommy commented 6 years ago

no problem. really need this for one of my figures. thx!!

fidelram commented 6 years ago

I added now suport for categories colors using a json file like the following (noticed that single quotes are changed to double quotes and trailing comas are removed compared to the previous example):

{
"categories":{
          "1":["Active TSS","#ff0000"],
          "2":["Flanking Active TSS","#ff4500"],
          "3":["Transcr at gene 5\" and 3\"","#32cd32"],
          "4":["Strong transcription","#008000"],
          "5":["Weak transcription","#006400"],
          "6":["Genic enhancers","#c2e105"],
          "7":["Enhancers","#ffff00"],
          "8":["ZNF genes & repeats","#66cdaa"],
          "9":["Heterochromatin","#8a91d0"],
          "10":["Bivalent/Poised TSS","#cd5c5c"],
          "11":["Flanking Bivalent TSS/Enh","#e9967a"],
          "12":["Bivalent Enhancer","#bdb76b"],
          "13":["Repressed PolyComb","#808080"],
          "14":["Weak Repressed PolyComb","#c0c0c0"],
          "15":["Quiescent/Low","#ffffff"]
    }
}

Currently, the categories names are not used, only the color definition

crazyhottommy commented 6 years ago

Hi, thanks! The categories names are "Active TSS" etc? can I provide a json file with only 10 entries instead of 15 in this example? Tommy

fidelram commented 6 years ago

The number of categories have to match to the number of categories in the epilog.qcat file. The names of the categories are irrelevant, only the color is used.

crazyhottommy commented 6 years ago

thanks! that's good enough for now.

fidelram commented 6 years ago

Let me know if you stumble on any problem. Otherwise I will merge this branch into master.

crazyhottommy commented 6 years ago

go ahead. I am traveling and will test when I am back. Thanks!

crazyhottommy commented 6 years ago

If this is merged to master, I would like to try it out. You may want to add some documentation as well. Thanks again!

crazyhottommy commented 6 years ago

Hi, I am ready to test. What's the best way to install it? should I install from github eiplogos branch or it is better to have a conda receipt? Thanks!

fidelram commented 6 years ago

best is to install a branch using pip.

First, be sure to remove any installation of pgt:

conda remove pyGenomeTracks

Then install with pip

pip install git+https://github.com/deeptools/pyGenomeTracks.git@epilogos
crazyhottommy commented 6 years ago
pip install git+https://github.com/deeptools/pyGenomeTracks.git@epilogos
Collecting git+https://github.com/deeptools/pyGenomeTracks.git@epilogos
  Cloning https://github.com/deeptools/pyGenomeTracks.git (to epilogos) to /var/folders/79/x06wz9v560q10gw881n9c_z0x7m0vl/T/pip-Itdj5y-build
Collecting numpy>=1.12.1 (from pyGenomeTracks==1.0)
  Could not fetch URL https://pypi.python.org/simple/numpy/: There was a problem confirming the ssl certificate: [SSL: TLSV1_ALERT_PROTOCOL_VERSION] tlsv1 alert protocol version (_ssl.c:590) - skipping
  Could not find a version that satisfies the requirement numpy>=1.12.1 (from pyGenomeTracks==1.0) (from versions: )
No matching distribution found for numpy>=1.12.1 (from pyGenomeTracks==1.0)

A conda recipe will be better for me?

fidelram commented 6 years ago

Before installing pgt try:

conda install numpy>=1.12.1
crazyhottommy commented 6 years ago

have to install hicexplorer first

conda create -n pgt -c bioconda python=3  hicexplorer
source activate pgt
pip install git+https://github.com/deeptools/pyGenomeTracks.git@epilogos

installation seemed fine but a different error showed up

$ pyGenometracks
Traceback (most recent call last):
  File "/Users/mtang1/anaconda/envs/pgt/bin/pyGenometracks", line 4, in <module>
    from pygenometracks.plotTracks import main
  File "/Users/mtang1/anaconda/envs/pgt/lib/python3.6/site-packages/pygenometracks/plotTracks.py", line 150, in <module>
    import pygenometracks.tracksClass
  File "/Users/mtang1/anaconda/envs/pgt/lib/python3.6/site-packages/pygenometracks/tracksClass.py", line 31, in <module>
    from pygenometracks.tracks import *
ModuleNotFoundError: No module named 'pygenometracks.tracks'
fidelram commented 6 years ago

I have been hit by this before but I don't remember how did I manage to solve it. I will try installing pgt as you did to see if I find a solution.

crazyhottommy commented 6 years ago

Is it possible to merge to the master and release a bioconda version? thanks

fidelram commented 6 years ago

@crazyhottommy I found the problem!

Now it works.

simply do:

source activate pgt
pip uninstall pyGenomeTracks
pip install git+https://github.com/deeptools/pyGenomeTracks.git@epilogos

If you have a .qcat or .qcat.bgz file you can do:

make_tracks_file --trackFiles <your_file.qcat> --out <your_config_file.ini>

For a conda installation we need to make a new release and thus a prefer some testing before we do this.

crazyhottommy commented 6 years ago

@fidelram thx! it worked! However, it took 5-6 mins for me to make a figure with only 2 tracks of epilogos for --region chr7:116105904-116533055. I guess it is because when the region is big, there are a lot of bars that need to be drawn.

you may want to check https://github.com/kcakdemir/HiCPlotter/blob/master/HiCPlotter.py#L1341 to produce something like https://github.com/kcakdemir/HiCPlotter#epilogos-plotting-

When the region is big, need to think about how to present the bar graphs in a way that is fast and also not losing information.

BTW, without plotting the border of the bars may make it faster? or somehow need to index the qcat file? I saw you previously mentioned a tabix file.

How can I make the scale the same for the tracks?

Thanks for supporting this! Tommy

fidelram commented 6 years ago

Hi, if your qcat is not indexed using tabix it will take a lot of time just to read the files. See #13 for info on how to make a tabix file.

With the scale of the tracks you mean the limits of the y axis? Probably you can use min_value and max_value. Let me know if that works.

crazyhottommy commented 6 years ago

I figured that. thanks! now it only takes me within 10 seconds! yes, I meant the y-axis scale. I set the min_value and max_value, but it did not work.

my int file:

[x-axis]
#optional
#fontsize=20
# default is bottom meaning below the axis line
# where=top

[spacer]
# height of space in cm (optional)
height = 0.5

[NRpre]
file=NRpre.qcat.bgz
min_value = -0.4
max_value = 8.2

# title of track (plotted on the right side)
title = NRpre
# height of track in cm (ignored if the track is overlay on top the previous track)
height = 2
# if the track wants to be plotted upside-down:
# orientation = inverted
# if the track wants to be plotted on top of the previous track. Options are 'yes' or 'share-y'. For the 'share-y'
# option the y axis values is shared between this plot and the overlay plot. Otherwise, each plot use its own scale
#overlay previous = yes

# The categories file should contain the color information for each category id
# A categories file should look like:
# {
# "categories":{
#           "1":["Active TSS","#ff0000"],
#           "2":["Flanking Active TSS","#ff4500"],
#           "3":["Transcr at gene 5" and 3"","#32cd32"],
#           "4":["Strong transcription","#008000"],
#           "5":["Weak transcription","#006400"]
#   }
#}
categories_file = epilogos.json

[Rpre]
file=Rpre.qcat.bgz
min_value = -0.4
max_value = 8.2
# title of track (plotted on the right side)
title = Rpre
# height of track in cm (ignored if the track is overlay on top the previous track)
height = 2
# if the track wants to be plotted upside-down:
# orientation = inverted
# if the track wants to be plotted on top of the previous track. Options are 'yes' or 'share-y'. For the 'share-y'
# option the y axis values is shared between this plot and the overlay plot. Otherwise, each plot use its own scale
#overlay previous = yes

# The categories file should contain the color information for each category id
# A categories file should look like:
# {
# "categories":{
#           "1":["Active TSS","#ff0000"],
#           "2":["Flanking Active TSS","#ff4500"],
#           "3":["Transcr at gene 5" and 3"","#32cd32"],
#           "4":["Strong transcription","#008000"],
#           "5":["Weak transcription","#006400"]
#   }
#}
categories_file = epilogos.json

but the figure looks great!

screenshot 2018-07-26 12 09 14

Thanks!

crazyhottommy commented 6 years ago

One more comment, If I set overlay previous = "share-y"

Traceback (most recent call last):
  File "/Users/mtang1/anaconda/envs/pgt/bin/pyGenomeTracks", line 11, in <module>
    main(args)
  File "/Users/mtang1/anaconda/envs/pgt/lib/python3.6/site-packages/pygenometracks/plotTracks.py", line 301, in main
    trp.plot(args.outFileName, *region, title=args.title)
  File "/Users/mtang1/anaconda/envs/pgt/lib/python3.6/site-packages/pygenometracks/tracksClass.py", line 227, in plot
    plot_axis = axisartist.Subplot(fig, grids[idx, 1])
  File "/Users/mtang1/anaconda/envs/pgt/lib/python3.6/site-packages/matplotlib/gridspec.py", line 148, in __getitem__
    raise IndexError("index out of range")
IndexError: index out of range
pyGenomeTracks --tracks config.ini --region chr7:116105904-116533055 --title   6.86s user 0.25s system 98% cpu 7.179 total
fidelram commented 6 years ago

I am not getting the error.

fidelram commented 6 years ago

I just merged the epilogos branch into master.

crazyhottommy commented 6 years ago

Thanks! so far it works for me as well.