deeptools / pyGenomeTracks

python module to plot beautiful and highly customizable genome browser tracks
GNU General Public License v3.0
752 stars 111 forks source link

Arcs color scale based on score #30

Closed gtrichard closed 4 years ago

gtrichard commented 6 years ago

Hello @fidelram and @vivekbhr ,

I started using pyGenomeTracks and I really enjoy it, thanks for the hard work 👍

Would it be possible to use the last column of test.arcs file (https://github.com/deeptools/pyGenomeTracks/blob/master/pygenometracks/tests/test_data/test.arcs) as a basis for a color scale of the arcs in order to display, for example, the interaction strength between 2 loci?

https://github.com/deeptools/pyGenomeTracks/blob/master/pygenometracks/tests/test_data/test.arcs

Best, Gautier

DonStephano commented 6 years ago

The newest version of pyGenomeTracks is very very useful and saves a lot of time!

@gtrichard s idea of the color-scale would is great and would be a cool feature. For us it would be also helpful to add a rgb-based-column so that each arc or a previously specified group of arcs (i.e. all interactions from a promotor) is plotted in a separate color (see example).

test

yichao-cai commented 4 years ago

I am diving into the pyGenomeTracks and I found it very useful! Thanks a lot for the tool!

After digging into the enhancements that you are currently working on (#102), I found that the color-scale based on the score column has been added into the pull request. Is it possible to add in the feature of using additional columns in .arc file for customization of the plotting style? That would be super useful.

For example,

# chrom1    start1    end1    chrom2    start2    end2    score    [additional column]
chr1    1    100    chr1    501    600    4    red
chr1    201    300    chr1    501    600    4    blue

Use the addition column as the attribute of the color when doing the plotting. Something like assigning colors to different levels of factors in R/ggplot2.

For example, if the 'color' field in .ini file specified as a color map name, scores are used to scale color to each arc (which has been included in the pull request #102 ). If the 'color' field in .ini file specified as 'predefined', the additional column would be used as the predefined color when calling plot_arcs().

For further versatile plotting, would you consider including user pre-defined attributes if it is possible? For example, include a field of 'additionalColumn' in .ini file to be used as the key of the additional column to be used in self.properties that overrides the default value. For example, in .arc file:

chr1    1    100    chr1    501    600    4    red    dotted
chr1    201    300    chr1    501    600    4    blue    dashed

In .ini file:

[test arcs]
file = test.arcs
title = links orientation=inverted
orientation = inverted
line style = dashed
height = 2
additionalColumn = ["color", "line style"]

To use the additional column to override the original value, self.properties["color"] = ["red", "blue"] and self.properties["line style"] = ["dotted", "dashed"].

These are just some thoughts that I think would be good to have. I don't have the experience of developing such tools in Python, so please bear with me if I raise anything that is impossible to implement.

EDIT:

So I have a little twist on the LinkTracks.py. Hope this might help others that are looking for color-coding each arc.

A couple of things that I did:

  1. Add self.col_additional for placeholder of additional columns that are specified in the .ini file.
  2. Unpack each line in the .arc file with chrom1, start1, end1, chrom2, start2, end2, score, *data_additional = line.strip().split('\t') to capture the data in the additional column. And score together with data_additional are passed to the interval tree in dictionaries instead of just a double (previously just scores are passed).
  3. Modify the plot_arcs() to use the user-specified properties passed in the additional columns. If not found, fall back to defaults. Also, I twisted the height of arcs according to the score (previously this is set to the spanned distance of the arc).
  4. Adjust the plotting of y-axis. Previously, plotting range was set to self.max_height * 1.1. This is good but when the scores are small (1 or 2), this would result in a relatively large extension in the plotting range. I kept using the 1.1 extension factor in plotting range, but position the max y label to the position of the highest arc (self.max_height) instead of the position of the max y range (self.max_plot_height).
from . GenomeTrack import GenomeTrack
from intervaltree import IntervalTree, Interval
import numpy as np
import re

class LinksTrack(GenomeTrack):
    SUPPORTED_ENDINGS = ['.arcs', '.arc', '.link', '.links']
    TRACK_TYPE = 'links'
    OPTIONS_TXT = GenomeTrack.OPTIONS_TXT + """
# the file format for links is (tab separated)
#   chr1 start1 end1 chr2 start2 end2 score
# for example:
#   chr1 100 200 chr1 250 300 0.5
# depending on the links type either and arc or a 'triangle' can be plotted. If an arc,
# a line will be drawn from the center of the first region (chr1: 150, tot the center of the other region (chr1:275).
# if a triangle, the vertix of the triangle will be drawn at the center between the two points (also the center of
# each position is used)
# links whose start or end is not in the region plotted are not shown.
# color of the lines
color = red
# for the links type, the options are arcs and triangles, the triangles option is convenient to overlay over a
# Hi-C matrix to highlight the matrix pixel of the highlighted link
links type = arcs
# if line width is not given, the score is used to set the line width
# using the following formula (0.5 * square root(score)
# line width = 0.5
# options for line style are 'solid', 'dashed', 'dotted' etc. The full list of
# styles can be found here: https://matplotlib.org/gallery/lines_bars_and_markers/linestyles.html
line style = solid
file_type = {}
# additional columns
# If additional columns is given and names matched the properties of the links (either 'color', 'line_style' or 'line_width')
    """.format(TRACK_TYPE)

    def __init__(self, *args, **kwarg):
        super(LinksTrack, self).__init__(*args, **kwarg)
        # the file format expected is similar to file format of links in
        # circos:
        # chr1 100 200 chr1 250 300 0.5
        # where the last value is a score.
        if 'line width' not in self.properties:
            self.properties['line width'] = 0.5
        if 'line style' not in self.properties:
            self.properties['line style'] = 'solid'
        if 'links type' not in self.properties:
            self.properties['links type'] = 'arcs'
        if 'color' not in self.properties:
            self.properties['color'] = 'blue'
        if 'alpha' not in self.properties:
            self.properties['alpha'] = 0.8
        # Check whether additional columns are passed in the .ini file
        if 'addition columns' in self.properties.keys():
            print("Additional columns supplied in .ini file")
            self.col_additional = self.properties.pop('addition columns')
            self.col_additional = list(re.sub("[' \[\]]", "", self.col_additional).split(","))
            print(self.col_additional)
        else:
            self.col_additional = None

        self.max_height = None
        self.max_plot_height = None
        valid_intervals = 0
        interval_tree = {}
        line_number = 0

        with open(self.properties['file'], 'r') as file_h:
            for line in file_h.readlines():
                line_number += 1
                if line.startswith('browser') or line.startswith('track') or line.startswith('#'):
                    continue
                try:
                    chrom1, start1, end1, chrom2, start2, end2, score, *data_additional = line.strip().split('\t')
                except Exception as detail:
                    self.log.error('File not valid. The format is chrom1 start1, end1, '
                                   'chrom2, start2, end2, score\nError: {}\n in line\n {}'.format(detail, line))
                    exit(1)

                try:
                    start1 = int(start1)
                    end1 = int(end1)
                    start2 = int(start2)
                    end2 = int(end2)
                except ValueError as detail:
                    self.log.error("Error reading line: {}. One of the fields is not "
                                   "an integer.\nError message: {}".format(line_number, detail))
                    exit(1)

                assert start1 <= end1, "Error in line #{}, end1 larger than start1 in {}".format(line_number, line)
                assert start2 <= end2, "Error in line #{}, end2 larger than start2 in {}".format(line_number, line)
                try:
                    score = float(score)
                except ValueError as detail:
                    self.log.error("Error reading line: {}. The score is not valid {}. "
                                   "\nError message: {}".format(line_number, detail))
                    exit(1)

                if chrom1 != chrom2:
                    self.log.warn("Only links in same chromosome are used. Skipping line\n{}\n".format(line))
                    continue

                if chrom1 not in interval_tree:
                    interval_tree[chrom1] = IntervalTree()

                if start2 < start1:
                    start1, start2 = start2, start1
                    end1, end2 = end2, end1

                # each interval spans from the smallest start to the largest end
                if self.col_additional is None:
                    interval_tree[chrom1].add(Interval(start1, end2, {'score': score}))
                else:
                    if(len(data_additional) != len(self.col_additional)):
                        raise ValueError("The number of additional columns is not equal to the length of list that you passed in the .ini file. Please chekc again your .ini and link file!!!!")
                    else:
                        infoCols = dict(zip(['score'] + self.col_additional, [score] + data_additional))
                        interval_tree[chrom1].add(Interval(start1, end2, infoCols))
                valid_intervals += 1

        if valid_intervals == 0:
            self.log.warn("No valid intervals were found in file {}".format(self.properties['file']))

        file_h.close()
        self.interval_tree = interval_tree

    def plot(self, ax, chrom_region, region_start, region_end):
        """
        Makes and arc connecting two points on a linear scale representing
        interactions between Hi-C bins.
        :param ax: matplotlib axis
        :param label_ax: matplotlib axis for labels
        """
        self.max_height = 0
        count = 0

        if chrom_region not in list(self.interval_tree):
            chrom_region_before = chrom_region
            chrom_region = self.change_chrom_names(chrom_region)
            if chrom_region not in list(self.interval_tree):
                self.log.error("*Error*\nNeither " + chrom_region_before + " "
                               "nor " + chrom_region + " exits as a chromosome"
                               " name inside the link file.\n")
                return

        chrom_region = self.check_chrom_str_bytes(self.interval_tree, chrom_region)

        arcs_in_region = sorted(self.interval_tree[chrom_region][region_start:region_end])

        for idx, interval in enumerate(arcs_in_region):
            # skip intervals whose start and end are outside the plotted region
            if interval.begin < region_start and interval.end > region_end:
                continue

            if 'line width' in self.properties:
                self.line_width = float(self.properties['line width'])
            else:
                self.line_width = 0.5 * np.sqrt(interval.data)

            if self.properties['links type'] == 'triangles':
                self.plot_triangles(ax, interval)
            else:
                self.plot_arcs(ax, interval)

            count += 1

        # the arc height is equal to the radius, the track height is the largest
        # radius plotted plus an small increase to avoid cropping of the arcs
        self.max_plot_height = self.max_height * 1.1
        self.log.debug("{} were links plotted".format(count))
        if 'orientation' in self.properties and self.properties['orientation'] == 'inverted':
            ax.set_ylim(self.max_plot_height, -0.01)
        else:
            ax.set_ylim(-0.01, self.max_plot_height)

        self.log.debug('title is {}'.format(self.properties['title']))

    #  This is the original y axis plotting function in LinkTrack.py
    # def plot_y_axis(self, ax, plot_axis):
    #     pass

    def plot_y_axis(self, ax, plot_axis):
        """
        Plot the scale of the y axis with respect to the plot_axis
        Args:
            ax: axis to use to plot the scale
            plot_axis: the reference axis to get the max and min.

        Returns:

        """
        if 'show data range' in self.properties and self.properties['show data range'] == 'no':
            return

        def value_to_str(value):
            # given a numeric value, returns a
            # string that removes unneeded decimal places
            if value % 1 == 0:
                str_value = str(int(value))
            else:
                str_value = "{:.1f}".format(value)
            return str_value

        ymin, ymax = plot_axis.get_ylim()

        # Swap the position of ymax and ymin if arcs are plotted invertedly.
        if 'orientation' in self.properties and self.properties['orientation'] == 'inverted':
            ymax_str = '0'
            ymin_str = value_to_str(self.max_height)
        else:
            ymax_str = value_to_str(self.max_height)
            ymin_str = '0'

        # plot something that looks like this:
        # ymax ┐
        #      │
        #      │
        # ymin ┘
        # the coordinate system used is the ax.transAxes (lower left corner (0,0), upper right corner (1,1)
        # this way is easier to adjust the positions such that the lines are plotted complete
        # and not only half of the width of the line.
        ## CYC: Calculate the position of max value in the score column of the interval tree of the track. See equations below: (inverted arcs equation is not given be should be complementary to this one)
        ##    ymax_rel_pos      (ymax-0)/1.1 + 0.01
        ##    ------------   =  -------------------   
        ##         1                 ymax - ymin
        ##    ymax_rel_pos      self.max_height + 0.01
        ##    ------------   =  ----------------------   
        ##         1            self.max_plot_height + 0.01

        if 'orientation' in self.properties and self.properties['orientation'] == 'inverted':
            ymax_rel_pos = 0.99
            ymin_rel_pos = 1 - (self.max_height + 0.01) / (self.max_plot_height + 0.01)
        else:
            ymax_rel_pos = (self.max_height + 0.01) / (self.max_plot_height + 0.01)
            ymin_rel_pos = 0.01

        x_pos = [0, 0.5, 0.5, 0]
        y_pos = [ymin_rel_pos, ymin_rel_pos, ymax_rel_pos, ymax_rel_pos]
        ax.plot(x_pos, y_pos, color='black', linewidth=1, transform=ax.transAxes)
        ax.text(-0.2, ymin_rel_pos, ymin_str, verticalalignment='bottom', horizontalalignment='right', transform=ax.transAxes)
        ax.text(-0.2, ymax_rel_pos, ymax_str, verticalalignment='top', horizontalalignment='right', transform=ax.transAxes)
        ax.patch.set_visible(False)

        ## Below is the obselete modification. 
        ## CYC: Use absolute coordinates to plot the y axis. Although I extended the plotting area in plot region, I want the y axis and labels to fit the original dataset.
        ## That is, for x=[0,1,2,3,4,5] and y=[0,1,2,3,4,5], I plot the figure with a range of (-0.01,5*1.01) to avoid cropping. However, I let the y axis to fir the original range. If not, 5*1.01 would be show as the max value of y axis. If the given data range for y is small, this would be a dramatic difference.
        # _xmin, _xmax = ax.get_xlim()
        # _xmid = (_xmin + _xmax)/2

        # x_pos = [0, _xmid, _xmid, 0]
        # y_pos = [0, 0, self.max_height, self.max_height]
        # print(type(ax))
        # ax.plot(x_pos, y_pos, color='black', linewidth=1)
        # ax.text(-1*_xmid/5*2, 0, ymin_str, verticalalignment='bottom', horizontalalignment='right')
        # ax.text(-1*_xmid/5*2, self.max_height, ymax_str, verticalalignment='top', horizontalalignment='right')
        # ax.patch.set_visible(False)

    def plot_arcs(self, ax, interval):
        '''
        Previously, radius (distance between interval.begin and interval.start) 
        is used as height.
        Modification: Use score as height instead. If you want distance as height,
        just modify the score column to be the spanned distance of each arc. It should
        be pretty easy using pandas or excel.
        '''
        from matplotlib.patches import Arc

        diameter = (interval.end - interval.begin)
        # radius = float(diameter) / 2
        center = interval.begin + float(diameter) / 2
        if interval.data['score'] > self.max_height:
            self.max_height = interval.data['score']
        ax.plot([center], [diameter])
        if(self.col_additional is None):
            ax.add_patch(Arc((center, 0), diameter,
                             interval.data['score']*2, 0, 0, 180,
                             color=self.properties['color'],
                             linewidth=self.line_width,
                             ls=self.properties['line style']))
            print("Interval: {} is printed by plot_arcs function. (no additional columns))".format(interval))
        else:
            ax.add_patch(Arc((center, 0), diameter,
                             interval.data['score']*2, 0, 0, 180,
                             color=interval.data['color'] if 'color' in self.col_additional else self.properties['color'],
                             linewidth=interval.data['line_width'] if 'line_width' in self.col_additional else self.line_width,
                             ls=interval.data['line_style'] if 'line_style' in self.col_additional else self.properties['line style']))
            print("Interval: {} is printed by plot_arcs function. (with additional columns)".format(interval))

    def plot_triangles(self, ax, interval):
        from matplotlib.patches import Polygon
        x1 = interval.begin
        x2 = x1 + float(interval.end - interval.begin) / 2
        x3 = interval.end
        y1 = 0
        y2 = (interval.end - interval.begin)

        triangle = Polygon(np.array([[x1, y1], [x2, y2], [x3, y1]]), closed=False,
                           facecolor='none', edgecolor=self.properties['color'],
                           linewidth=self.line_width,
                           ls=self.properties['line style'])
        ax.add_artist(triangle)
        if y2 > self.max_height:
            self.max_height = y2

Using a .ini file of this


[x-axis]
where = top
title = where=top

[spacer]
height = 0.05

[test arcs]
file = test.arcs
title = links orientation=inverted
orientation = inverted
line style = dashed
height = 2

[spacer]

[test bigwig]
file = bigwig2_X_2.5e6_3.5e6.bw
color = blue
height = 1.5
title = bigwig number of bins=2000
number of bins = 2000

[spacer]

[test bigwig overlay]
file = bigwig_chrx_2e6_5e6.bw
color = #0000FF80
title =
min_value =0
max_value = 50
show data range = no
overlay previous = yes
number of bins = 100

[spacer]
height = 1

[test arcs]
file = test.arcs
color = red
line width = 2
title = links line width=3
height = 3
addition columns = ['color', 'line_style']

[spacer]
height = 0.5
title = height=0.5

[x-axis]
fontsize=30
title = fontsize=30

[vlines]
file = tad_classification.bed
type = vlines

I am able to get a figure like this: test.pdf

bgruening commented 4 years ago

@YichaoCaiDBS we will release soon a new version and this issues should be fixed. If you have further requests can you please add a new issue? I gonna close this one now.