Edinburgh-Genome-Foundry / DnaFeaturesViewer

:eye: Python library to plot DNA sequence features (e.g. from Genbank files)
https://edinburgh-genome-foundry.github.io/DnaFeaturesViewer/
MIT License
584 stars 90 forks source link

Feature request: x_lim best-fit #66

Closed outpaddling closed 2 years ago

outpaddling commented 2 years ago

Great library, thanks for publishing!

I'm visualizing gene neighborhoods from a GFF3 file:

record = BiopythonTranslator().translate_record("jun-hood.gff3", filetype="gff")
ax, _ = record.plot(x_lim = [15500000, 16000000], strand_in_label_threshold=7)
plt.show()

By default, the left x limit is 0, right x limit is the end of the last feature. Wondering if you could either change the default or add an option so that left x limit is the beginning of the first feature instead of 0. Users could add some code before record.plot() to extract the limits from the GFF, but that would be messy and redundant.

An example of how to use x_lim in the API reference would be nice for NOOBs as well. I figured it out from "must be iterable" in the error message and some trial-and-error (I'm not much of a Pythonista).

veghp commented 2 years ago

Thank you for the feedback. We can certainly add a little more description for the parameter. As for the features, do you mean just plotting the region where features are, i.e. "trim" the sequence at both ends? That may be possible but I'm not sure that goes well with the logic of the plotting function, or even the package. But we can add it as an example.

Zulko commented 2 years ago

I agree with @veghp . I understand that automatic trimming would be very useful, but this library should probably stay focused on "plotting what the user wants" rather than "smartly detecting what to plot for the user". Users may want to leave a few nucleotides of padding around a region of interest, they may want to only show the region around certain categories of features, etc. This would eventually overload .plot (which focuses on graphical elements) with many new parameters. It woud make a nice example.

The lack of documentation for x_lims is my bad, sorry for that. As an aside, for your use case I believe you can also equivalently use graphic_record.crop as shown in this section.

outpaddling commented 2 years ago

Well, what I get without trimming:

4       pmch    gene    17336557        17338753        -1.000000       +       .       ID=gene:ENSDARG00000073959;Name=pmch
4       parpbp  gene    17337763        17353972        -1.000000       -       .       ID=gene:ENSDARG00000029944;Name=parpbp
4       th2     gene    17380578        17391091        -1.000000       -       .       ID=gene:ENSDARG00000038384;Name=th2
4       pah     gene    17391680        17409533        -1.000000       -       .       ID=gene:ENSDARG00000020143;Name=pah
4       ascl1a  gene    17417111        17419193        -1.000000       +       .       ID=gene:ENSDARG00000038386;Name=ascl1a
4       recql   gene    17424855        17538270        -1.000000       +       .       ID=gene:ENSDARG00000007175;Name=recql
4       itfg2   gene    17539134        17553996        -1.000000       +       .       ID=gene:ENSDARG00000002174;Name=itfg2
4       nrip2   gene    17572495        17629444        -1.000000       -       .       ID=gene:ENSDARG00000079985;Name=nrip2
4       klhl42  gene    17634311        17642168        -1.000000       -       .       ID=gene:ENSDARG00000006011;Name=klhl42

untrimmed

I eliminated the huge empty space by parsing the GFF to find the lowest start and highest end position of all features, then supplying the values via x_lim:

#!/usr/bin/env python3

import sys,os

import matplotlib.pyplot as plt
from dna_features_viewer import GraphicFeature, GraphicRecord, CircularGraphicRecord
from dna_features_viewer import BiopythonTranslator
from BCBio import GFF

argc = len(sys.argv)
if ( argc < 2 ):
    print("Usage:", sys.argv[0], "file.gff3 [file.gff3 ...]\n")
    exit(1)

for i in range(1, argc):
    filename = sys.argv[i]

    # Determine boundaries of the whole neighborhood
    x_min=9999999999999999
    x_max=0
    with open(filename) as infile:
        for line in infile:
            if line[0] != '#':
                cols = line.split("\t")
                start = int(cols[3])
                if start < x_min:
                    x_min = start
                end = int(cols[4])
                if end > x_max:
                    x_max = end
    infile.close()
    print("Start coordinate =", x_min, "End coordinate = ", x_max)

    # Convert GFF file to a matplotlib record
    record = BiopythonTranslator().translate_record(filename, filetype="gff")
    ax, _ = record.plot(figure_width=12, x_lim = [x_min, x_max], strand_in_label_threshold=7)
    ax.set_title(filename)
    plt.show()

    stem = os.path.splitext(filename)[0]
    png_filename = ".".join([stem,"png"])
    ax.figure.savefig(png_filename, bbox_inches='tight')

I can live with having to do this, though it's hard to imagine anyone wanting the behavior of always starting the plot at position 0.

The other issue is that there is zero padding on the right. I found this a bit disconcerting since one can't tell at a glance whether the last feature has been truncated. After computing and passing x_lim, I get a margin on the left but not on the right.

trimmed

I could still be making a rookie mistake here.

Zulko commented 2 years ago

I eliminated the huge empty space by parsing the GFF

Since the GFF has already been parsed to give a record, you don't need to parse the record line by line again. Here is what it could look like:

record = BiopythonTranslator().translate_record(filename, filetype="gff")
crop_start = min([feature.start for feature in record.features])
crop_end = max([feature.start for feature in record.features])
cropped_record = record.crop((crop_start, crop_end))
ax, _ = cropped_record.plot(figure_width=12, strand_in_label_threshold=7)

The other issue is that there is zero padding on the right.

If the GFF file doesn't declare a total length (they should, and some do!) the BCBio module will use the end of the right-most annotation as that length. This explains why you can see the chromosome continued on the left but not on the right.

one can't tell at a glance whether the last feature has been truncated

It's difficult to see here because of the dark colors used, but the last annotation has a black border on its right, indicating it was not truncated.

outpaddling commented 2 years ago

How does one declare the total length in a GFF? I've looked at several specifications (gmod, ncbi, ensembl, etc.) and can't find any info about it. Thanks...

outpaddling commented 2 years ago

Thanks for the cropping example.

Ultimately, though, BioPythonTranslator is just adding more layers of software and limiting me to Genbank and GFF without really simplifying my work. For my purposes, I think working directly with GraphicFeature and GraphicRecord makes more sense. The code below is straightforward and easily adapted to BED or non-standard file formats.

BTW, it seems that translate_record() is missing from the API reference.

Thanks for the guidance and rapid responsiveness.

#!/usr/bin/env python3

import sys,os
import matplotlib.pyplot as plt
from dna_features_viewer import GraphicFeature, GraphicRecord

argc = len(sys.argv)
if ( argc < 2 ):
    print("Usage:", sys.argv[0], "file.gff3 [file.gff3 ...]\n")
    exit(1)

for i in range(1, argc):
    filename = sys.argv[i]
    features = []

    # Build list of features
    x_min=9999999999999999
    x_max=0
    c = 0
    with open(filename) as infile:
        for line in infile:
            if line[0] != '#':
                cols = line.split("\t")
                start = int(cols[3])
                end = int(cols[4])
                if start < x_min: x_min = start
                if end > x_max: x_max = end
                strand = 1 if cols[6] == '+' else -1
                features.append(GraphicFeature(start=start, end=end,
                            strand=strand, color="#33bbbb", label=cols[1]))
    infile.close()
    print("Start coordinate =", x_min, "End coordinate = ", x_max)
    print(features)

    # Prepare for matplotlib
    record = GraphicRecord(sequence_length = x_max, features=features)
    ax, _ = record.plot(with_ruler = True, figure_width = 12,
                x_lim = [x_min - 1000, x_max + 1000],
                strand_in_label_threshold = 7)
    ax.set_title(filename)
    plt.show()

    stem = os.path.splitext(filename)[0]
    png_filename = ".".join([stem,"png"])
    ax.figure.savefig(png_filename, bbox_inches='tight')