Edinburgh-Genome-Foundry / DnaFeaturesViewer

:eye: Python library to plot DNA sequence features (e.g. from Genbank files)
https://edinburgh-genome-foundry.github.io/DnaFeaturesViewer/
MIT License
584 stars 90 forks source link

Big coordinates not handled #23

Closed Yu-jinKim closed 4 years ago

Yu-jinKim commented 4 years ago

Hi,

I'm testing out your tool to see if it's suited to my needs e.g. visualize primers in a pretty way.

from dna_features_viewer import GraphicFeature, GraphicRecord
import matplotlib.pyplot as plt

sequence = "TCAAGCTTGCCATCTCTTCATGTTAGGAAACAAAAAGCCCTAGAAGCAGAATTAGATGCTCAGCACTTATCAGAAACTTT"

record = GraphicRecord(sequence=sequence, features=[
    GraphicFeature(start=112173530, end=112173530+20, strand=+1, color="#ffd700",
                   label="Primer1"),
    GraphicFeature(start=112173530+50, end=112173530+70, strand=-1, color="#cffccc",
                   label="Primer2"),
])
zoom_start1, zoom_end1 = 112173530, 112173530+30  # coordinates of the "detail"
zoom_start2, zoom_end2 = 112173530+40, 112173530+80
cropped_record1 = record.crop((zoom_start1, zoom_end1))
cropped_record2 = record.crop((zoom_start2, zoom_end2))

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14, 3))

# PLOT THE WHOLE SEQUENCE

ax1.set_title("Whole sequence", loc='left', weight='bold')
record.plot(ax=ax1)
ax1.fill_between((zoom_start1, zoom_end1), +80, -80, alpha=0.15)
ax1.fill_between((zoom_start2, zoom_end2), +80, -80, alpha=0.15)

# PLOT THE SEQUENCE DETAILS

cropped_record1.plot(ax=ax2)
cropped_record1.plot_sequence(ax=ax2)
ax2.set_title("Sequence detail", loc='left')

#

cropped_record2.plot(ax=ax3)
cropped_record2.plot_sequence(ax=ax3)
ax3.set_title("Sequence detail", loc='left')

fig.savefig('overview_and_detail.png')

I copied some of your code and replaced some values to suit a real life situation and the program returns ValueError: out-of-bound cropping. Can the tool not handle big coordinates?

Zulko commented 4 years ago

The issue here is that the sequence parameter you provide to GraphicRecord should be the whole sequence, not the sequence of the segment. In your case it is not practical because your sequence is huge but there is another solution. Do not use crop. Instead use first_index=11217xxxx... in GraphicRecord, and set sequence_length to 40. Provide the short sequence as you do now. Sorry i am on mobile. Let me know if that makes sense.

Yu-jinKim commented 4 years ago

Thanks for the quick response, I managed to do that:

from dna_features_viewer import GraphicFeature, GraphicRecord
import matplotlib.pyplot as plt

sequence = "#complete sequence"

record = GraphicRecord(sequence=sequence, first_index=112173530-20, sequence_length=700, features=[
    GraphicFeature(start=112173530, end=112173530+20, strand=+1, color="#ffd700",
                   label="Primer1"),
    GraphicFeature(start=112173530+50, end=112173530+70, strand=-1, color="#cffccc",
                   label="Primer2"),
])

sequence = "TCAAGCTTGCCATCTCTTCATGTTAGGAAACAAAAAGCCCTAGAAGCAGAATTAGATGCTCAGCACTTATCAGAAACTTT"

record_detail = GraphicRecord(sequence=sequence, first_index=112173530-10, sequence_length=40, features=[
    GraphicFeature(start=112173530, end=112173530+20, strand=+1, color="#ffd700",
                   label="Primer1"),
    GraphicFeature(start=112173530+50, end=112173530+70, strand=-1, color="#cffccc",
                   label="Primer2"),
])

# PLOT THE WHOLE SEQUENCE

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 2))

record.plot(ax=ax1)
record_detail.plot(ax=ax2)
record_detail.plot_sequence(ax2)

fig.savefig('overview_and_detail.png')

And I kinda get what I wanted. However is there a way to keep the original coordinates using first_index? Right now the x coordinates are truncated (112173530 becomes 3530 for example), I tried increasing the figsize to give enough space to fit the numbers but it doesn't work.

Zulko commented 4 years ago

I'm not entirely sure about your scenario but here is some alternative code which (I hope) does the same thing:

from dna_features_viewer import GraphicFeature, GraphicRecord
import matplotlib.pyplot as plt
full_sequence = 120_000_000 * "n"  # replace with the actual sequence.
x = 112173530  # index of the region to plot
primer1 = GraphicFeature(start=x, end=x+20, strand=+1, color="#ffd700", label="Primer1")
primer2 = GraphicFeature(start=x+50, end=x+70, strand=-1, color="#cffccc", label="Primer2")
record = GraphicRecord(sequence = full_sequence, features=[primer1, primer2])
region_record = record.crop((x-20, x+700))
detail_record = record.crop((x-5, x+75))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 3), gridspec_kw={"width_ratios": [1, 2]})
region_record.plot(ax=ax1)
detail_record.plot(ax=ax2)
detail_record.plot_sequence(ax2)

image

I am not sure why the x coordinates are truncated for you but it could be due to Matplotlib's way of writing the coordinates. Have you noticed that there is a "+1.1217e8" notation under the axis, indicating that you should add 112,170,000 to the x-coordinate.

Yu-jinKim commented 4 years ago

Hi again,

First of thanks for all your help. Second, I think i have a snippet which I does what i envisioned in the beginning.

sequence = "TCAAGCTTGCCATCTCTTCATGTTAGGAAACAAAAAGCCCTAGAAGCAGAATTAGATGCTCAGCACTTATCAGAAACTTTTGACAATATAGACAATTTAAGTCCCAAGGCATCTCATCGTAGTAAGCAGAGACACAAGCAAAGTCTCTATGGTGATTATGTTTTTGACACCAATCGACATGATGATAATAGGTCAGACAATTTTAATACTGGCAACATGACTGTCCTTTCACCATATTTGAATACTACAGTGTTACCCAGCTCCTCTTCATCAAGAGGAAGCTTAGATAGTTCTCGTTCTGAAAAAGATAGAAGTTTGGAGAGAGAACGCGGAATTGGTCTAGGCAACTACCATCCAGCAACAGAAAATCCAGGAACTTCTTCAAAGCGAGGTTTGCAGATCTCCACCACTGCAGCCCAGATTGCCAAAGTCATGGAAGAAGTGTCAGCCATTCATACCTCTCAGGAAGACAGAAGTTCTGGGTCTACCACTGAATTACATTGTGTGACAGATGAGAGAAATGCACTTAGAAGAAGCTCTGCTGCCCATACACATTCAAACACTTACAATTTCACTAAGTCGGAAAATTCAAATAGGACATGTTCTATGCCTTATGCCAAATTAGAATACAAGAGATCTTCAAATGATAGTTTAAATAGTGT"
seq_start = 112173530
name1 = "Primer_1"
start1 = 112173630
end1 = 112173650
strand1 = +
name2 = "Primer_2"
start2 = 112174069
end2 = 112174091
strand2 = -

chunks = [sequence[i:i+80] for i in range(0, len(sequence), 80)]

primers = [
    GraphicFeature(start=start1, end=end1, strand="{}1".format(strand1), color="#ffd700",
                    label=name1),
    GraphicFeature(start=start2, end=end2, strand="{}1".format(strand2), color="#cffccc",
                    label=name2)
]

records = []

for i, chunk in enumerate(chunks):
    record = GraphicRecord(sequence=chunk, first_index=seq_start+i*80, sequence_length=80, features=primers)
    records.append(record)

fig, axes = plt.subplots(nrows=len(chunks), figsize=(14, 10))

for i, ax in enumerate(axes):
    ax.ticklabel_format(useOffset=False, style='plain')
    record = records[i]
    # record.finalize_ax(ax= ax, features_levels = 15, annotations_max_level = 15, auto_figure_height = True, ideal_yspan=1)
    record.plot(ax=ax)
    record.plot_sequence(ax = ax, background = None)

fig.savefig('utils/primer_visualization/{}-{}.pdf'.format(name1, name2), format="pdf")

However I can't figure out how to set the size of the records for them to be a good size for the information to be displayed nicely.

image

image

I try using the finalize_ax method but I couldn't find a good fit. I thought I used about using a multiplier depending of the size of the coverage but it doesn't seem satisfactory to me. Any advice?

Zulko commented 4 years ago

Hi, that's actually a difficult scenario but there should be a (complicated) solution where you would first plot each line as an individual figure to determine how much vertical space it needs needs, then you create subplots as you do now, with a figure height equal to the total, and each ax with the right proportional height (using gridspec_kw={"width_ratios": [1, 2]} in plt.subplots()). I don't have a computer at hand now, I'll try it tomorrow, as this could become a new feature in the library.

Zulko commented 4 years ago

I pushed v2.4 today. In this version, you can plot multiline plots as shown in this code example. This new feature will (hopefully) always make the figure high enough so that the lines don't get squeezed.

This should get you closer to what you want to achieve. Let me know if it helps or if you have any suggestion.

Yu-jinKim commented 4 years ago

Awesome! I had finally given up and put a multiplier to determine the height of the whole figure. I have noticed 2 things:

Other than that, it looks great! Thanks for all your help

Zulko commented 4 years ago

Great suggestions thanks. I pushed a new version 2.5.0 which doesnt use offsets, adds "," to separate thousands and millions, and lets you define graphic_record.ticks_resolution=25 to have ticks only every 25bp.

Example:

from dna_features_viewer import GraphicRecord, GraphicFeature
import matplotlib.ticker as ticker
start = 123456
record = GraphicRecord(
    sequence_length=240,
    features=[
        GraphicFeature(start+10, start+70, 1, label="bla"),
        GraphicFeature(start+60, start+130, 1, label="ble"),
        GraphicFeature(start+140, start+210, -1, label="bli"),
    ],
    first_index=start,
    ticks_resolution=25
)
fig, axes = record.plot_on_multiple_lines(nucl_per_line=80)

Selection_999(868)

Let me know if that works for you.

Yu-jinKim commented 4 years ago

Brilliant! Last question hopefully: do you have a check for the size of the figure? I have some cases where the figure is bigger than 2^16 so matplotlib doesn't like it but i get to know that pretty late in the execution of the code. If I could know that before hand, I could try and crop the GraphicRecord for regions of interest.

Zulko commented 4 years ago

I believe this a bit of an extreme case, and in the current code I only know the figure's final size at 60% of the process' duration, so it wouldn't save you much time. It would be better if you did these checks on your side. Would a multi-page PDF (instead of a single big figure) fit your needs?

Yu-jinKim commented 4 years ago

I guess it would yes. Otherwise, I can just do as I said and crop for regions of interest.

Zulko commented 4 years ago

I pushed a new major version online which features a record.plot_on_multiple_pages feature (see here for an example).

Closing this thread. Please open a new issue if you run into a problem.

Yu-jinKim commented 4 years ago

Great, I'll definitely be recommending your tool to people !

Zulko commented 4 years ago

Thanks!