dwhswenson / contact_map

Contact map analysis for biomolecules; based on MDTraj
GNU Lesser General Public License v2.1
42 stars 18 forks source link

"Autozoom" plotting #89

Closed dwhswenson closed 3 years ago

dwhswenson commented 4 years ago

I find that I'm frequently doing something like:

plt.xlim(*contacts.query_residue_range)
plt.ylim(*contacts.haystack_residue_range)

This zooms my plot into just the query-by-haystack region. This should be the default behavior.

To get there, I'm suggesting the following:

  1. [API BREAK] Replacing n_x, n_y in ContactCounter by query_range and haystack_range. The original idea of separate n_x and n_y was to do something like this, but the original implementation didn't work, and now everything just uses n_x = n_y = n_total (where total can be residues or atoms).
  2. Adding a default parameter autozoom=True to the .plot and .plot_axes methods.
  3. Updating the dpi warning in .plot to know what the autozoomed size will be.
  4. [API BREAK] Remove ContactObject.query_residue_range, ContactObject.haystack_residue_range. They exist entirely for this purpose.

For discussion: is the query the x axis or the y axis?

sroet commented 4 years ago

For discussion: is the query the x axis or the y axis?

Assuming query is smaller than haystack; I personally prefer small/wide figures more than tall and skinny ones. So I would vote for query as default y (with an option from the user to maybe give x = all, query, or haystack (and the same for y).

What would you opinion be? (as you more often use these figures and probably intend to publish them)

dwhswenson commented 4 years ago

Probably best to show a real-world example of each type:

image

image

This is a loop region interacting with the entire protein, with similarities and differences in two parts of a trajectory. I was originally using query ~as x~ [EDIT:] as y (similar logic to yours) but I realized that I was thinking about it like a function that takes a query residue as input and outputs locations with interactions. This is why I'm curious for other opinions.

So I would vote for query as default y (with an option from the user to maybe give x = all, query, or haystack (and the same for y).

I don't want to add more parameters to the plotting function (in fact, I'd kind of like to remove some parameters; replace vmin/vmax with a bool diverging_cmap). However, I do think it would be very reasonable for the contact count to have something like total_range (or topology_range), in addition to query_range and haystack_range. It's easy enough then for users to get the current behavior with:

fig, ax = residue_contacts.plot()
ax.set_xlim(*residue_contacts.total_range)
ax.set_ylim(*residue_contacts.total_range)

Switching axes from the default would be similarly easy. This whole issue is something that is already not too hard, but I think our defaults are not optimal. (Also, it is annoying that I get a dpi warning when I know I'm going to be zooming in enough).

sroet commented 4 years ago

Alright, looking at the real data, I find the top two more readable, so query on the x axis now has my vote. (I was somehow thinking about scaling the figure size, not the data within the same aspect ratio. So having query as x gives the most 'short and wide' data representation.)

This whole issue is something that is already not too hard, but I think our defaults are not optimal. (Also, it is annoying that I get a dpi warning when I know I'm going to be zooming in enough).

That is fair.

One thing I just now noticed, we should probably off-center the labels slightly, so they overlap with the center of the corresponding block, instead of the right(?). What do you think about that?

dwhswenson commented 4 years ago

One thing I just now noticed, we should probably off-center the labels slightly, so they overlap with the center of the corresponding block, instead of the right(?). What do you think about that?

Nope. I had though about changing that, but realized there's a reason not to: left side is resid, which is the only number guaranteed to be unique. But that means that right side is typically resSeq. So we plot between the two! 😉

dwhswenson commented 4 years ago

Another option would be to rewrite the xticks so that they are the resSeq (as a string). I believe we could then have repeated resSeq, and then we could center on resSeq (which is what would be expected). Centering on resid is likely to lead to users reporting residues that are off by one.

sroet commented 4 years ago

Another option would be to rewrite the xticks so that they are the resSeq (as a string). I believe we could then have repeated resSeq, and then we could center on resSeq (which is what would be expected). Centering on resid is likely to lead to users reporting residues that are off by one.

That would definitely be a worthwhile addition (I think)

dwhswenson commented 4 years ago

I think it could add confusion. Think of the KRas example, which jumps a bunch of residues. How do you handle the xticks there?

I'm also not keen on too much more matplotlib-specific stuff. I think moving toward holoviews (after we get to 1.0) would be interesting. What I really want, frequently, it to hover over a contact and get the name of the residues involved as well as the value of the frequency. I don't think matplotlib can do that, but bokeh can (and holoviews is kind of a backend-independent adapter).

sroet commented 4 years ago

That is a fair point, let's keep a (possible) x-tick remapping out of this issue. Also the holoviews stuff looks pretty sweet and suitable