sroet commented 3 years ago

Should close #89

[non-public API-break] This code breaks some non-public API, see Other Changes below

Description of the new behaviour: For all ContactMaps, except ContactDifference (and its subclasses): It will contacts.plot() will plot query/query_residues on the x-axis and haystack/haystack_residues on the y-axis.

For ContactDifference the query and haystack and the _residue equivalents might not make sense so instead they use the new default, which tries to plot all points with as little amount of "double" (due to symmetry) points as it can.

The new n_x, n_y defaults

We construct the lower triangle of our data by sorting the keys (indices) and then we look at x and y dimensions of the smallest rectangle required to cover all the datapoints in this triangle. Effectively trying to see if there is a set of low and a set of high indices that can be used to describe all data points (like a query and a haystack). We then take the shortest of the 2 dimensions as n_x and the longest as n_y.

Other changes

The sparse_matrix and pandas.DataFrame now are guaranteed to be square, max_size by max_size (being effectively equal to the old implementation for all classes.
The plot warning now takes into acount the actual range of data being plotted
haystack_residues_indices and query_residues_indices are now calculated and stored at the initialization.
[non-public API-break] With the above point _range_from_object_list() was not needed anymore and removed.
Added a query_range and haystack_range like their *_residue_range equivalents.

Plotting everything again

The following code can be used to plot all points of the contact map:

fig, ax = residue_contacts.plot()
ax.set_xlim(*residue_contacts.total_range)
ax.set_ylim(*residue_contacts.total_range)

Some other info that was here before the last edit:

Things I like some feedback on (preferably from @dwhswenson ): This was offered as a possibility for the user:

fig, ax = residue_contacts.plot()
ax.set_xlim(0, residue_contacts.max_size)
ax.set_ylim(0, residue_contacts.max_size)

~What should total_range be? The total range we have data (min/max on the counter)? or 0 to topology.n_atoms/topology.n_residues (this would require some more code as topology is not something the ContactCounts have)?~

Nvm, we need the size anyway for casting to matrices in a sensible way, so I added it as a keyword.

~Contains similar code as #95, but that can be split out if this is ready to be merged before that one is in~

Current issues to solve:

[x] this does not work (yet) for contact differences
[x] examples need updating
[x] tests don't (yet) pass
[x] add test for the new code

sroet commented 3 years ago

Alright @dwhswenson this one is ready for a review, please let me know if this is what you intended by #89

sroet commented 3 years ago

@dwhswenson A friendly ping to make sure this is still on your radar :wink:

codecov[bot] commented 3 years ago

Codecov Report

Merging #94 (646f10b) into master (c9d1451) will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #94   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           13        13           
  Lines         1066      1125   +59     
=========================================
+ Hits          1066      1125   +59

Impacted Files	Coverage Δ
contact_map/contact_count.py	`100.00% <100.00%> (ø)`
contact_map/contact_map.py	`100.00% <100.00%> (ø)`
contact_map/plot_utils.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c9d1451...646f10b. Read the comment docs.

sroet commented 3 years ago

@dwhswenson , friendly monthly ping to make sure it is still on a list somewhere :)

sroet commented 3 years ago

@dwhswenson Just FYI; this PR needs to be squashed before merging, as it contains multiple commits altering all the notebooks, which will lead to repository bloat.

sroet commented 3 years ago

Also, I still don't agree with the codeclimate failure

dwhswenson commented 3 years ago

@sroet : Could you resolve conflicts on this branch? I'm hoping to try reviewing this soon. Since this has a lot of notebook changes, I want to hold off on notebook updates in #90 and #110 until this is in, to limit conflicts.

sroet commented 3 years ago

@sroet : Could you resolve conflicts on this branch? I'm hoping to try reviewing this soon. Since this has a lot of notebook changes, I want to hold off on notebook updates in #90 and #110 until this is in, to limit conflicts.

All done :)

dwhswenson commented 3 years ago

Also, I still don't agree with the codeclimate failure

Ah, now I remember these. (Old link had stopped working) Yes, I disagree as well. It's clear that they add "mass" to their code similarity measure when you use kwargs. That is nonsense. When these first popped up, I actually started going into their codebase to try to figure out how I might fix it, but between the fact that most of the code is CodeClimate extensions to a 3rd party code, plus the fact that it is mostly written in Ruby, I gave up.

dwhswenson commented 3 years ago

Before I get into the code, a question about the overall approach:

If I understand correctly, the x vs y axis can be either query or haystack depending on the number of residues. I think that may cause confusion for the users. Isn't it easier to just say that default x-axis is the query and default y-axis is the haystack? This seems like it would be more consistent (and, in any case, give identical results for users who follow our recommendation to keep the query smaller than the haystack.)

sroet commented 3 years ago

If I understand correctly, the x vs y axis can be either query or haystack depending on the number of residues. I think that may cause confusion for the users. Isn't it easier to just say that default x-axis is the query and default y-axis is the haystack? This seems like it would be more consistent (and, in any case, give identical results for users who follow our recommendation to keep the query smaller than the haystack.)

Not really, that is just the ContactCount default (which has no idea of the existence of queries or haystacks).

For all classes that use ContactsDict.atom_contacts (or residue_contacts) we explicitly set n_x to query_range and n_y to haystack_range.

If we make a ContactCount from somewhere else (like in ContactDifference), then we use the data inside the counter to get the best guess (these might not even be complete queries or haystacks if there are certain contacts missing in the counters)

sroet commented 3 years ago

@dwhswenson Did the refactor (moved relevant code to plot_utils), and added the plot to the "changing the defaults" ipynb please have another look

sroet commented 3 years ago

Thought: Should ContactPlotRange be public API?

I have no preference on this, so I will make it private.

dwhswenson commented 3 years ago

LGTM. Squash-merging as suggested, given many changes to notebooks (almost forgot to do that!)

dwhswenson / contact_map

[non-public API-break] Autozoom plots #94

The new n_x, n_y defaults

Other changes

Plotting everything again

Some other info that was here before the last edit:

Codecov Report