etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
540 stars 164 forks source link

Default y-min in chromsome scatter plots #643

Closed micknudsen closed 3 years ago

micknudsen commented 3 years ago

I wonder if this should be min instead of max?

https://github.com/etal/cnvkit/blob/d4145c5e9b97817c2d625f07b44212c4176b5a1f/cnvlib/scatter.py#L370

With the current default settings, some biallelic deletion are not shown. Here is an example (CNVkit v0.9.9) without --y-min specified:

Screen Shot 2021-07-08 at 16 39 44

and with --y-min=-10:

Screen Shot 2021-07-08 at 16 39 50
tetedange13 commented 3 years ago

Hi @micknudsen,

Not an author of CNVkit, but you are probably right => I think expected behaviour was (ignoring small plotting corrections):

if -0.5 < y.min():
    y_min = -0.5  # No matter the value of "y.min()"
else:  # In other words: "y.min() < -0.5"
    y_min = y.min() 

Just to precise your issue, minimal command to reproduce is: cnvkit.py scatter -c chr4 a_sample.cnr => In other words this is happening as soon as scatter is asked to focus on a specific chromosome / region (no need to have segment, nor b-allele) => As you suggested, seems to be fixed by replacing max() by min() at line you highlighted

Thanks a lot for reporting this !

Hope this helps. Have a nice day. Felix.

tskir commented 3 years ago

This old issue of mine might be related: #385.

However, I'm not sure this behaviour is unexpected. I think (and @etal can confirm/deny) the rationale was to prevent very negative scores (such as the ones from biallelic deletions) from compressing the Y axis. Indeed, as you can see on your second graph, variations in most of the signals become hard to see now that the range on the Y axis is so wide.

However, I agree that the data shouldn't just disappear like it happens now. The best solution I can think of is to keep the current default threshold (–5), but to clip the data to that threshold rather than hiding it. @etal @tetedange13 @micknudsen what do you think of this proposal?

micknudsen commented 3 years ago

Maybe one could put a hard lower bound on the y-axis – as it is now – and then indicate segments falling below this with a line (or some other indicator) below the axis? Something along the line (pun intended) in which IGV indicates downsampling of reads by showing a thick, black line above the alignment. This could also be applied to very high values.

tetedange13 commented 3 years ago

@tskir ,

The problem I see with clipping, is cases like @micknudsen has: important number of data located relatively at same coordinates => If clipped, they will turn into a big strange cluster of points at -5

Idea of a line looks cool and not that hard to implement => Could be plotted as soon as any data < -5 (absent otherwise) and disabled when user passes --y-min and/or --y-max => Maybe we could aslo add a tiny vertical line on top of horizontal one, to discretely mark coordinates of masked data ?

No matter the choice made, I think we should also add more logging about this => Because for now data simply disappear from the plot, as you said and no warning is raised => In most case it is OK, because data with log_ratio<-5 are often "antitarget" one, with little importance, but can be risky

tetedange13 commented 3 years ago
etal commented 3 years ago

Lemme look at this a little more.

The intended behavior was:

Aside from the segments, there are almost always lots of spurious single-bin datapoints (probes) and I think it's OK to ignore these in setting the plot limits and focus our attention on the segments.

I agree that showing clipping on the y-axis is much better than failing to plot those segments at all. I'm not sure the rug plot is exactly the right solution. I'd prefer to keep the color the same as the segments, only represent the segments (ignore individual probes, unless the input was only .cnr) and cling to the bottom spine rather than add another h-line. Maybe a rug plot with orange points facing upward, on a black line identical to the plot's bottom border? Or just spans of orange on that line, or a small triangle on the bottom to indicate each clipped segment?

tetedange13 commented 3 years ago

Hi @etal,

Thanks for your feedback ! This makes sense to rather focus on masked segments => Is this the kind of representation you have in mind for masked segments? (small orange triangle at bottom-right): masked_segment

Corresponding code:

    if segments and hidden_seg.sum():
        x_hidden = segments.start[hidden_seg] * MB
        y_hidden = np.array([y_min] * len(x_hidden)) + 0.2
        axis.scatter(x_hidden, y_hidden, marker='^', linewidth=3, snap=False,
                     color=segment_color)

Have a nice day. Felix.

etal commented 3 years ago

Yes, that mockup is very close. To ensure it's not mistaken for an ordinary segment, I'd like to avoid rounded corners (so it looks sharp and somewhat alarming), and ideally, overlap the plot's lower spine (higher z-order or something like that, so it obscures the plot boundary itself and is clearly "breaking through" -- if that's possible). To avoid rounding, maybe switch linewidth=3 to linewidth=1 or =0 in the call to axis.scatter?

tetedange13 commented 3 years ago

Hi @etal,

Result bellow is still with linewidth=3 (tell me if triangles size LGTY ?) => I used "amplicon.{cnr,cns}" from CNVkit test files Figure_1

Have a nice day. Felix.

etal commented 3 years ago

Hey Felix, this looks great! If you have a PR for this feature I'd be very happy to merge it.

tetedange13 commented 3 years ago

Hi @etal ,

I updated #645

Have a nice day. Felix.

etal commented 3 years ago

Great, I've reviewed and merged #645. This feature is now part of CNVkit.