etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
520 stars 163 forks source link

Start and end coordinates switched - is this expected? #677

Open kkchau opened 2 years ago

kkchau commented 2 years ago

Output of segmentrics resulted in a call with start > end (cnvkit==0.9.9):

Command: cnvkit.py segmetrics -s <<output_from_segment_command>> <<output_from_fix_command>> -o <<output>> --ci --bootstrap 100

Result:

chromosome      start   end     gene    log2    depth   probes  weight
...
4       14258435        15448470        -       -0.329143       0.000253774     1       0.998585
15      46657639        10488   -       -0.41756        0       1       0
...

Is this to be expected? Should we include a post-processing step to filter out this 0-weight segment? Thank you for your advice.

etal commented 2 years ago

That is not expected. It's also surprising that each of those segments has only 1 probe on it. Could you share the corresponding lines of the input .cns file?

kkchau commented 2 years ago

Sure, here are the relevant lines:

From input cns file:

...
15      46657639        10488   -       -0.41756000000000004    0.0     1       0.0     hmm-germline:0.0
...

Note that we concatenate the results of multiple segmenters into one input cns file and add a new column for method:weight (in this case, hmm-germline with a weight of 0). This has worked for us in the past (we've recently upgraded from 0.8.5 to 0.9.9), and still continues to work, aside from this situation.

From cnr file:

chromosome      start   end     gene    log2    depth   weight
1       10500   176917  Antitarget      -9.84839        0       0.934803
1       318219  470868  Antitarget      -9.04625        0       0.931417
1       570871  1221929 Antitarget      -9.16632        0       0.99401
...
15      46657094        46657139        -       -28.5108        0
15      46657639        47538217        Antitarget      -0.41756        0.000222581     0.997043
15      47538217        48418796        Antitarget      -7.24583        0       0.98933
15      48419296        48419340        SLC24A5 -28.7406        0
...
etal commented 2 years ago

Thanks for the details. Since the segment has 0 weight I suppose it makes sense for you to add a post-processing step to remove 0-weight segments; it's not clear to me what the behavior should be when a segment and/or all data points supporting it are bogus, but scrambling the start and end coordinates is not particularly helpful. I'll see if I can figure out why this is happening internally.

kkchau commented 2 years ago

Hi @etal,

Thanks for looking into this. I have another example of this issue that might be helpful:

Ratios:

chromosome  start   end gene    log2    depth   weight
...
13  94293168    94293212    GPC6    -6.12509    0.0227273   0.91648
13  94293212    94293257    GPC6    0.821666    1   0.919367
13  94293257    94293302    GPC6    0.812616    1   0.928625
13  94293302    94293347    GPC6    -0.193662   1   0.928721
...

Segments:

chromosome  start   end gene    log2    depth   probes  weight  method_weight
...
13  94293168    10488   -   -1.1598 0.0 4   0.0 hmm-germline:0.0
...

We see that the segment 13:94293168-10488 uses four probes all at > 0.9 weight, just the corresponding end coordinate doesn't match the end coordinate of that last probe. Maybe 10488 represents some kind of internal code? This second case also occurs with hmm-germline.