PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0

Getting duplicated error #18

Closed beginner984 closed 5 years ago

beginner984 commented 5 years ago

Hi,

Sorry, I have two SV call sets, from a cancer sample and its matched model, both called by **Manta** (VCFs linked below). When I try to compare them with the truth set and plot the ROC curve, I get the errors shown after the links:

https://www.dropbox.com/s/7upp1517w4zjw0j/c005.vcf?dl=0

https://www.dropbox.com/s/723i1i6uvadujax/o005.vcf?dl=0

> svgr$truth_matches <- countBreakpointOverlaps(svgr, truth_svgr,
+                                               # read pair based callers make imprecise calls.
+                                               # A margin around the call position is required when matching with the truth set
+                                               maxgap=100,
+                                               # Since we added a maxgap, we also need to restrict the mismatch between the
+                                               # size of the events. We don't want to match a 100bp deletion with a
+                                               # 5bp duplication. This will happen if we have a 100bp margin but don't also
+                                               # require an approximate size match as well
+                                               sizemargin=0.25,
+                                               # We also don't want to match a 20bp deletion with a 20bp deletion 80bp away
+                                               # by restricting the margin based on the size of the event, we can make sure
+                                               # that simple events actually do overlap
+                                               restrictMarginToSizeMultiple=0.5,
+                                               # HYDRA makes duplicate calls and will sometimes report a variant multiple
+                                               # times with slightly different bounds. countOnlyBest prevents these being
+                                               # double-counted as multiple true positives.
+                                               countOnlyBest=TRUE)
 Error in .assertValidBreakpointGRanges(query) : 
  Breakpoint GRanges names cannot duplicated 
> ggplot(as.data.frame(svgr) %>%
+          dplyr::select(QUAL, caller, truth_matches) %>%
+          dplyr::group_by(caller, QUAL) %>%
+          dplyr::summarise(
+            calls=n(),
+            tp=sum(truth_matches > 0)) %>%
+          dplyr::group_by(caller) %>%
+          dplyr::arrange(dplyr::desc(QUAL)) %>%
+          dplyr::mutate(
+            cum_tp=cumsum(tp),
+            cum_n=cumsum(calls),
+            cum_fp=cum_n - cum_tp,
+            Precision=cum_tp / cum_n,
+            Recall=cum_tp/length(truth_svgr))) +
+   aes(x=Recall, y=Precision, colour=caller) +
+   geom_point() +
+   geom_line() +
+   labs(title="NA12878 chr22 CREST and HYDRA\nSudmunt 2015 truth set")
Error in data.frame(seqnames = as.factor(seqnames(x)), start = start(x),  : 
  duplicate row.names: MantaINS:0:0:0:0:0:0_bp1, MantaDEL:27:0:0:0:0:0_bp1, MantaINS:31:0:0:0:0:0_bp1, MantaINS:45:0:0:0:0:0_bp1, MantaINS:41:0:0:0:0:0_bp1, MantaINS:50:0:0:0:0:0_bp1, MantaINS:35:0:0:0:0:0_bp1, MantaINS:62:0:0:0:0:0_bp1, MantaDEL:57:0:0:0:0:0_bp1, MantaINS:68:0:0:0:0:0_bp1, MantaDEL:85:0:0:0:0:0_bp1, MantaDEL:99:0:0:0:0:0_bp1, MantaINS:65:0:0:0:2:0_bp1, MantaDEL:97:0:0:0:0:0_bp1, MantaDEL:106:0:0:0:0:0_bp1, MantaDEL:128:0:0:0:0:0_bp1, MantaDEL:123:0:0:0:0:0_bp1, MantaDEL:115:0:0:0:0:0_bp1, MantaINS:145:0:0:0:0:0_bp1, MantaDEL:131:0:0:0:0:0_bp1, MantaINS:141:0:0:0:0:0_bp1, MantaINS:143:0:0:0:0:0_bp1, MantaINS:160:0:0:0:0:0_bp1, MantaINS:167:0:0:0:0:0_bp1, MantaDEL:192:0:0:0:0:0_bp1, MantaINS:197:0:0:0:0:0_bp1, MantaINS:212:0:0:0:0:0_bp1, MantaDEL:233:0:0:0:0:0_bp1, MantaDEL:221:0:0:0:0:0_bp1, MantaINS:203:0:0:0:0:0_bp1, MantaDEL:231:0:0:0:0:0_bp1, MantaDEL:242:0:0:0:0:0_bp1, MantaDEL:249:0:0:0:0:0_bp1, MantaDEL:236:0:0:0:0:0_bp1, MantaINS:250:0:0:0:0:0_bp1, Mant

I need a plot that shows how the SVs in these two samples relate to each other.

Any help please?

d-cameron commented 5 years ago

Breakpoint GRanges names cannot duplicated

It sounds like you have multiple rows with the same identifier. Did you concatenate two different Manta VCFs into the same GRanges object? Manta reuses the same variant IDs in every VCF it writes, so the combined object ends up with duplicate names. All row names need to be unique, so you can either a) generate the ROC data frame for each sample separately and then bind_rows them together at the end (this is the approach I used in my benchmarking paper), or b) make the row names unique before combining (e.g. by updating both the names() and the $partner column with a sample identifier).
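
As a rough illustration of the second option, here is a minimal sketch that prefixes both the names and the `partner` column with a sample identifier before the two call sets are combined. `tag_sample()` and `make_toy()` are hypothetical helpers written for this example, not part of StructuralVariantAnnotation, and the toy GRanges only stand in for the breakpoint GRanges you would get from your own c005 and o005 VCFs.

```r
library(GenomicRanges)

# Hypothetical helper (not part of StructuralVariantAnnotation): prefix the
# names and the partner column together so each breakend still points at
# its mate after renaming.
tag_sample <- function(gr, sample_id) {
  names(gr) <- paste(sample_id, names(gr), sep = "_")
  gr$partner <- paste(sample_id, gr$partner, sep = "_")
  gr
}

# Toy stand-in for one sample's breakpoint GRanges. Both samples get the
# same Manta-style IDs, which is what triggers the duplicate-name error.
make_toy <- function() {
  gr <- GRanges(seqnames = c("chr1", "chr1"),
                ranges = IRanges(start = c(100, 200), width = 1),
                strand = c("+", "-"))
  names(gr) <- c("MantaDEL:27:0:0:0:0:0_bp1", "MantaDEL:27:0:0:0:0:0_bp2")
  gr$partner <- rev(names(gr))
  gr
}

c005 <- tag_sample(make_toy(), "c005")
o005 <- tag_sample(make_toy(), "o005")

# Combining is now safe: names are unique and every partner resolves.
svgr <- c(c005, o005)
stopifnot(anyDuplicated(names(svgr)) == 0)
stopifnot(all(svgr$partner %in% names(svgr)))
```

The important detail is that the names and `$partner` are renamed in the same step; changing only the names would leave each breakend pointing at a partner ID that no longer exists.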