mergeSetChains error "number of supplied new cluster labels does not match the number of clusters in w_chain"

SebastianHollizeck commented 2 years ago

Hi,

I just tried to run PICTograph on my dataset with about 2300 variants in 6 samples, and everything went fine until the merging of the chains at the end.

I left everything at the default settings when choosing the K for the sets:

> set_k_choices
# A tibble: 59 × 5
   set_name_bin min_BIC elbow  knee chosen_K
   <chr>          <dbl> <dbl> <dbl>    <dbl>
 1 111111             6     2     2       NA
 2 111110             2     2     2        2
 3 111101             3     3     3        3
 4 111100             1     1     1        1
 5 111011             5     2     2       NA
 6 111010             2     2     4       NA
 7 111001             2     2     2        2
 8 111000             2     2     2        2
 9 110111             1     1     1        1
10 110110             1     1     1        1
# … with 49 more rows

I did some digging, and it looks like best_K_vals, which is calculated from the z_chain during the merge, has value 2 for one set, whereas the w_chain has value 3.

I can obviously set that one set to 2 instead of 3 manually, based on the BIC plot in question (plot not shown).

But I think this qualifies as a bug.

Sadly, I don't know how else to even address the issue.

Thank you for your help, Sebastian

llyzhng commented 2 years ago

If the default values in the chosen_K column are NA, the user needs to spot check and manually fill these in. Are you still getting an error when chosen_K is fully set?
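
A minimal sketch of filling in the NA entries, assuming a table shaped like the set_k_choices shown above. The data frame below is a mock copy of the first six rows from the report, and falling back to min_BIC is only an illustration, not a recommendation from the package; each filled-in value should still be checked against the BIC plot for that set.

```r
# Mock table mirroring the first rows of set_k_choices from the report.
set_k_choices <- data.frame(
  set_name_bin = c("111111", "111110", "111101", "111100", "111011", "111010"),
  min_BIC      = c(6, 2, 3, 1, 5, 2),
  elbow        = c(2, 2, 3, 1, 2, 2),
  knee         = c(2, 2, 3, 1, 2, 4),
  chosen_K     = c(NA, 2, 3, 1, NA, NA)
)

# Rows where no K was chosen automatically and a manual decision is needed.
needs_review <- which(is.na(set_k_choices$chosen_K))
print(set_k_choices[needs_review, ])

# Illustrative fallback only: use min_BIC for the rows under review.
set_k_choices$chosen_K[needs_review] <- set_k_choices$min_BIC[needs_review]

# Every set should now have a K before collecting the best chains.
stopifnot(!anyNA(set_k_choices$chosen_K))
```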

SebastianHollizeck commented 2 years ago

Yes, I tried both the default of not setting the selected K and just using the minimum BIC column to select the chains, and both showed the error.

SebastianHollizeck commented 2 years ago

Even though the chosen K was 3

> set_k_choices[47,]
# A tibble: 1 × 5
  set_name_bin min_BIC elbow  knee chosen_K
  <chr>          <dbl> <dbl> <dbl>    <dbl>
1 001111             3     3     3        3

I supplied the chosen K to the selection as it is calculated in the merge of the chains. It is all the same, except that set 47 is 2 instead of 3.

> best_K_vals <- unname(sapply(best_set_chains, function(x) max(x$z_chain$value)))
> which(best_K_vals!=set_k_choices$chosen_K)
[1] 47
> best_set_chains <- collectBestKChains(all_set_results, chosen_K =best_K_vals)

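The comparison above can be turned into a small per-set check, sketched here under several assumptions. Each set is assumed to hold a z_chain and a w_chain in long format with Parameter and value columns, and the w parameters are assumed to be named like w[k,s], with k the cluster index. The helper count_w_clusters is hypothetical and the data is mock. One possible cause of such a mismatch is a cluster that no mutation is assigned to, which would make max(z) smaller than the number of clusters in w.

```r
# Hypothetical helper: number of distinct cluster indices in a w_chain.
# Assumes Parameter names of the form "w[k,s]" (k = cluster, s = sample).
count_w_clusters <- function(w_chain) {
  k <- sub("^w\\[([0-9]+),.*$", "\\1", as.character(w_chain$Parameter))
  length(unique(k))
}

# Mock chains for two sets: one consistent, one with an unused third cluster.
mock_chains <- list(
  `111110` = list(
    z_chain = data.frame(Parameter = c("z[1]", "z[2]"), value = c(1, 2)),
    w_chain = data.frame(Parameter = c("w[1,1]", "w[2,1]"), value = c(0.6, 0.2))
  ),
  `001111` = list(
    z_chain = data.frame(Parameter = c("z[1]", "z[2]"), value = c(1, 2)),
    w_chain = data.frame(Parameter = c("w[1,1]", "w[2,1]", "w[3,1]"),
                         value = c(0.6, 0.2, 0.0))
  )
)

# K implied by the cluster assignments versus K implied by the w parameters.
k_from_z <- sapply(mock_chains, function(x) max(x$z_chain$value))
k_from_w <- sapply(mock_chains, function(x) count_w_clusters(x$w_chain))

# Sets where the two disagree.
mismatched <- names(mock_chains)[k_from_z != k_from_w]
print(mismatched)
```
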
But when I then merged the set chains, there were multiple warnings:

> chains <- mergeSetChains(best_set_chains, input_data)
Warning messages:
1: Problem while computing `Mutation_index = as.numeric(...)`.
ℹ NAs introduced by coercion 
2: Problem while computing `s = as.numeric(...)`.
ℹ NAs introduced by coercion

When I plotted the cluster assignment, it seemed to look okay (even though it is too big to be plotted sensibly).

So I wanted to get the mutation assignment, but it errored:

> writeClusterAssignmentsTable(chains$z_chain)
Error in `mutate()`:
! Problem while computing `Mut_ID = Mut_ID`.
✖ `Mut_ID` must be size 1, not 2692.
ℹ The error occurred in group 1: Parameter = z[1].
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/dplyr:::mutate_error>
Error in `mutate()`:
! Problem while computing `Mut_ID = Mut_ID`.
✖ `Mut_ID` must be size 1, not 2692.
ℹ The error occurred in group 1: Parameter = z[1].
---
Backtrace:
 1. pictograph::writeClusterAssignmentsTable(chains$z_chain)
 6. dplyr:::mutate.data.frame(., Mut_ID = Mut_ID, Cluster = value)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/dplyr:::mutate_error>
Error in `mutate()`:
! Problem while computing `Mut_ID = Mut_ID`.
✖ `Mut_ID` must be size 1, not 2692.
ℹ The error occurred in group 1: Parameter = z[1].
---
Backtrace:
     ▆
  1. ├─pictograph::writeClusterAssignmentsTable(chains$z_chain)
  2. │ └─... %>% arrange(Cluster)
  3. ├─dplyr::arrange(., Cluster)
  4. ├─dplyr::select(., Mut_ID, Cluster)
  5. ├─dplyr::mutate(., Mut_ID = Mut_ID, Cluster = value)
  6. ├─dplyr:::mutate.data.frame(., Mut_ID = Mut_ID, Cluster = value)
  7. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
  8. │   ├─base::withCallingHandlers(...)
  9. │   └─mask$eval_all_mutate(quo)
 10. ├─dplyr:::dplyr_internal_error(...)
 11. │ └─rlang::abort(class = c(class, "dplyr:::internal_error"), dplyr_error_data = data)
 12. │   └─rlang:::signal_abort(cnd, .file)
 13. │     └─base::signalCondition(cnd)
 14. └─dplyr `<fn>`(`<dpl:::__>`)
 15.   └─rlang::abort(...)

Is this still connected to the original bug, or is this a different issue?

llyzhng commented 2 years ago

Thanks for bringing this to our attention. The method hasn't been tested on datasets as large as yours. Could you share the input data via email?

SebastianHollizeck commented 2 years ago

While this is human data, it is only a downsampled version of all somatic variants called in these samples, so I can share it here. The original dataset contains about 60K variants; the ones I used as input are protein-altering variants for which copy number information was available.

The following link contains the download for the RDS which contains the input list I used. https://cloudstor.aarnet.edu.au/plus/s/efWBYj3wlD3Wanq This link will only be accessible for a week.

jiaying2508 commented 1 year ago

Hi Sebastian,

Sorry for the slow response. We've had some personnel turnover on our end, and we were finally able to look at your problem. We discovered that in your input file, while the dimensions for y, n, and tcn are 2692 x 6, the dimensions for m are 2735 x 6. To fix the error, you would need to make the dimensions consistent. We have also made adjustments to the code to handle samples with a very large number of mutations. If you fix the input and update to the latest version of PICTograph (v1.2.0.1), you should be able to run your job without any problems.
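
A quick way to catch this kind of mismatch before running the model, sketched with a mock input list. The element names y, n, tcn, and m and the dimensions come from the comment above; the matrix values are placeholders, and the real input list may contain further elements.

```r
# Mock input list with the dimensions reported above (values are placeholders).
input_data <- list(
  y   = matrix(0, nrow = 2692, ncol = 6),
  n   = matrix(0, nrow = 2692, ncol = 6),
  tcn = matrix(0, nrow = 2692, ncol = 6),
  m   = matrix(0, nrow = 2735, ncol = 6)
)

# One column per matrix; first row is the row count, second the column count.
dims <- sapply(input_data[c("y", "n", "tcn", "m")], dim)
rownames(dims) <- c("rows", "cols")
print(dims)

# Flag any matrix whose dimensions differ from those of y.
inconsistent <- names(which(apply(dims, 2, function(d) any(d != dims[, "y"]))))
print(inconsistent)
```

With the real data, an empty result for inconsistent would indicate that all four matrices agree in shape.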

KarchinLab / pictograph, issue #7