NKI-CCB / DISCOVER

DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data
Apache License 2.0

Error in python #14

Open taotao-mars opened 2 years ago

taotao-mars commented 2 years ago

Hi,

One error occurred when I used DISCOVER, could you help me with that? Thanks!

[Screenshot: 2021-11-27, 11:54:10 PM]

The format of my binary data is:

[Screenshot: 2021-11-27, 11:53:49 PM]

scanisius commented 2 years ago

Unfortunately, the error message alone is not specific enough to pinpoint the source of this error. So I may need some more information from you.

There is one detail that catches my eye: it looks like the gene names are a regular column in your data frame as opposed to the index, which is what DISCOVER expects. If you read these data with read_table (or read_csv), could you try passing the argument index_col=0 to read_table?
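As a minimal sketch of that suggestion, the snippet below reads a small, made-up mutation table with pandas so that the first column becomes the index. The file contents, gene names, and sample names are hypothetical; only `read_csv` and `index_col=0` come from the discussion above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a mutation file: first column holds gene names,
# the remaining columns are samples with 0/1 mutation calls.
raw = io.StringIO(
    "gene,s1,s2,s3\n"
    "GENE_A,0,1,0\n"
    "GENE_B,1,0,0\n"
)

# index_col=0 turns the gene-name column into the row index
# instead of leaving it as a regular data column.
events = pd.read_csv(raw, index_col=0)

print(events.index.tolist())
print(events.columns.tolist())
```

After this call the data frame contains only numeric sample columns, with the gene names as row labels.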

Please try that first. If the above does not fix the problem, I will ask you for some more detailed information.

taotao-mars commented 2 years ago

Hi,

Thanks for your reply. Yes, I added index_col=0 when calling read_csv.

[Screenshot: 2021-11-29, 11:33:11 AM]

I compared the example data frame with my data frame. They look similar, and the same error appeared.

[Screenshot: 2021-11-29, 11:34:34 AM]

scanisius commented 2 years ago

In this example it seems that all elements of subset are False, which means that pairwise_discover_test receives an empty mutation matrix. That would indeed give the error you are seeing.

With the line df11 = df11.iloc[:5, :5] you have overwritten (probably unintentionally) your full mutation matrix with a small sub-matrix that does not contain any mutations anymore.
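The situation can be reproduced with pandas alone. The sketch below uses a made-up matrix and a hypothetical threshold (`min_mutations`); it keeps the full matrix under its own name and checks whether the gene filter selects anything before a test would be run. It does not call DISCOVER itself.

```python
import pandas as pd

# Hypothetical mutation matrix: genes as rows, samples as columns.
# All mutations lie outside the top-left 2x2 corner.
events = pd.DataFrame(
    [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [1, 1, 0, 1]],
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=["s1", "s2", "s3", "s4"],
)

# Give the test slice a new name so the full matrix is not overwritten.
events_small = events.iloc[:2, :2]

# Select genes with at least `min_mutations` mutations (assumed threshold).
min_mutations = 1
subset = events_small.sum(axis=1) >= min_mutations

# If no gene passes the filter, the matrix handed to the test would be empty.
if not subset.any():
    print("No genes pass the filter; choose a larger or different slice.")
```

When testing on a subset of a large matrix, a slice that still contains mutated genes (or a random sample of rows and columns) avoids this case.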

taotao-mars commented 2 years ago

Thanks for pointing that out. My data set is huge, so I wanted to take a subset of it for testing. The problem was solved when I increased the amount of data.

Also, are there any parameters I should adjust for large data sets? My analysis has been running for over 13 hours. Thanks

scanisius commented 2 years ago

Good to hear your problem has been solved. In the next update of the DISCOVER package I will add an explicit check for empty mutation matrices so that at least the error message will be more informative.

As for your second question, I assume that the long runtime you report is for the pairwise_discover_test function, not for the call to DiscoverMatrix. Is that correct? If so, you can speed up the process substantially by passing the argument fdr_method="BH" to pairwise_discover_test.

This speedup does come at a price, though. With the above option you are asking DISCOVER to perform multiple testing correction with the standard Benjamini-Hochberg procedure, as opposed to the default, which uses a discrete version of the Benjamini-Hochberg procedure. The advantage of the discrete version is that it tends to give lower Q values; the disadvantage is that it takes much more time. In contrast, the standard version (enabled with the fdr_method="BH" argument) is faster, at the cost of somewhat reduced sensitivity (i.e. higher Q values). That trade-off is yours to make.
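For readers unfamiliar with the standard procedure, here is a minimal NumPy sketch of Benjamini-Hochberg adjustment. It is an illustration only, not DISCOVER's own implementation, and the function name `benjamini_hochberg` and the example P values are made up for this sketch. The discrete variant used by default is not shown.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Return standard Benjamini-Hochberg adjusted P values (Q values)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)

    # Scale the i-th smallest P value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)

    # Enforce monotonicity: each Q value is the minimum of the scaled
    # values at its own rank and all higher ranks.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]

    # Put the Q values back in the original order, capped at 1.
    q = np.empty(m)
    q[order] = np.minimum(monotone, 1.0)
    return q


# Hypothetical P values for four gene pairs.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(qvalues)
```

The procedure needs only a sort and a cumulative minimum over the P values, which is consistent with it being the cheaper option.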