NKI-CCB / DISCOVER

DISCOVER co-occurrence and mutual exclusivity analysis for cancer genomics data
Apache License 2.0

Type Error in python #13

Closed SBNoor closed 1 year ago

SBNoor commented 2 years ago

I've created a mut matrix from a MAF file. It is a binary matrix, as stated in the documentation for Python. However, when I run discover.DiscoverMatrix(mut) I get the following error:

[Screenshot of the error message; image not preserved]

Can you give me some insight into what might be causing this type error? My dataframe has shape 1367 rows × 3018 columns and looks like:

[Screenshot of the dataframe, 2021-07-22; image not preserved]
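For readers following along, here is a minimal sketch of how a binary matrix like the one described could be built from MAF-style records. It is not the code used in this issue. It assumes the standard MAF columns Hugo_Symbol and Tumor_Sample_Barcode, uses made-up records, and assumes genes go in rows and samples in columns.

```python
import pandas as pd

# Hypothetical MAF-style records; a real MAF file has many more columns.
maf = pd.DataFrame({
    "Hugo_Symbol": ["TP53", "TP53", "KRAS", "KRAS", "EGFR"],
    "Tumor_Sample_Barcode": ["s1", "s2", "s2", "s2", "s3"],
})

# Count mutations per (gene, sample) pair, then reduce the counts to 0/1.
counts = pd.crosstab(maf["Hugo_Symbol"], maf["Tumor_Sample_Barcode"])
mut = (counts > 0).astype(int)

print(type(mut).__name__, mut.shape)
print(mut)
```

Because pd.crosstab returns a DataFrame, the result stays a pandas object rather than a bare NumPy array or list.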

scanisius commented 2 years ago

Thanks for reporting this issue. A possible cause of this error is that your mut object is not a pandas DataFrame. Could that be the case? If it is a DataFrame, could you tell me what version of pandas you are using?
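Both questions can be answered with a quick check like the sketch below. The helper describe_input and the toy inputs are hypothetical and only illustrate the idea; they are not part of the discover package.

```python
import pandas as pd

def describe_input(mut):
    """Report the installed pandas version and whether `mut` is a DataFrame."""
    return {
        "pandas_version": pd.__version__,
        "is_dataframe": isinstance(mut, pd.DataFrame),
        "type": type(mut).__name__,
    }

# Hypothetical stand-ins for the real mutation matrix.
as_list = [[0, 1], [1, 0]]
as_frame = pd.DataFrame(as_list, index=["geneA", "geneB"], columns=["s1", "s2"])

print(describe_input(as_list))   # is_dataframe is False for a plain list
print(describe_input(as_frame))  # is_dataframe is True for a DataFrame
```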

SBNoor commented 2 years ago

It is pandas version 1.1.5. I figured out that I am supposed to use a pandas dataframe, and with that it works. However, when I use pairwise_discover_test() for mutual exclusivity I get the following error:

[Screenshot of the error message; image not preserved]

And events is of type discover.data.DiscoverMatrix and subset is of type pandas.core.series.Series. Do I need to install another version of pandas?

scanisius commented 2 years ago

The .ix attribute that is mentioned in the error message was removed in pandas version 1.0, after having been deprecated since 0.20. So a short-term fix would be to install pandas < 1.0.
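For code that still relies on .ix, the usual replacements are .loc (label or boolean-mask based) and .iloc (position based). The sketch below uses toy data with hypothetical gene and sample names. How the discover package itself used .ix is not shown in this thread, so this only illustrates the general pandas migration.

```python
import pandas as pd

# Toy stand-ins: a small gene-by-sample matrix and a boolean Series selecting genes.
events = pd.DataFrame(
    [[0, 1, 0], [1, 1, 0], [0, 0, 1]],
    index=["geneA", "geneB", "geneC"],
    columns=["s1", "s2", "s3"],
)
subset = pd.Series([True, False, True], index=events.index)

# events.ix[subset] raises AttributeError on pandas >= 1.0; use .loc or .iloc instead.
by_mask = events.loc[subset]                # boolean mask aligned on the index
by_label = events.loc[["geneA", "geneC"]]   # explicit row labels
by_position = events.iloc[[0, 2]]           # integer positions

print(by_mask)
```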

A new release of the discover package is planned for early next week, which will contain a fix for this issue. It will also have some speed improvements. So even if you go for the short-term fix, I would recommend checking back next week.

SBNoor commented 2 years ago

Will it be possible to leave a comment here once you've updated the package?

scanisius commented 2 years ago

Sure. I will leave this issue open until the new version is published.

scanisius commented 2 years ago

Version 0.9.4 was released today. Among other things, it fixes the incompatibility with recent versions of pandas.

SBNoor commented 2 years ago

I see that the newer version is supposed to be faster. I have a matrix of size 22000x11000, and I've been running the script on an HPC cluster for about 3 hours. Can you give a rough estimate of how long it would normally take?

scanisius commented 2 years ago

Indeed the latest release is quite a bit faster. However, yours is an extremely large data set, so this will still need a long time to finish. I am not able to say how long it would take for your data, but you may have to think in the order of days rather than hours. There are a few things I can suggest to try and speed things up:

scanisius commented 1 year ago

This issue has been inactive for a while now, so I am closing it. Please open a new issue if you are still experiencing problems with DISCOVER.