8080labs / ppscore

Predictive Power Score (PPS) in Python
MIT License

Numpy arrays and Unknown label type: 'continuous' #16

Closed clementpoiret closed 4 years ago

clementpoiret commented 4 years ago

Hi, I'm quickly experimenting by implementing ppscore in my pipeline for the assessment of functional connectivity between brain regions, and I noticed two things:

1/ I think we should be able to use pps.matrix() even on a 2D numpy array when we don't have explicit column names: as of now, it is raising the error AttributeError: 'numpy.ndarray' object has no attribute 'columns'

2/ I got a strange error telling me that "continuous" is an unknown label.

File "/home/clementpoiret/anaconda3/envs/nilearn/lib/python3.8/site-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Code to reproduce the error:

import numpy as np
import pandas as pd
import ppscore as pps

X = pd.DataFrame(np.random.randn(10,10))
pps.matrix(X)

The error is solved by passing task='regression'. I have sklearn 0.23.0.

One additional comment: maybe the diagonal of the resulting matrix should be 1, since it makes sense that the predictive power of a vector on itself is 1, no?
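Until pps.matrix() accepts arrays directly, point 1/ can be worked around on the caller's side. The following is only a minimal sketch: the helper name array_to_frame and the "x" column prefix are hypothetical and not part of ppscore. It wraps a 2D numpy array in a DataFrame with generated column names, so that the result has the `columns` attribute the error message complains about.

```python
import numpy as np
import pandas as pd


def array_to_frame(arr, prefix="x"):
    """Wrap a 2D array in a DataFrame with generated string column names.

    Hypothetical helper for illustration; it is not part of ppscore.
    """
    arr = np.asarray(arr)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2D array, got {arr.ndim} dimension(s)")
    columns = [f"{prefix}{i}" for i in range(arr.shape[1])]
    return pd.DataFrame(arr, columns=columns)


# Example input: 10 observations of 3 variables.
data = np.random.default_rng(0).standard_normal((10, 3))
frame = array_to_frame(data)

# frame now has named columns and can be passed on the same way as in this
# thread, e.g. pps.matrix(frame, task='regression')
print(list(frame.columns))
```

String column names are used here only so the rows and columns of the resulting matrix are easy to read; the integer names that pd.DataFrame generates by default would also give the object a `columns` attribute.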

8080labs commented 4 years ago

Hi Clement,

thank you for your suggestions.

1) I think that makes sense, and we should be able to add this easily to the API.

2) I think we already saw a similar error, which occurs when the values are floats but the task is treated as a classification. This can also be fixed by changing the dtype of the series.

3) It makes total sense that the diagonal should be 1, and this should already be the case. In which of your examples was the diagonal not 1?

Thank you, Florian
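The traceback in the first post comes from scikit-learn's check_classification_targets, which rejects targets that it considers 'continuous'. As a rough illustration (this is not ppscore code, and the example arrays are made up), scikit-learn's type_of_target shows how random floats and integer labels are categorised:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)

# Random floats, like the columns of the DataFrame in the original post.
continuous_target = rng.standard_normal(10)

# Integer class labels, which a classifier accepts as a target.
label_target = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])

continuous_kind = type_of_target(continuous_target)
label_kind = type_of_target(label_target)

print(continuous_kind)  # continuous
print(label_kind)       # binary
```

A classifier given the first kind of target raises the ValueError seen above, which is consistent with the error disappearing once task='regression' is passed.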

clementpoiret commented 4 years ago

Hi Florian,

Thanks for your answer. It's occurring when I use the pps on time series extracted from an fMRI. But it also occurs with the code in the original post. For example, I just ran:

import numpy as np
import pandas as pd
import ppscore as pps

X = pd.DataFrame(np.random.randn(10,10))
pps.matrix(X, task='regression')

and it returned the following matrix:

>>> pps.matrix(X, task='regression')
          0        1         2  3  4         5       6  7  8  9
0  0.000000  0.00000  0.000000  0  0  0.000000  0.0000  0  0  0
1  0.000000  0.00000  0.000000  0  0  0.000000  0.0000  0  0  0
2  0.000000  0.00000  0.000000  0  0  0.000000  0.0000  0  0  0
3  0.000000  0.00000  0.000000  0  0  0.000000  0.0000  0  0  0
4  0.000000  0.00000  0.000000  0  0  0.000000  0.0000  0  0  0
5  0.000000  0.00000  0.000000  0  0  0.000000  0.0000  0  0  0
6  0.000000  0.00000  0.000000  0  0  0.085524  0.0000  0  0  0
7  0.000000  0.29528  0.000000  0  0  0.000000  0.0422  0  0  0
8  0.255183  0.00000  0.000000  0  0  0.000000  0.0000  0  0  0
9  0.000000  0.00000  0.027208  0  0  0.000000  0.0000  0  0  0

8080labs commented 4 years ago

Thank you for the example. Passing a task to pps.matrix() bypasses the logic for the diagonal, which is why the diagonal is 0 in your output.

I would love to see your example with the time series data, provided it is not under an NDA. If you want, we could have a quick video session about it.

Florian
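As a stopgap while passing a task bypasses the diagonal logic, the diagonal can be set after the fact. This is only a sketch under the assumption, suggested earlier in this thread, that a column's score on itself should be 1. The helper name with_unit_diagonal and the toy scores are hypothetical, and this is not how ppscore handles the diagonal internally.

```python
import numpy as np
import pandas as pd


def with_unit_diagonal(matrix):
    """Return a copy of a square score matrix with its diagonal set to 1.

    Hypothetical helper for illustration; it is not part of ppscore.
    """
    # Copy into a float array so the input DataFrame is left untouched and
    # mixed int/float columns are handled uniformly.
    values = matrix.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(values, 1.0)
    return pd.DataFrame(values, index=matrix.index, columns=matrix.columns)


# Toy stand-in for the output of pps.matrix(X, task='regression').
scores = pd.DataFrame(
    [[0.0, 0.2], [0.3, 0.0]], index=["a", "b"], columns=["a", "b"]
)
fixed = with_unit_diagonal(scores)
print(fixed)
```

The off-diagonal scores are left unchanged, so only the self-scores differ from what pps.matrix() returned.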

clementpoiret commented 4 years ago

Sorry for the delay. I have some deadlines coming up with the end of my MSc and the beginning of my PhD, so I don't have a lot of free time, but I'd be happy to discuss the potential benefits of the pps in neuroimaging! If you want to take a look, here is the repo of the script where I added support for pps: https://github.com/clementpoiret/fmri_connectivity_measures

FlorianWetschoreck commented 4 years ago

To summarize this issue:

If you want to discuss the pps in neuroimaging, please open a new issue :)