@mencattini I have prepared everything for a Travis build (https://travis-ci.org/cui-unige/mcc4mcc). The build does not work because of a FIXME and a linter warning. Can you fix them?
The branch is `issue-11`.
I come back from vacation tomorrow. I will begin the fix on Friday.
Or it might also wait until Monday ;-)
The build failure comes from `mcc.py`, on line 192:

```python
# FIXME: i am not sure the result is correct, because there is no check
# that the fields of the characteristic have the same name as the
# fields that were used during learning.
```

Should I delete it? I'm not sure I am able to fix it; at least we need to talk about it.
Do not delete it; we will discuss it on Monday.
The training part just uses the values. It means the model doesn't know the categories; it only sees the arrays. It will be our job to make sure that the next array has the same form as the previous ones. If we preserve the order, there won't be any ambiguity.
But we do not set an order. Instead, we name fields...
Scikit doesn't use Pandas objects. Every function, like `fit` or `score`, takes as parameters:

    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        Training vectors, where n_samples is the number of samples
        and n_features is the number of features.
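To make this concrete, here is a minimal sketch (the classifier and the data are invented for the illustration, they are not taken from mcc4mcc): scikit-learn only sees positions, never field names, so swapping two columns silently changes what a sample means.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Column 0 = characteristic A, column 1 = characteristic B: the model only
# knows these positions, not any field names.
x_train = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y_train = np.array([1, 2, 2, 1])

model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)

# The same sample with its two columns swapped is a different point for the
# model, even though a named representation of it would be identical.
print(model.predict(np.array([[0, 1]])))
print(model.predict(np.array([[1, 0]])))
```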
The definition of array-like from NumPy:
In general, numerical data arranged in an array-like structure in Python can be converted to arrays through the use of the array() function. The most obvious examples are lists and tuples. See the documentation for array() for details for its use. Some objects may support the array-protocol and allow conversion to arrays this way. A simple way to find out if the object can be converted to a numpy array using array() is simply to try it interactively and see if it works! (The Python Way).
At this point, I wasn't sure whether Scikit uses the headers or not. By reading the Scikit code, I discovered that every function with an array-like parameter casts the vector by applying `check_array`. The Scikit documentation says:
Input validation on an array, list, sparse matrix or similar. By default, the input is converted to an at least 2D numpy array. If the dtype of the array is object, attempt converting to float, raising on failure.
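As a small illustration of that conversion (this snippet is an assumption of mine, not code from the repository), `check_array` turns a DataFrame into a plain numpy array, so the column names never reach the estimator:

```python
import pandas as pd
from sklearn.utils import check_array

frame = pd.DataFrame([{"a": 1.0, "b": 2.0}])
array = check_array(frame)

print(type(array))  # <class 'numpy.ndarray'>: the column names are gone
print(array)        # only the values, in the DataFrame's column order
```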
In the model, Scikit doesn't use the headers, only numpy arrays. It means the order matters.
Then, it means that we should convert all dictionaries passed to the learning algorithms into arrays, setting the order of the fields ourselves (and also passing that order to the `mcc.py` script, through the `learned.json` file).
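Something along these lines could work (the field names below are invented for the example; the real list would be whatever characteristics we extract):

```python
FIELD_ORDER = ["places", "transitions", "arcs"]  # hypothetical field names

def to_vector(characteristics, field_order=FIELD_ORDER):
    """Turn a characteristics dictionary into a list ordered by field_order."""
    return [characteristics[field] for field in field_order]

example = {"transitions": 12, "arcs": 30, "places": 7}
print(to_vector(example))  # [7, 12, 30], regardless of the dict's own order
```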
It is strange that the `mcc.py` tool always obtains a tool identifier (> 2) when doing `model.predict(pandas.DataFrame([test]))`.
It's a kind of preprocessing before using the `mcc` tool, right?
No, it can be computed during `extract.py`, saved in `learned.json`, and loaded at the beginning of `mcc.py` (in fact when loading `learned.json`).
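For instance (a rough sketch; the `"fields"` key and the field names are assumptions, the actual layout of `learned.json` is up to us):

```python
import json

# In extract.py: remember the field order used to build the training arrays.
fields = ["places", "transitions", "arcs"]
with open("learned.json", "w") as handle:
    json.dump({"fields": fields}, handle)

# In mcc.py: reload the same order before building the vectors to predict on.
with open("learned.json") as handle:
    fields = json.load(handle)["fields"]
```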
Proof of DataFrame ordering.
```python
import pandas as pd
import numpy as np

alphabet = np.array(list('abcdefghijklmnopqrstuvwxyz'))

# Shuffle the keys so that they are inserted in a random order.
keys = alphabet.copy()
np.random.shuffle(keys)
print(f"Are keys different from the alphabet order: {np.any(keys != alphabet)}")

# Random key insertion into the dictionary.
d = {}
for key in keys:
    d[key] = key

# Build a one-row DataFrame from the dictionary; with the pandas version used
# here, the columns are sorted, so the values come back in alphabetical order.
df = pd.DataFrame([d])
print(f"Are values in the same order as the sorted keys: {np.all(alphabet == df.values)}")
```
- `pycodestyle` on Python code;
- `autopep8` if needed;
- `pylint` on Python code.