But train_data has the data type List[Tuple[str, Dict[str, float]]]. How do you turn that into a numpy array?
I updated my original post to make my modifications clearer. I only did a simple train_data = np.array(train_data) in order to have a numpy.ndarray[Tuple[str, Dict[str, float]]].
Does... that work? I had no idea you could have numpy arrays of tuples. And surely the array can't have a dict... Like, how would that work?
I am not very experienced with numpy technicalities, but the following code:
import numpy as np
data = [("text", {"catA": False, "catB": True})]
print(data)
npdata = np.array(data)
print(npdata)
gives the following result:
[('text', {'catA': False, 'catB': True})]
[['text' {'catA': False, 'catB': True}]]
From my understanding, what numpy does under the hood is turn the tuple into an array. What I don't understand is why this only has a minor effect on model performance instead of either raising an exception because the type is wrong or destroying the performance completely. Also, the fact that this only happens when using a pretrained model seems quite peculiar.
I'm guessing some sort of data type check is failing... Or possibly there's some datatype conversion that's not ideal? Or maybe it messes up the shuffling? Either way, the solution would be "don't do that", I guess.
There's no benefit to calling numpy.array() on an arbitrary Python list like that. The result isn't really an array, it's just a list with a different name, and maybe different problems. That's why I was surprised it would work --- it doesn't do anything useful.
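To make that concrete, here's a minimal sketch (illustrative names, not from your script) of what numpy actually builds for data like yours:

import numpy as np

# Heterogeneous Python data forces numpy to fall back to dtype=object:
# each cell is just a reference to an arbitrary Python object.
data = [("text", {"catA": False, "catB": True})]
arr = np.array(data)

print(arr.dtype)        # object
print(arr.shape)        # (1, 2) -- the tuple was split into two columns
print(type(arr[0, 1]))  # <class 'dict'>

You get none of the things that make numpy arrays fast (contiguous typed storage, vectorised operations), just object pointers arranged in a grid.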
Originally the reason I did it was so I could index my data using a list of indices to perform cross-validation with the standard scikit-learn helpers. I agree that the conversion to a numpy array is not strictly necessary for this, but I believe it is a fairly common and documented approach that is generally recommended in StackOverflow threads.
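For illustration, the pattern I mean looks roughly like this (a sketch with made-up data and scikit-learn's KFold; the list-comprehension variant avoids the conversion entirely):

import numpy as np
from sklearn.model_selection import KFold

# Made-up data in the same List[Tuple[str, Dict[str, float]]] shape
train_data = [
    ("good movie", {"POSITIVE": 1.0}),
    ("bad movie", {"POSITIVE": 0.0}),
    ("great film", {"POSITIVE": 1.0}),
    ("awful film", {"POSITIVE": 0.0}),
]

kf = KFold(n_splits=2)
for train_idx, test_idx in kf.split(train_data):
    # Fancy indexing with an index array only works on numpy arrays,
    # which is what made the conversion seem necessary...
    fold_train = np.array(train_data, dtype=object)[train_idx]
    # ...although plain list comprehensions avoid it entirely:
    fold_train_list = [train_data[i] for i in train_idx]
    fold_test_list = [train_data[i] for i in test_idx]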
The fact that it fails completely silently, and in a very hard-to-detect way, is the big issue here in my opinion, as it could lead developers to think their results are much worse than they actually are.
Maybe adding a simple type check with a warning message or adding a quick paragraph in the documentation could be enough to tackle the issue without having to change the core in any meaningful way.
The indexing thing is a neat point, but I still really dislike that numpy lets you make these not-actually-array objects out of containers of arbitrary Python objects.
I had another look at what might be wrong, and I think it probably is random.shuffle(). Have a look at this:
>>> import random
>>> from numpy import array
>>> a = array([(0, {"a": 1}), (1, {"b": 2})])
>>> a
array([[0, {'a': 1}],
       [1, {'b': 2}]], dtype=object)
>>> random.shuffle(a)
>>> a
array([[0, {'a': 1}],
       [0, {'a': 1}]], dtype=object)
So if you replace the random.shuffle() call in your loop with numpy.random.shuffle(), it should work.
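To spell out the mechanics as I understand them (illustrative data, same shape as above):

import random
import numpy as np

a = np.array([(0, {"a": 1}), (1, {"b": 2})])

# random.shuffle swaps elements with `a[i], a[j] = a[j], a[i]`.
# On a 2-D numpy array, a[i] and a[j] are *views*, not copies: once
# row i has been overwritten, the second assignment copies the
# already-overwritten data back into row j, so one row is duplicated
# and the other is lost.
random.shuffle(a)        # silently corrupts the rows

b = np.array([(0, {"a": 1}), (1, {"b": 2})])
np.random.shuffle(b)     # permutes whole rows correctly, in place
print(b)                 # both original rows still present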
I definitely sympathise that your user experience has not been great. However, assuming this random.shuffle() thing is the answer, I do think spaCy's done everything right here. The training loop is in your code, so you're free to call the correct function, numpy.random.shuffle(), given the (unexpected) data type you're using. We're also duck-typing correctly, so that you can use the data type you find convenient.
This is actually an example of why we try to avoid "stealing the control flow". If we have a choice between a function that operates on a sequence and a function you call within a loop, we prefer to let you write the loop. This makes the API a bit less concise than sklearn's .fit() method, but it does give you more control.
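For example, with the v2 textcat setup this thread is based on, the whole loop lives in your code (a sketch; details will differ from your script):

import random
import spacy
from spacy.util import minibatch

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
nlp.add_pipe(textcat)
textcat.add_label("POSITIVE")

# Toy data; as a plain list, random.shuffle is fine here
train_data = [
    ("good movie", {"cats": {"POSITIVE": 1.0}}),
    ("awful movie", {"cats": {"POSITIVE": 0.0}}),
]

optimizer = nlp.begin_training()
for epoch in range(10):
    # The loop is yours: shuffling, batching and updating are explicit,
    # so you pick the shuffle function that matches your container
    # (numpy.random.shuffle if train_data is an ndarray).
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)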
Indeed, using np.random.shuffle() does seem to solve the issue. I was miles away from thinking the issue would come from random.shuffle.
I definitely agree with your point of view concerning what is expected of spaCy as a library.
Thank you for your time on this matter :)
How to reproduce the behaviour
I am basically using the code from https://spacy.io/usage/training#textcat only adding a
training_data = np.array(train_data)
before starting the training. Evaluation metrics seem to be significantly lowered because of it, while the loss remains the same. The only code I added is the following (full code at the end):

Normal results:

np.array results:
I thought this might be linked to numpy turning tuples into lists, but doing this manually myself does not change the performance at all. To be more precise, this only happens when loading a pretrained model, not when using a blank one. I had this happen using French models as well.
Info about spaCy
Code