
advice on a particular task? #411

Closed · bdewilde closed this 3 years ago

bdewilde commented 3 years ago

Hi! Now that spacy v3 and thinc v8 are close to release, I'm starting to explore ways to leverage the new functionality in textacy. One particular task that I'd love to implement via thinc and/or spacy is language identification, but I'm embarrassed to admit that I haven't been able to make it work.

Here's the method, which closely follows Google's CLD3 (tbd, pending experimentation / optimization):

  1. extract character ngrams from input text for n = [1, 2, 3]
  2. for each n, count each distinct character ngram, then divide by the total count to get relative fractions (see the sketch just after this list)
  3. represent each distinct character ngram as a dense embedding vector learned during training
  4. for each n, average embedding vectors together weighted by relative fractions of occurrence
  5. concatenate averaged embedding vectors for each n
  6. pass the concatenated embedding through a hidden Relu layer and on to an output Softmax layer for prediction
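
To make steps 1-2 concrete, here's a tiny plain-Python sketch (no thinc involved; `ngram_fractions` is just an illustrative name):

from collections import Counter

def ngram_fractions(text, n):
    # step 1: overlapping character ngrams
    grams = [text[i : i + n] for i in range(len(text) - n + 1)]
    # step 2: counts of distinct ngrams, normalized by the total count
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

print(ngram_fractions("banana", 2))
# {'ba': 0.2, 'an': 0.4, 'na': 0.4}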

The current setup uses scikit-learn's hashing vectorizer and multilayer perceptron classifier: https://github.com/chartbeat-labs/textacy/blob/master/src/textacy/lang_utils.py#L191-L222 . It's fine but not great. Here's a pseudo-code thinc version I've come up with:

chain(
    concatenate(
        # one branch per ngram size; still unsure where get_normalized_counts fits
        *[
            chain(extract_ngrams(n), HashEmbed(nO=width, nV=n_vocab), reduce_mean())
            for n in [1, 2, 3]
        ]
    ),
    Relu(nO=n_hidden, dropout=dropout),  # hidden layer
    Softmax(),  # output layer
)
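
For the record, a fuller (untested) sketch along those lines in Thinc v8's actual API: `chain`, `concatenate`, `with_array`, `HashEmbed`, `reduce_mean`, `Relu`, and `Softmax` are real Thinc layers and combinators, while `ngram_hasher` is a hypothetical custom layer written here just to show the shape of a plain-text input layer. One observation: taking `reduce_mean` over every ngram occurrence (repeats included) already computes the frequency-weighted average of steps 2 and 4, so no separate normalized-counts step should be needed.

import zlib
from thinc.api import (
    Model, chain, concatenate, with_array, HashEmbed, Relu, Softmax, reduce_mean,
)
from thinc.types import Ragged

def ngram_hasher(n: int) -> Model:
    # hypothetical custom layer: List[str] -> Ragged of ngram ids, one row per occurrence
    def forward(model, texts, is_train):
        ids, lengths = [], []
        for text in texts:
            grams = [text[i : i + n] for i in range(len(text) - n + 1)] or [""]
            # stable hash, masked to stay int32-safe; HashEmbed re-hashes the ids anyway
            ids.extend(zlib.crc32(g.encode("utf8")) & 0x7FFFFFFF for g in grams)
            lengths.append(len(grams))
        output = Ragged(model.ops.asarray2i([[i] for i in ids]), model.ops.asarray1i(lengths))
        return output, lambda d_output: []  # nothing to backprop into raw text
    return Model(f"ngram_hasher_{n}", forward)

model = chain(
    concatenate(
        *[
            chain(ngram_hasher(n), with_array(HashEmbed(nO=16, nV=2000)), reduce_mean())
            for n in (1, 2, 3)
        ]
    ),
    Relu(nO=64, dropout=0.2),  # hidden layer
    Softmax(),  # output layer; nO inferred at initialize time from sample labels
)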

I've been struggling with data validation errors in those initial transformation steps, and haven't been able to make it over the hump into the more familiar territory of relu and softmax layers. Any advice you can offer would be hugely appreciated! And totally understand if you don't want to get into the habit of advising users on proper usage of your tools... 😇 Just figured I'd ask, in case I'm totally on the wrong track.

svlandeg commented 3 years ago

Hi Burton! It would be great to see a Thinc-version of your implementation, I hope you can get this working :-)

My main advice would be to think about the input and output data types flowing through your layers. I would make a schematic overview of the different layers you need and of their types, to see whether everything matches. Matching types don't guarantee that your final model is correct, but mismatched types definitely mean there's a problem, and they're a likely source of your data validation errors.
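
One concrete way to act on that: initialize the model with a small sample of real inputs and outputs, which lets Thinc infer the missing dimensions and fail early if shapes don't line up. A hedged sketch on a toy model, using only documented API:

import numpy
from thinc.api import chain, Relu, Softmax

model = chain(Relu(nO=8), Softmax())
X = numpy.zeros((5, 4), dtype="f")  # 5 samples, 4 features
Y = numpy.zeros((5, 3), dtype="f")  # 5 samples, 3 classes
model.initialize(X=X, Y=Y)  # infers nI=4 from X and nO=3 from Y
print(model.get_dim("nI"), model.get_dim("nO"))  # 4 3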

And as an even more generic piece of advice: start with the simplest model first, like just one n-gram model with the minimal number of layers on top needed to reach a correct output shape. See if that trains, even if the final result is bad: the loss should still go down, and you should be able to overfit on just a few examples. Then start making the model more complex.
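
As a rough sketch of that overfitting sanity check (hedged; `model`, `X`, and `Y` are assumed to be your model and one small batch of examples):

from thinc.api import Adam, CategoricalCrossentropy

optimizer = Adam(0.001)
calc_loss = CategoricalCrossentropy()
for step in range(200):
    Yh, backprop = model.begin_update(X)  # forward pass, keeping state for backprop
    d_Yh, loss = calc_loss(Yh, Y)  # gradient of the loss w.r.t. the predictions
    backprop(d_Yh)  # backward pass, accumulating gradients
    model.finish_update(optimizer)  # apply the accumulated gradients
    if step % 50 == 0:
        print(step, float(loss))  # should trend towards zero on a tiny batch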

You may have already found this, but spaCy's Thinc model implementations are all here: https://github.com/explosion/spaCy/tree/develop/spacy/ml/models. The textcat ones in particular should serve as inspiration for your specific challenge.

I hope this helps at least some ;-)

bdewilde commented 3 years ago

Thanks @svlandeg! I did try more or less what you suggested -- starting simple with unigrams only, then building up -- but got distracted mid-work by a paper showing great performance with a transformer-based language identification model, which led me over to huggingface. So much for simple! 😄

After several hours hacking with thinc, I think what I most needed was a wider variety of usage examples, even micro-examples for particular layers / components of the library. For example, I spent an embarrassing amount of time tinkering with with_array() to make it work with my other layers. With a more established lib like PyTorch, there are tons of usage examples floating around the web, and you can usually find something at least similar to what you want to do; thinc is still new and a bit niche. Building a model on raw characters rather than, say, spaCy tokens meant that I had to extrapolate further from the examples in spacy/ml/models, and then I got lost in the technical weeds.
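
A micro-example of with_array() for anyone who hits the same wall (hedged, but using only documented Thinc API): it wraps a layer that operates on a single 2d array so that it also accepts a list of arrays (or Ragged/Padded data), applying the inner layer to the flattened data and then restoring the original structure:

import numpy
from thinc.api import with_array, Relu

model = with_array(Relu(nO=4))
Xs = [numpy.zeros((2, 3), dtype="f"), numpy.zeros((5, 3), dtype="f")]
model.initialize(X=Xs)
Ys = model.predict(Xs)
print([y.shape for y in Ys])  # [(2, 4), (5, 4)]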

I'm not entirely sure that I'm a target user for thinc, so maybe my difficulties aren't representative or common. And I have a nagging suspicion that I'm fundamentally misunderstanding something about how this lib actually works. 😅 But will follow up if/when I have a fancy new language identification model! Feel free to close this issue out in the meantime, I'm sure y'all don't want to get in the habit of offering task support via GitHub issue. Thanks again!

svlandeg commented 3 years ago

You're right that we don't have the resources to provide that level of support, but it's good to hear your feedback nonetheless. We are working on some tutorials/videos that provide more technical details & explanations about various new concepts in spaCy 3 and Thinc 8, and we'll also address implementing models in Thinc. I see your point that more examples would be useful!

FYI - in case you hadn't found it yet, just wanted to point you towards https://github.com/explosion/spacy-transformers, which provides support for HuggingFace Transformer models in your spaCy pipeline, more docs here: https://nightly.spacy.io/usage/embeddings-transformers

I actually wonder whether you couldn't rely on a built-in spaCy textcat architecture, swapping out the tok2vec sublayer in favour of a Transformer?
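
Something along these lines, perhaps (an untested sketch; the architecture and factory names are taken from the spaCy v3 / spacy-transformers docs, so double-check them against the version you're on):

import spacy

# assumes spacy-transformers is installed, which registers the "transformer" factory
nlp = spacy.blank("en")
nlp.add_pipe("transformer")  # default config loads a pretrained transformer model
nlp.add_pipe("textcat", config={
    "model": {
        "@architectures": "spacy.TextCatCNN.v2",
        "exclusive_classes": True,
        "tok2vec": {
            "@architectures": "spacy-transformers.TransformerListener.v1",
            "grad_factor": 1.0,
            "pooling": {"@layers": "reduce_mean.v1"},
        },
    },
})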

Anyway, so many options ;-)