RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

Blog post Tag prediction/recommendation? #145

Open elfeffe opened 3 years ago

elfeffe commented 3 years ago

I want to recommend tags for my blog posts. I will have the post text as input, and I need to receive tags as output. Any recommendation about where to begin? Is there any example I can look at?

carmelosantana commented 3 years ago

I know you're not directly asking a blog or WordPress question but I can address at least the immediate issue. There are SEO plugins for WordPress that can help with keyword recommendation. This could be used for tags as well.

I'm also curious how something like this would be implemented. The WordPress side would be easy once the tags were available.

Something to keep in mind is that this would also require some server knowledge to install and set up all the necessary dependencies.

andrewdalpino commented 3 years ago

Hey @elfeffe, that's a great use case!

The way you'd approach the problem with machine learning is to start labeling a portion of the blog posts in your database with tags by hand (either yourself or someone else). Pair each sample (which may include things like the title and body of the post) with a single tag, duplicating samples for posts that have multiple tags. However many samples you self-annotate will be your dataset. I'd recommend setting aside about 20% of the data for cross-validation. The bigger your dataset, the better your results will be.
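As a rough sketch of that labeling step, here is how the flattening and hold-out split might look in plain PHP. The `$posts` data is made up for illustration; in practice the resulting parallel arrays would be wrapped in the library's labeled dataset object.

```php
<?php

// Hypothetical hand-labeled posts: each post pairs its text with one or more tags.
$posts = [
    ['text' => 'Training a neural net in PHP', 'tags' => ['php', 'machine-learning']],
    ['text' => 'Caching strategies for WordPress', 'tags' => ['wordpress']],
    ['text' => 'Gradient descent explained', 'tags' => ['machine-learning']],
];

// Flatten into parallel sample/label arrays, duplicating the sample
// once for every tag the post was annotated with.
$samples = [];
$labels = [];

foreach ($posts as $post) {
    foreach ($post['tags'] as $tag) {
        $samples[] = [$post['text']];
        $labels[] = $tag;
    }
}

// Hold out roughly 20% of the pairs for cross-validation.
$split = (int) round(0.8 * count($samples));

$trainSamples = array_slice($samples, 0, $split);
$trainLabels  = array_slice($labels, 0, $split);
$testSamples  = array_slice($samples, $split);
$testLabels   = array_slice($labels, $split);
```

In practice you'd shuffle (or stratify by tag) before splitting so the hold-out set isn't biased toward whichever posts happen to come last.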

When inferring tags for new blog posts, you'll call the proba() method on a Probabilistic estimator to output the probability of every possible tag, sort the tags by their probabilities, and take the top k above a threshold as the inferred tags. The lower the threshold, the more tags you'll obtain, but they may be junk if it's set too low.
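The top-k-above-a-threshold selection can be sketched in plain PHP. The `$probabilities` array stands in for one row of proba() output (a hypothetical tag-to-probability map):

```php
<?php

// Hypothetical proba() output for one sample: tag => probability.
$probabilities = [
    'php' => 0.62,
    'machine-learning' => 0.55,
    'wordpress' => 0.08,
    'seo' => 0.31,
];

// Sort tags by probability, descending (arsort preserves keys).
arsort($probabilities);

// Keep at most $k tags whose probability clears $threshold.
$k = 3;
$threshold = 0.3;

$inferred = array_slice(
    array_keys(array_filter($probabilities, fn ($p) => $p >= $threshold)),
    0,
    $k
);
// $inferred is ['php', 'machine-learning', 'seo']
```

Raising `$threshold` trades recall for precision: fewer tags, but each one backed by higher model confidence.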

The Sentiment example is a good reference for natural language problems such as this one. The task is different, but many of the preprocessing steps are the same.

elfeffe commented 3 years ago

Great, thank you, I will check that. This is not for a WP blog; the main idea is to learn how to use this library. We have 20,000 posts with multiple tags (added by hand). I will look into how to begin. Thank you, guys. Happy New Year.

elfeffe commented 3 years ago

@andrewdalpino Would it be useful to remove common words (for, from, at, the) and remove accents (from Spanish words)? Or is it useless?

elfeffe commented 3 years ago

Ok. That’s the TF-IDF.

andrewdalpino commented 3 years ago

> @andrewdalpino Would it be useful to remove common words (for, from, at, the) and remove accents (from Spanish words)? Or is it useless?

That's a good question, and I'm not sure there's a good answer except to run some experiments and see what works best for your data. To remove common words (a.k.a. stop words) you can try a couple of different strategies. You can use the max document frequency parameter of Word Count Vectorizer to bar stop words from entering the vocabulary, or you can filter stop words from the dataset before tokenizing the blobs using Stop Word Filter.

https://docs.rubixml.com/en/latest/transformers/word-count-vectorizer.html

https://docs.rubixml.com/en/latest/transformers/stop-word-filter.html
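For illustration, the filtering itself can be done by hand in plain PHP. The stop word list below is a made-up sample (Stop Word Filter takes a user-supplied list like it), and the accent map uses an explicit character table so the result doesn't depend on locale settings:

```php
<?php

// Hypothetical stop word list mixing English and Spanish function words.
$stopWords = ['for', 'from', 'at', 'the', 'de', 'la', 'el'];

// Strip Spanish accents with an explicit byte-sequence map (strtr with an
// array replaces multi-byte UTF-8 substrings correctly).
function stripAccents(string $word): string
{
    return strtr($word, [
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'ü' => 'u', 'ñ' => 'n',
    ]);
}

$tokens = ['el', 'artículo', 'habla', 'de', 'la', 'máquina'];

// Drop stop words, then normalize accents on what remains.
$filtered = array_values(array_map(
    'stripAccents',
    array_filter($tokens, fn ($t) => !in_array($t, $stopWords, true))
));
// $filtered is ['articulo', 'habla', 'maquina']
```

Whether accent stripping helps depends on the corpus: it merges accented and unaccented spellings of the same word, which shrinks the vocabulary but can also conflate distinct words.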

> Ok. That’s the TF-IDF.

Not quite, but your intuition is good. Term Frequency - Inverse Document Frequency (TF-IDF) is a weighting scheme applied to the raw term counts produced by Word Count Vectorizer such that common words are given less weight than rarer words. This is slightly different from removing a token from the bag of words entirely, in that some weight is still given to its occurrences.
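A minimal sketch of that weighting in plain PHP, using the classic `tf * log(n / df)` formula on a toy corpus (the library's own transformer may use a smoothed variant, so treat this as the idea rather than the exact implementation):

```php
<?php

// Toy corpus of 4 documents represented as raw term counts,
// the kind of output a count vectorizer would produce.
$counts = [
    ['the' => 3, 'cat' => 1],
    ['the' => 2, 'dog' => 1],
    ['the' => 1, 'cat' => 2],
    ['dog' => 2],
];

$n = count($counts);

// Document frequency: the number of documents each term appears in.
$df = [];
foreach ($counts as $doc) {
    foreach (array_keys($doc) as $term) {
        $df[$term] = ($df[$term] ?? 0) + 1;
    }
}

// Weight each raw count by log(n / df): common terms are
// down-weighted rather than removed.
$tfidf = [];
foreach ($counts as $i => $doc) {
    foreach ($doc as $term => $tf) {
        $tfidf[$i][$term] = $tf * log($n / $df[$term]);
    }
}
// 'the' appears in 3 of 4 documents, so each occurrence carries weight
// log(4/3) ≈ 0.29, versus log(4/2) ≈ 0.69 for the rarer 'dog'.
```

A term appearing in every document would get weight log(1) = 0, which is exactly the stop-word effect, but achieved gradually instead of by a hard cutoff.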