flora-pm / flora-server

A package index for the Haskell ecosystem
https://flora.pm/about

Tag extraction from packages #115

Open tchoutri opened 2 years ago

tchoutri commented 2 years ago

We can extract tags from packages using RAKE (Rapid Automatic Keyword Extraction). The raw output will need tuning and filtering, at first by hand and later in a more automated way. Datalog may be useful here.
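For anyone unfamiliar with RAKE, here is a rough sketch of the scoring idea in Python, not a proposal for how flora-server should implement it. Text is cut into candidate phrases at stopwords and punctuation, each word is scored as degree / frequency, and a phrase scores the sum of its words. The stopword list, the function names and the sample description are all made up for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); real use needs a fuller one.
STOPWORDS = frozenset(
    "a an and are as for from in is of on or that the this to with".split()
)


def candidate_phrases(text):
    """Split text into runs of non-stopwords, also breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # total length of the phrases each word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    description = "A fast parser for JSON. JSON parser combinators with streaming support."
    for phrase, score in rake(description):
        print(f"{score:4.1f}  {phrase}")
```

Longer phrases naturally score higher, which is one reason the output needs filtering before it can be used as tags.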

This requires the following:

In terms of normalisation, we can learn a great deal from lib.rs:

I normalize keywords to kebab-case, except CJK and a few exceptions like "iOS" which looks silly.

I had to manage synonyms mostly manually: https://gitlab.com/crates.rs/crates.rs/-/blob/main/data/tag-synonyms.csv

Joining adjacent keywords into pairs helps ["data", "structures"] => ["data-structures"].

Each keyword has a weight, and for similarity search I add hidden keywords: https://gitlab.com/crates.rs/crates.rs/-/blob/main/crate_db/src/lib_crate_db.rs#L306

For keyword extraction I take markdown sections into account: https://gitlab.com/crates.rs/crates.rs/-/blob/main/feat_extractor/src/lib.rs#L44 and use only never-seen-before sentences.

https://mastodon.social/@kornel/109508654611639728
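To make the normalisation steps from the quote concrete, here is a small Python sketch of kebab-casing with exceptions, a synonym lookup and joining adjacent keywords. It is not lib.rs's actual code: the tables below are invented placeholders, the CJK check is a rough character-range test, and the quote does not say how lib.rs decides which pairs to join, so a fixed set of known pairs is assumed here.

```python
import re

# Hypothetical data for illustration; lib.rs keeps a hand-maintained CSV of synonyms.
SYNONYMS = {"json-parser": "json"}
EXCEPTIONS = {"ios": "iOS"}                 # spellings that should not be kebab-cased
KNOWN_PAIRS = {("data", "structures")}      # adjacent keywords assumed to form one tag

# Rough check for kana, CJK ideographs and Hangul syllables.
CJK = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")


def to_kebab(keyword):
    """Kebab-case a keyword, leaving CJK text and listed exceptions alone."""
    if CJK.search(keyword):
        return keyword
    if keyword.lower() in EXCEPTIONS:
        return EXCEPTIONS[keyword.lower()]
    # Insert a space at camelCase boundaries, then turn separators into hyphens.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", keyword)
    return re.sub(r"[\s_\-]+", "-", spaced.strip()).lower()


def join_pairs(keywords):
    """Merge adjacent keywords that form a known pair."""
    out = []
    i = 0
    while i < len(keywords):
        if i + 1 < len(keywords) and (keywords[i], keywords[i + 1]) in KNOWN_PAIRS:
            out.append(keywords[i] + "-" + keywords[i + 1])
            i += 2
        else:
            out.append(keywords[i])
            i += 1
    return out


def normalize(keywords):
    """Kebab-case, join pairs, map synonyms, and drop duplicates in order."""
    result = []
    for keyword in join_pairs([to_kebab(k) for k in keywords]):
        keyword = SYNONYMS.get(keyword, keyword)
        if keyword not in result:
            result.append(keyword)
    return result


if __name__ == "__main__":
    print(normalize(["Data", "Structures", "iOS", "JSON parser"]))
```

Keyword weights and the hidden keywords used for similarity search are left out of this sketch.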

tchoutri commented 1 year ago

@qw04 will take this