Open tchoutri opened 2 years ago
We can extract tags from packages using RAKE (Rapid Automatic Keyword Extraction). This will require tuning and filtering, including more automated filtering. The datalog can be useful.
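For reference, here is a minimal sketch of how RAKE scores candidate phrases, assuming plain-text input such as a package description. The stopword list and the function names `candidate_phrases` and `rake` are hypothetical and only for illustration; this is not existing project code.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); real use needs a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "with", "is", "on"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation (anything that is not a word character, whitespace or hyphen)
    # always ends a phrase.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # summed length of the phrases each word occurs in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    # A phrase's score is the sum of degree/frequency over its words.
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in phrases
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rake("A fast parser for JSON data. Streaming JSON parser with incremental decoding.")
```

Longer phrases made of words that co-occur with many others rank highest, which is why the raw output still needs the tuning and filtering mentioned above.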
This requires the following:
In terms of normalisation, we can learn a great deal from lib.rs:
I normalize keywords to kebab-case, except CJK and a few exceptions like "iOS" which looks silly. I had to manage synonyms mostly manually: https://gitlab.com/crates.rs/crates.rs/-/blob/main/data/tag-synonyms.csv Joining adjacent keywords into pairs helps ["data", "structures"] => ["data-structures"]. Each keyword has a weight, and for similarity search I add hidden keywords: https://gitlab.com/crates.rs/crates.rs/-/blob/main/crate_db/src/lib_crate_db.rs#L306 For keyword extraction I take markdown sections into account: https://gitlab.com/crates.rs/crates.rs/-/blob/main/feat_extractor/src/lib.rs#L44 and use only never-seen-before sentences.
https://mastodon.social/@kornel/109508654611639728
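The normalisation steps described in the quote (kebab-casing with exceptions, synonym mapping, joining adjacent keywords) could be sketched roughly as follows. Every table entry and function name here is hypothetical: the real lib.rs synonyms live in the linked `tag-synonyms.csv`, and this is not how lib.rs itself is implemented. Keyword weights and hidden keywords are left out.

```python
import re

# Hypothetical tables for illustration only.
SYNONYMS = {"serialisation": "serialization"}
KEEP_AS_IS = {"iOS"}                    # exceptions that should not be kebab-cased
KNOWN_PAIRS = {("data", "structures")}  # adjacent keywords worth joining


def to_kebab(keyword):
    """Lowercase a keyword and join its parts with hyphens, unless it is an exception."""
    if keyword in KEEP_AS_IS:
        return keyword
    # Insert a space at camelCase boundaries, then collapse separators into hyphens.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", keyword)
    return re.sub(r"[\s_\-]+", "-", spaced.strip()).lower()


def join_pairs(keywords):
    """Merge adjacent keywords that form a known pair, e.g. data + structures."""
    out = []
    i = 0
    while i < len(keywords):
        pair = tuple(keywords[i:i + 2])
        if len(pair) == 2 and pair in KNOWN_PAIRS:
            out.append("-".join(pair))
            i += 2
        else:
            out.append(keywords[i])
            i += 1
    return out


def normalize(keywords):
    """Kebab-case, join known pairs, map synonyms, and drop duplicates in order."""
    joined = join_pairs([to_kebab(k) for k in keywords])
    seen = set()
    result = []
    for keyword in joined:
        keyword = SYNONYMS.get(keyword, keyword)
        if keyword not in seen:
            seen.add(keyword)
            result.append(keyword)
    return result


tags = normalize(["Data", "Structures", "iOS", "Serialisation", "serialization"])
```

In this sketch the synonym lookup runs after pair joining, so a joined tag such as `data-structures` can itself be mapped to a canonical form if the table lists it.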
@qw04 will take this