vtempest opened this issue 4 months ago (status: Open)
Sounds interesting - have you released a version of the article extractor somewhere?
Agreed, there's a lot of room for improvement in the cite extraction algorithm used here - the extension uses a pretty naive approach. I think the main value of the extension currently is being able to quickly construct your own cite using keyboard shortcuts, since the initial accuracy is pretty low in a lot of cases.
Automatically closing as stale.
Aaron, thank you so much for maintaining all these algorithms. I have spent years reviewing and improving them, and I've been able to get in contact with many interested people.
My multi-pronged strategy: I've developed a full article extractor with cite parsing. It combines all of the approaches below to weight the best candidates for author, date, source, and title. It first looks at metadata under common names, then at DOM classes and IDs. I cannot tell you how many edge cases and errors there are in practice.
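A rough sketch of the candidate-weighting idea, not the actual extractor: every value found for a field gets a score based on where it was found, and values found in several places accumulate score. The source names and weights in `SOURCE_WEIGHTS` and the `rankCandidates` helper are made up for illustration.

```javascript
// Hypothetical trust weights per source: metadata beats DOM heuristics.
const SOURCE_WEIGHTS = { meta: 1.0, jsonld: 0.9, domClass: 0.6, domId: 0.5 };

// candidates: array of { field, value, source }.
// Returns { field: [{ value, score }, ...] } sorted by descending score.
function rankCandidates(candidates, weights = SOURCE_WEIGHTS) {
  const SEP = "\u0000"; // separator that should not occur in real values
  const scores = new Map();
  for (const { field, value, source } of candidates) {
    const clean = String(value).trim();
    if (!clean) continue; // skip empty values
    const key = field + SEP + clean;
    // Unknown sources get a small default weight (an arbitrary choice).
    scores.set(key, (scores.get(key) || 0) + (weights[source] || 0.1));
  }
  const ranked = {};
  for (const [key, score] of scores) {
    const [field, value] = key.split(SEP);
    if (!ranked[field]) ranked[field] = [];
    ranked[field].push({ value, score });
  }
  for (const field of Object.keys(ranked)) {
    ranked[field].sort((a, b) => b.score - a.score);
  }
  return ranked;
}
```

Feeding it `{ field: "author", value: "Jane Doe", source: "meta" }` plus a lower-weight `"@janedoe"` from a DOM class would rank the metadata value first. Real weights would need tuning against labeled pages.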
Manually label every website -- custom ID/class selectors for 2000+ major news sites, plus rules for removing interstitial ads: https://github.com/fivefilters/ftr-site-config/tree/master This can be really useful if deployed well.
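As a sketch of how such per-site rules could be loaded, here is a small parser. It assumes the plain-text `key: value` layout (one rule per line, `#` for comments, keys allowed to repeat, XPath expressions as values), which is my understanding of the files in that repo and worth checking against it. The `parseSiteConfig` name and the sample rules are invented for illustration.

```javascript
// Parse "key: value" lines into { key: [values...] }.
// Keys can repeat (e.g. several strip rules), so every key maps to an array.
function parseSiteConfig(text) {
  const config = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line || line.startsWith("#")) continue; // blank line or comment
    const idx = line.indexOf(":"); // split on the FIRST colon only
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim();
    if (!key || !value) continue;
    if (!config[key]) config[key] = [];
    config[key].push(value);
  }
  return config;
}

// Made-up sample rules, not copied from the repository.
const sample = [
  "# example.com",
  "title: //h1[@class='headline']",
  "author: //span[@class='byline']//a",
  "strip: //div[@class='ad-interstitial']",
  "strip: //aside",
  "test_url: https://example.com/news/1",
].join("\n");

const siteRules = parseSiteConfig(sample);
```

Splitting on the first colon only keeps URLs and XPath axes such as `following-sibling::p` intact in the value.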
Compare String Similarity - i.e., so many class names are vaguely similar to 'author' or 'byline' that we can use a string-similarity algorithm (Dice coefficient or Levenshtein distance) to catch similar-looking classes. Working code is all here: https://github.com/aceakash/string-similarity/blob/master/src/index.js (see `function findBestMatch(mainString, targetStrings)`).
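For illustration, a minimal standalone sketch of bigram-based Dice similarity applied to class names. This is my own small version, not the linked library's code; `diceSimilarity` and `bestClassMatch` are hypothetical names, and stripping non-alphanumeric characters before comparing is an assumption that suits hyphenated class names.

```javascript
// Count character bigrams after lowercasing and dropping separators.
function bigrams(s) {
  const t = s.toLowerCase().replace(/[^a-z0-9]/g, "");
  const counts = new Map();
  for (let i = 0; i < t.length - 1; i++) {
    const g = t.slice(i, i + 2);
    counts.set(g, (counts.get(g) || 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|).
function diceSimilarity(a, b) {
  const A = bigrams(a);
  const B = bigrams(b);
  let sizeA = 0, sizeB = 0, overlap = 0;
  for (const n of A.values()) sizeA += n;
  for (const n of B.values()) sizeB += n;
  if (sizeA === 0 || sizeB === 0) return 0; // too short to compare
  for (const [g, n] of A) overlap += Math.min(n, B.get(g) || 0);
  return (2 * overlap) / (sizeA + sizeB);
}

// Return the target keyword most similar to a class name.
function bestClassMatch(className, targets) {
  let best = { target: null, score: 0 };
  for (const target of targets) {
    const score = diceSimilarity(className, target);
    if (score > best.score) best = { target, score };
  }
  return best;
}
```

A call like `bestClassMatch("post-byline-text", ["author", "byline", "date"])` picks `byline`. In practice a minimum score threshold is needed so unrelated class names are not forced onto a field.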
Human / Organization Named Entity Recognition - this is a must. I have created a list of common names and orgs from wiki to validate whether a detected author is a human name. That is critical for formatting (knowing whether to reverse the name in the cite) and for checking whether the detected string is the author's name or just their Twitter handle. There are so many false positives with your algorithm.
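A toy sketch of that validation step, assuming a first-name lookup set and a few organization keywords. The tiny `KNOWN_FIRST_NAMES` set and `ORG_HINTS` pattern are placeholders standing in for the real lists, and `classifyAuthor` is a hypothetical helper.

```javascript
// Placeholder data: a real list would hold many thousands of entries.
const KNOWN_FIRST_NAMES = new Set(["jane", "john", "maria", "aaron"]);
const ORG_HINTS = /\b(news|press|times|staff|inc|university|agency)\b/i;

// Classify a detected byline and produce the string to use in the cite.
function classifyAuthor(raw) {
  const text = raw.trim().replace(/^by\s+/i, ""); // drop a leading "By "
  if (/^@\w+$/.test(text)) {
    return { kind: "handle", cite: null }; // social handle, not a name
  }
  if (ORG_HINTS.test(text)) {
    return { kind: "org", cite: text }; // organizations are never reversed
  }
  const words = text.split(/\s+/);
  if (words.length >= 2 && KNOWN_FIRST_NAMES.has(words[0].toLowerCase())) {
    const last = words[words.length - 1];
    const rest = words.slice(0, -1).join(" ");
    return { kind: "person", cite: last + ", " + rest }; // "Last, First"
  }
  return { kind: "unknown", cite: text }; // leave as-is for later review
}
```

Only values classified as `person` get reversed, so an organization byline or an unrecognized string is never mangled into "Last, First" form.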
Transformers.js - run small LLMs for classification in the browser at near-native speeds, with ONNX models running on a WASM runtime. They have lots of demos, use cases, and ways to classify text here: https://huggingface.co/docs/transformers.js/index
Train a custom zero-shot classifier for author/date/org etc. based on Sentence-BERT. Here is a working demo: https://huggingface.co/spaces/Xenova/zero-shot-classification-demo
LLM Candidates -- Overall, there are still so many false positives and edge cases that overcorrecting does more harm than good. The bottom line is that the best approach is to get the top candidates for each category and prompt Llama 3 on Groq, or another LLM, to choose among them.
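A sketch of the prompt-building side of this, leaving out the actual API call since that depends on the provider. The prompt wording, the JSON reply format, and the `buildCitePrompt` / `parseCiteResponse` helpers are all assumptions for illustration. The parser only accepts values that were among the candidates, so a hallucinated answer is dropped rather than cited.

```javascript
const FIELDS = ["author", "date", "source", "title"];

// candidates: { author: [...], date: [...], ... }, best candidates first.
function buildCitePrompt(candidates, maxPerField = 3) {
  const lines = [
    "Pick the best value for each citation field from the candidates.",
    "Reply with a JSON object with keys author, date, source, title.",
    "Use null when no candidate fits.",
    "",
  ];
  for (const field of FIELDS) {
    const top = (candidates[field] || []).slice(0, maxPerField);
    lines.push(field + ": " + JSON.stringify(top));
  }
  return lines.join("\n");
}

// Validate the model reply: keep only values that were offered as candidates.
function parseCiteResponse(responseText, candidates) {
  let parsed;
  try {
    parsed = JSON.parse(responseText);
  } catch (err) {
    return null; // reply was not valid JSON
  }
  if (parsed === null || typeof parsed !== "object") return null;
  const result = {};
  for (const field of FIELDS) {
    const allowed = candidates[field] || [];
    result[field] = allowed.includes(parsed[field]) ? parsed[field] : null;
  }
  return result;
}
```

The string from `buildCitePrompt` would be sent as the user message to whichever chat model is used, and the reply text passed to `parseCiteResponse` along with the same candidates.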