Using LLMs for the ultimate Cite Extraction algorithm


Aaron, thank you so much for maintaining all these algorithms. I have spent years reviewing and improving them, and I've been able to get in contact with many interested people.

My multi-prong strategy: I've developed a full article extractor with cite parsing. It combines all of the approaches below to weight the best candidates for author, date, source, and title. It looks first at metadata under common names, then at DOM classes and IDs. I cannot tell you how many edge cases and errors there are in practice.
Manually label every website -- Custom id/class selectors for every major news site (2000+ sites), including rules for removing interstitial ads. The configs are at https://github.com/fivefilters/ftr-site-config/tree/master and can be really useful if deployed well.
Compare String Similarity -- So many class names are only vaguely similar to 'author' or 'byline'. We can use a string-similarity measure (Dice coefficient or Levenshtein distance) to catch similar-looking classes. Working code is all here: https://github.com/aceakash/string-similarity/blob/master/src/index.js (see function findBestMatch(mainString, targetStrings)).
Human / Organization Named Entity Recognition -- This is a must. I have created a list of common names and orgs from wiki to train a validator that checks whether a detected author is a human name. That is critical for formatting (knowing whether to reverse the name in the cite) and for validating that the match is the author's name and not just their Twitter handle. Your algorithm produces many false positives here.
Transformers.js -- Run small LLMs for classification in the browser at near-native speed, using ONNX models executed through WASM. They have lots of demos, use cases, and ways to classify text here: https://huggingface.co/docs/transformers.js/index
Train a custom Zero-Shot Classifier for author/date/org etc. based on Sentence BERT. Here is a working demo: https://huggingface.co/spaces/Xenova/zero-shot-classification-demo
LLM Candidates -- Overall, there are still so many false positives and edge cases that overcorrecting with more rules does more harm than good. The bottom line is to collect the top candidates for each category and then prompt Groq Llama3 or another LLM to choose among them.
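
The class-name similarity step and the final "top candidates, then prompt" step can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: `diceCoefficient` is a small self-contained bigram implementation of the Dice coefficient (written here so the example runs without dependencies), and `AUTHOR_KEYWORDS`, `authorClassScore`, `topCandidates`, `buildAuthorPrompt`, and the candidate shape `{ className, text }` are all hypothetical names chosen for the example. The prompt wording is likewise an assumption, and the call to the LLM itself is left out.

```javascript
// Sørensen-Dice coefficient over character bigrams (0 = no overlap, 1 = identical).
function diceCoefficient(a, b) {
  const s1 = a.toLowerCase().replace(/\s+/g, "");
  const s2 = b.toLowerCase().replace(/\s+/g, "");
  if (s1 === s2) return 1;
  if (s1.length < 2 || s2.length < 2) return 0;

  // Count each bigram of the first string.
  const counts = new Map();
  for (let i = 0; i < s1.length - 1; i++) {
    const bigram = s1.slice(i, i + 2);
    counts.set(bigram, (counts.get(bigram) || 0) + 1);
  }

  // Consume matching bigrams while walking the second string.
  let overlap = 0;
  for (let i = 0; i < s2.length - 1; i++) {
    const bigram = s2.slice(i, i + 2);
    const remaining = counts.get(bigram) || 0;
    if (remaining > 0) {
      counts.set(bigram, remaining - 1);
      overlap++;
    }
  }
  return (2 * overlap) / (s1.length - 1 + (s2.length - 1));
}

// Hypothetical keyword list; a real one would be longer and tuned per field.
const AUTHOR_KEYWORDS = ["author", "byline"];

// Best similarity between a class name and any author-like keyword.
function authorClassScore(className) {
  return Math.max(...AUTHOR_KEYWORDS.map((k) => diceCoefficient(className, k)));
}

// Keep only the highest-scoring candidates (assumed shape: { className, text }).
function topCandidates(candidates, limit = 3) {
  return [...candidates]
    .sort((x, y) => authorClassScore(y.className) - authorClassScore(x.className))
    .slice(0, limit);
}

// Build a short prompt asking an LLM to pick among the surviving candidates.
function buildAuthorPrompt(candidates) {
  const lines = candidates.map(
    (c, i) => `${i + 1}. ${c.text} (class="${c.className}")`
  );
  return [
    "Which of these candidates is the article's author? Answer with the number only, or 0 if none.",
    ...lines,
  ].join("\n");
}

// Example usage with made-up DOM candidates.
const candidates = [
  { className: "footer-nav", text: "Home" },
  { className: "article-author", text: "Jane Doe" },
  { className: "by-line", text: "By Jane Doe" },
];
console.log(buildAuthorPrompt(topCandidates(candidates, 2)));
```

The point of the design is that the similarity score only ranks and prunes candidates. The final decision is left to the LLM, which avoids stacking ever more hand-written rules on top of noisy class names.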

ashtarcommunications / cite-creator, issue #1