Javascript Version has landed. 🚀

vtempest commented 2 months ago

I have translated all 21 files into JavaScript, using npm libraries such as:


    "axios": "^0.21.1",
    "chardet": "^1.3.0",
    "cheerio": "^1.0.0-rc.10",
    "commander": "^8.0.0",
    "html-escaper": "^3.0.3",
    "jsdom": "^16.6.0",
    "langid": "^0.0.1"

I have also translated courlan and htmldate. I am hoping to test the port and integrate it into https://airesearch.wiki/

🚀 🚀 Also improve the accuracy with 1000+ site-specific selectors: https://github.com/fivefilters/ftr-site-config
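
To illustrate how such site-specific selectors might be consumed from JavaScript, here is a minimal sketch. It assumes the plain-text layout used by files in that repository (one `directive: value` pair per line, `#` for comments, repeated directives allowed). The function name `parseSiteConfig` and the sample rules for example.com are hypothetical and are not taken from the port or from ftr-site-config.

```javascript
// Parse a site-config text into { directive: [values...] }.
// Assumption: one "directive: value" pair per line, "#" starts a comment.
function parseSiteConfig(text) {
  const config = Object.create(null);
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;

    // Split at the first colon only: values (XPath, URLs) may contain colons.
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    const value = line.slice(sep + 1).trim();
    if (key === "" || value === "") continue;

    // A directive such as "strip" can appear several times, so keep a list.
    if (!config[key]) config[key] = [];
    config[key].push(value);
  }
  return config;
}

// Hypothetical rules, written in the style of a site-config file.
const sampleConfig = `
# hypothetical rules for example.com
title: //h1[@class='headline']
body: //div[@id='article-body']
author: //span[@class='byline']
strip: //div[@class='ad']
strip: //aside
test_url: https://example.com/news/1
`;

const siteRules = parseSiteConfig(sampleConfig);
console.log(siteRules.body, siteRules.strip);
```

The resulting XPath strings could then be handed to whatever XPath evaluator the extractor already uses, with the generic heuristics as a fallback when no site file matches.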

adbar commented 2 months ago

@vtempest I don't understand, you don't mention Trafilatura in the docs or in the code, is this a fork or an alternative?

According to extract-content.js you're using Postlight and Readability.

As a side note, if you wish to integrate Python packages in a JS environment there are tools for that.

vtempest commented 2 months ago

Pull request: https://github.com/adbar/trafilatura/pull/692

JS docs: https://airesearch.wiki/classes/src_extract_content_core.Extractor.html

I have translated 35 Python files from trafilatura to JavaScript, along with htmldate and courlan. The port removes the CLI parts and relies only on these npm packages:

    "axios": "^0.21.1",
    "xpath": "^0.0.34",
    "jsdom": "^16.6.0"

History of HTML Content Extraction:

    Tractor the Text Extractor (2024), based on this JS port
    Trafilatura (2020-)
    Mozilla (2015-)
    Arc90 (2010)

I know it's possible to call Python from a server that JS talks to, but it is much simpler to have a front-end-only JS function extract(html). That is what it should have been to begin with (or even Rust compiled to WASM and run from JS), because it makes the extractor easier to use from Chrome extensions and mobile apps without needing a server.

Code And Docs: https://airesearch.wiki/classes/src_extract_content_core.Extractor.html

More Testing Needed

The functions are translated and the flow prints output, but it is not yet extracting the main content. You're the only one in the world who knows why. Let's work together on making this happen and then build awesome LLM add-ons later.

Future Ideas Plan

My multi-prong strategy: I've developed a full article extractor with citation parsing. It combines all of the approaches below to weight the best candidates for author, date, source, and title. It looks first at common metadata names, then at DOM classes and IDs. I cannot tell you how many edge cases and errors there are in practice.
Manually label every website: custom id/class selectors for every major news site (2000+ sites), including how to remove interstitial ads. https://github.com/fivefilters/ftr-site-config/tree/master This can be really useful if deployed well.
Compare string similarity: so many class names are vaguely similar to 'author' or 'byline' that we can use a string-similarity algorithm (Dice coefficient or Levenshtein distance) to catch similar-looking classes. Working code is here: https://github.com/aceakash/string-similarity/blob/master/src/index.js (see function findBestMatch(mainString, targetStrings)).
Human / organization named entity recognition: this is a must. I have created a list of common names and organizations from wiki to train validation of whether an author is a human name. That is critical for formatting: to know whether to reverse the name in the citation, and to check whether the detected value is the author's name or just their Twitter handle.
Transformers.js: run small models for classification in the browser at near-native speed, using ONNX models executed through WASM. They have lots of demos, use cases, and ways to classify text here: https://huggingface.co/docs/transformers.js/index
Train a custom zero-shot classifier for author/date/org etc. based on Sentence-BERT. Here is a working demo: https://huggingface.co/spaces/Xenova/zero-shot-classification-demo
LLM candidates: overall, there are still so many false positives and edge cases that overcorrecting does more harm than good. The bottom line is to get the top candidates for each category and then prompt Groq Llama3 or another model.
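
The string-similarity idea above can be sketched in a few lines. This is a minimal illustration that assumes the Dice coefficient on character bigrams (not Levenshtein distance); it is not the code of the linked library. The names `diceSimilarity` and `bestClassMatch`, and the default target list, are hypothetical.

```javascript
// Lower-case and drop whitespace so "By Line" and "byline" compare equal.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, "");
}

// Dice coefficient on character bigrams:
// 2 * (shared bigrams) / (bigrams in a + bigrams in b), in the range 0..1.
function diceSimilarity(a, b) {
  const x = normalize(a);
  const y = normalize(b);
  if (x === y) return 1;
  if (x.length < 2 || y.length < 2) return 0;

  // Count the bigrams of the first string.
  const counts = new Map();
  for (let i = 0; i < x.length - 1; i++) {
    const pair = x.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }

  // Consume matching bigrams from the second string (multiset intersection).
  let shared = 0;
  for (let i = 0; i < y.length - 1; i++) {
    const pair = y.slice(i, i + 2);
    const remaining = counts.get(pair) || 0;
    if (remaining > 0) {
      counts.set(pair, remaining - 1);
      shared++;
    }
  }
  return (2 * shared) / (x.length - 1 + (y.length - 1));
}

// Return the target label most similar to a DOM class name, with its score.
function bestClassMatch(className, targets = ["author", "byline"]) {
  let best = { target: null, score: 0 };
  for (const target of targets) {
    const score = diceSimilarity(className, target);
    if (score > best.score) best = { target, score };
  }
  return best;
}

console.log(bestClassMatch("article-author-name"));
console.log(bestClassMatch("post-bylines"));
```

In practice a score threshold would be needed before trusting a match, since long class names dilute the score; the cut-off value is something to tune against real pages.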

Thank you very much!

adbar commented 2 months ago

Hi @vtempest, now I see, this is impressive.

However, I believe it is out of scope for the current state of the repository. I'm developing and maintaining a Python package and I lack the time to test and evaluate the code you translated into JavaScript.

In essence you've forked the software. I recommend that you publish it as a trafilatura-js package (also htmldate-js etc.); see for instance the Go port go-trafilatura.

That way the users know what it's about and get a chance to use it without importing your whole research agent software. You could also import and use your fork there. The idea is that if you split the work you may get maintainers to help you on selected parts – including me, I'll check it out if I get a chance.

adbar / trafilatura

Javascript Version has landed. 🚀 #688
