adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Javascript Version has landed. 🚀 #688

Closed by vtempest 1 month ago

vtempest commented 2 months ago

I have translated all 21 files into JavaScript, using npm libraries such as:


    "axios": "^0.21.1",
    "chardet": "^1.3.0",
    "cheerio": "^1.0.0-rc.10",
    "commander": "^8.0.0",
    "html-escaper": "^3.0.3",
    "jsdom": "^16.6.0",
    "langid": "^0.0.1"

I have also translated courlan and htmldate. I am hoping to test this out and integrate it into https://airesearch.wiki/

🚀 🚀 Also, accuracy can be improved with 1000+ site-specific selectors: https://github.com/fivefilters/ftr-site-config

adbar commented 2 months ago

@vtempest I don't understand: you don't mention Trafilatura in the docs or in the code. Is this a fork or an alternative?

According to extract-content.js you're using Postlight and Readability.

As a side note, if you wish to integrate Python packages into a JS environment, there are tools for that.
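The comment does not name specific tools. One simple option, sketched here as an illustration only, is to call the Python package's command-line tool as a subprocess from Node. The sketch assumes the Python package is installed and that its `trafilatura` command is on PATH and reads HTML from standard input; the helper name `extractWithCli` and its options are hypothetical.

```javascript
const { spawnSync } = require('node:child_process');

// Feeds HTML to an external extractor command on stdin and returns
// whatever the command writes to stdout.
// Assumption: the Python package is installed and its `trafilatura`
// CLI is on PATH. `command` and `args` can be overridden by the caller.
function extractWithCli(html, { command = 'trafilatura', args = [], timeoutMs = 30000 } = {}) {
  const result = spawnSync(command, args, {
    input: html,        // written to the child's stdin
    encoding: 'utf8',   // return stdout/stderr as strings
    timeout: timeoutMs, // kill the child if it runs too long
  });
  if (result.error) {
    // Raised e.g. when the command is not installed (ENOENT) or on timeout.
    throw result.error;
  }
  if (result.status !== 0) {
    throw new Error(`${command} exited with code ${result.status}: ${result.stderr}`);
  }
  return result.stdout.trim();
}
```

This only works where a Python installation is available, so it suits servers and build scripts rather than browsers or extensions.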

vtempest commented 2 months ago

Pull request: https://github.com/adbar/trafilatura/pull/692
JS Docs: https://airesearch.wiki/classes/src_extract_content_core.Extractor.html

I have translated 35 Python files from trafilatura to JavaScript, as well as htmldate and courlan. The port removes the CLI parts and relies only on these npm packages:

    "axios": "^0.21.1",
    "xpath": "^0.0.34",
    "jsdom": "^16.6.0"

History of HTML Content Extraction:
Tractor the Text Extractor (2024) (based on this JS port)
Trafilatura (2020-)
Mozilla (2015-)
Arc90 (2010)

I know it's possible to run Python on a server and call it from JS, but it is much simpler to have a front-end-only JS function, extract(html). That is what it should have been to begin with (or even Rust compiled to WebAssembly and run from JS), so that it is easier to use from Chrome extensions and mobile apps without needing a running server.
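To make the shape of such an entry point concrete, here is a toy sketch of a dependency-free `extract(html)` function. It is not Trafilatura's algorithm and not the code from the port: it strips a few boilerplate elements with regular expressions and keeps the text of the remaining `<p>` blocks. The `minLength` option is a hypothetical addition for illustration.

```javascript
// Toy stand-in for a front-end-only extract(html) entry point.
// NOT Trafilatura's algorithm: regex-based HTML handling is naive and
// only serves to show a single pure function with no server or I/O.
function extract(html, { minLength = 25 } = {}) {
  // Drop elements that rarely hold main content, including their bodies.
  const withoutNoise = html.replace(
    /<(script|style|nav|footer|header|aside)\b[\s\S]*?<\/\1>/gi,
    ' '
  );

  const paragraphs = [];
  const pattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  let match;
  while ((match = pattern.exec(withoutNoise)) !== null) {
    // Remove inline tags, then collapse runs of whitespace.
    const text = match[1].replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
    // Keep only paragraphs long enough to plausibly be body text.
    if (text.length >= minLength) paragraphs.push(text);
  }
  return paragraphs.join('\n');
}
```

Because the function takes a string and returns a string, it can run unchanged in a browser, an extension, or Node. A real port would replace the regex steps with proper DOM parsing and the heuristics of the original library.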

Code And Docs: https://airesearch.wiki/classes/src_extract_content_core.Extractor.html

More Testing Needed

The functions are translated and the flow prints output, but it does not yet extract the main content. You're the only one in the world who knows why. Let's work together on making this happen and then build awesome LLM add-ons later.

Future Ideas Plan

adbar commented 2 months ago

Hi @vtempest, now I see, this is impressive.

However, I believe it is out of scope for the current state of the repository. I'm developing and maintaining a Python package and I lack the time to test and evaluate the code you translated into JavaScript.

In essence, you've forked the software. I recommend publishing it as a trafilatura-js package (and likewise htmldate-js etc.); see for instance the Go port, go-trafilatura.

That way the users know what it's about and get a chance to use it without importing your whole research agent software. You could also import and use your fork there. The idea is that if you split the work you may get maintainers to help you on selected parts – including me, I'll check it out if I get a chance.