Closed: vtempest closed this 1 month ago
@vtempest I don't understand: you don't mention Trafilatura in the docs or in the code. Is this a fork or an alternative?
According to extract-content.js you're using Postlight and Readability.
As a side note, if you wish to integrate Python packages in a JS environment there are tools for that.
PR Request: https://github.com/adbar/trafilatura/pull/692 JS Docs: https://airesearch.wiki/classes/src_extract_content_core.Extractor.html
I have translated 35 Python files from trafilatura to JavaScript, along with htmldate and courlan. The port removes the CLI parts and relies only on these npm packages: "axios": "^0.21.1", "xpath": "^0.0.34", "jsdom": "^16.6.0".
History of HTML content extraction: Tractor the Text Extractor (2024, based on this JS port); Trafilatura (2020-); Mozilla (2015-); Arc90 (2010).
I know it's possible to run Python on a server and call it from JS, but it is much simpler to have a front-end-only JS function extract(html). That is what it should have been to begin with (or even Rust compiled to wasm and run from JS), because it is easier to use from Chrome extensions and mobile apps without a server running.
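To illustrate the interface argued for above, here is a minimal sketch of a dependency-free extract(html) that could run in a browser extension or in Node without a server. It is an assumption-laden toy: the helper stripTags, the option names minLength and maxLinkDensity, and the regex-based text-length and link-density heuristic are all hypothetical and are not Trafilatura's algorithm or the code in this port.

```javascript
// Hypothetical sketch: a pure-string extract(html) with no npm dependencies.
// It uses a naive text-length / link-density heuristic, NOT Trafilatura's logic.

// Remove tags and collapse whitespace (helper name is an assumption).
function stripTags(fragment) {
  return fragment
    .replace(/<[^>]+>/g, " ")
    .replace(/&nbsp;/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function extract(html, { minLength = 40, maxLinkDensity = 0.5 } = {}) {
  // Drop comments and elements that rarely hold main content.
  const cleaned = html
    .replace(/<!--[\s\S]*?-->/g, " ")
    .replace(/<(script|style|nav|header|footer|aside)\b[\s\S]*?<\/\1>/gi, " ");

  // Collect paragraph-like blocks.
  const blocks =
    cleaned.match(/<(p|li|blockquote|h[1-6])\b[^>]*>[\s\S]*?<\/\1>/gi) || [];

  const kept = [];
  for (const block of blocks) {
    const text = stripTags(block);
    if (text.length < minLength) continue; // too short to be body text

    // Share of the block's text that sits inside links.
    const links = block.match(/<a\b[^>]*>[\s\S]*?<\/a>/gi) || [];
    const linkChars = links.reduce((n, a) => n + stripTags(a).length, 0);
    if (linkChars / text.length > maxLinkDensity) continue; // mostly navigation

    kept.push(text);
  }
  return kept.join("\n\n");
}
```

Because the function takes a string and returns a string, the same file can be loaded from a content script, a mobile web view, or a Node process. A real port would presumably replace the regexes with a proper DOM parser.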
Code And Docs: https://airesearch.wiki/classes/src_extract_content_core.Extractor.html
The functions are translated and the flow prints output, but it does not yet return the main content. You're the only one in the world who knows why. Let's work together on making this happen and then build awesome LLM add-ons later.
LLM candidates: overall, there are still so many false positives and edge cases that overcorrecting does more harm than good. The best bottom line is to take the top candidates for each category and prompt Groq Llama3 or another model with them.
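The "top candidates per category, then prompt an LLM" idea can be sketched as below. Everything here is hypothetical: the candidate shape { category, text, score }, the function names, and the prompt wording are assumptions for illustration, and the actual call to Groq or any other provider is deliberately left out.

```javascript
// Hypothetical sketch: keep the k best-scoring candidates per category and
// format them into a prompt. The LLM call itself is not shown.

function topCandidatesByCategory(candidates, k = 3) {
  // Group candidates by their category label.
  const grouped = new Map();
  for (const c of candidates) {
    if (!grouped.has(c.category)) grouped.set(c.category, []);
    grouped.get(c.category).push(c);
  }
  // Sort each group by descending score and keep the first k.
  const top = new Map();
  for (const [category, items] of grouped) {
    top.set(category, [...items].sort((a, b) => b.score - a.score).slice(0, k));
  }
  return top;
}

function buildPrompt(top) {
  const lines = [
    "For each category below, pick the most plausible candidate or answer 'none'.",
  ];
  for (const [category, items] of top) {
    lines.push(`${category}:`);
    items.forEach((c, i) => lines.push(`  ${i + 1}. ${c.text}`));
  }
  return lines.join("\n");
}
```

The design choice is to let heuristics narrow the field cheaply and leave only the final, ambiguous choice to the model, rather than adding more hand-written correction rules.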
Thank you very much!
Hi @vtempest, now I see, this is impressive.
However, I believe it is out of scope for the current state of the repository. I'm developing and maintaining a Python package, and I lack the time to test and evaluate the code you translated into JavaScript.
In essence you've forked the software. I recommend publishing it as a trafilatura-js package (also htmldate-js, etc.); see for instance the Go port go-trafilatura.
That way the users know what it's about and get a chance to use it without importing your whole research agent software. You could also import and use your fork there. The idea is that if you split the work you may get maintainers to help you on selected parts – including me, I'll check it out if I get a chance.
I have translated all 21 files into javascript with npm libs like
I have also translated courlan and htmldate. I am hoping to test the port and integrate it into https://airesearch.wiki/
Accuracy could also be improved with the 1000+ site-specific selectors from https://github.com/fivefilters/ftr-site-config
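As a rough sketch of how such site-specific selectors might be consumed, the snippet below parses a per-site config into a rule table. It assumes the files are plain text with one "key: value" directive per line and "#" comment lines; the function name parseSiteConfig and the sample config are hypothetical, and applying the resulting XPath expressions to a document is not shown.

```javascript
// Hypothetical sketch: parse a per-site config (assumed "key: value" lines,
// "#" comments) into { key: [values...] }. XPath evaluation is out of scope.

function parseSiteConfig(text) {
  const rules = Object.create(null); // no inherited keys such as "constructor"
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments

    // Split on the first colon only, since values (XPath, URLs) may contain colons.
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim();

    if (!rules[key]) rules[key] = [];
    rules[key].push(value); // a directive may appear several times
  }
  return rules;
}
```

An extractor could try the site's body selectors first and fall back to the generic heuristics when none of them match.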