Closed ghost closed 1 year ago
Is it possible to add something like Puppeteer to handle YouTube/Twitter content that requires JavaScript?
No. It's slow and unsafe. You can fetch the HTML yourself and pass it to extract as the argument.
Does it run in the browser, like Readability({}, document).parse()?
There are examples for you: https://github.com/ndaidong/article-parser#browser and https://github.com/ndaidong/article-parser#extract
Thanks!!
Just one more question. Can it rewrite relative image URLs to absolute ones if I pass HTML directly to extract instead of a page URL?
If it can, do I need to somehow configure the original URL of the page?
article-parser absolutifies the URLs in all <a> and <img> tags: https://github.com/ndaidong/article-parser/blob/main/src/utils/linker.js#L109-L128
article-parser will try to find the best URL itself: https://github.com/ndaidong/article-parser/blob/main/src/utils/linker.js#L16-L19
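For illustration, resolving relative href/src values against a known page URL can be sketched with the standard URL constructor. This is a minimal sketch and not the code in linker.js: absolutifyUrls and the example.com addresses are hypothetical, and the regex only handles quoted attribute values.

```typescript
// Hypothetical helper (not part of article-parser): rewrite quoted href/src
// attribute values so they are absolute relative to baseUrl.
function absolutifyUrls(html: string, baseUrl: string): string {
  // Group 1: attribute name, group 2: quote character, group 3: raw value.
  return html.replace(
    /\b(href|src)=(["'])(.*?)\2/gi,
    (match: string, attr: string, quote: string, value: string): string => {
      try {
        // new URL(relative, base) resolves "../x", "/x", "x" and keeps absolute URLs.
        const absolute = new URL(value, baseUrl).href;
        return `${attr}=${quote}${absolute}${quote}`;
      } catch {
        // Values that cannot be parsed as URLs are left untouched.
        return match;
      }
    }
  );
}

const pageUrl = "https://example.com/blog/post/";
const relativeHtml = '<a href="../about">About</a><img src="/img/cat.png">';
const absoluteHtml = absolutifyUrls(relativeHtml, pageUrl);
console.log(absoluteHtml);
```

The point is that a base URL has to come from somewhere: without it, a value like /img/cat.png cannot be resolved.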
@ndaidong Why not export extract(html, url)?
@nick008a if you call extract(html), it will automatically find the original URL from the meta tags.
As your issue shows, that is not perfect, and developers should be able to pass the original URL into the extract method manually.
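To make the "find the original URL from meta tags" idea concrete, here is a rough sketch. It is an assumption about the general approach, not the library's actual lookup: findOriginalUrl is a hypothetical name, and it only checks the og:url and twitter:url properties mentioned later in this thread.

```typescript
// Hypothetical helper: return the content of the first og:url or twitter:url
// meta tag found in the HTML string, or undefined when there is none.
function findOriginalUrl(html: string): string | undefined {
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    // Attribute order inside the tag does not matter; each is matched separately.
    const property = /\bproperty=["']([^"']*)["']/i.exec(tag)?.[1];
    const content = /\bcontent=["']([^"']*)["']/i.exec(tag)?.[1];
    if (content && (property === "og:url" || property === "twitter:url")) {
      return content;
    }
  }
  return undefined;
}

const withMeta = '<head><meta content="https://example.com/a" property="og:url"></head>';
const withoutMeta = "<head><title>No URL here</title></head>";
console.log(findOriginalUrl(withMeta), findOriginalUrl(withoutMeta));
```

When the page has no such tag the lookup comes back empty, which is the case where passing the URL manually would help.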
@SettingDust we just posted at the same time. Yes, that's the problem we need to resolve!
@SettingDust I have some ideas:
- Expose extractFromHTML as a public method.
- In the current extract method, check the input:
    if input is a URL:
        extract with the URL as normal
    else:
        treat the first argument as HTML and the third parameter (`fetchOptions`) as the original URL
What do you think?
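The check proposed above could be sketched roughly as follows. All names here (ExtractPlan, isHttpUrl, planExtract) are hypothetical, and for brevity the original URL is passed as a plain second argument rather than through the third parameter as in the proposal.

```typescript
// Hypothetical sketch: decide how a string-based extract() should treat its input.
type ExtractPlan =
  | { kind: "url"; url: string }
  | { kind: "html"; html: string; originalUrl?: string };

// True only for strings that parse as absolute http(s) URLs.
function isHttpUrl(input: string): boolean {
  try {
    const { protocol } = new URL(input.trim());
    return protocol === "http:" || protocol === "https:";
  } catch {
    // new URL throws on anything that is not an absolute URL, e.g. raw HTML.
    return false;
  }
}

function planExtract(input: string, originalUrl?: string): ExtractPlan {
  if (isHttpUrl(input)) {
    return { kind: "url", url: input.trim() };
  }
  // Anything else is treated as HTML, with the caller-supplied original URL.
  return { kind: "html", html: input, originalUrl };
}

const fromUrl = planExtract("https://example.com/article");
const fromHtml = planExtract("<html><body>Hi</body></html>", "https://example.com/article");
console.log(fromUrl.kind, fromHtml.kind);
```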
How about overloads? These should go in index.d.ts. The JavaScript would still have to check the input with a condition like 'url' in input.
And I think maybe this should be a new issue?
export function extract(input: { url: string }, options: ParserOptions)
export function extract(input: { html: string }, options: ParserOptions)
export function extract(input: { url: string, html: string }, options: ParserOptions)
// @deprecated: the old logic
export function extract(input: string, options: ParserOptions)
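A rough sketch of the runtime side of these overloads, using the 'url' in input check mentioned above. The names ExtractInput, NormalizedInput and normalizeInput are hypothetical, and what the deprecated string form should do is left open here.

```typescript
// Hypothetical sketch: collapse the overloaded input shapes into one object.
type ExtractInput =
  | string
  | { url: string }
  | { html: string }
  | { url: string; html: string };

interface NormalizedInput {
  url?: string;
  html?: string;
  legacy?: string; // the deprecated plain-string form, passed through untouched
}

function normalizeInput(input: ExtractInput): NormalizedInput {
  if (typeof input === "string") {
    return { legacy: input };
  }
  // The `in` operator also narrows the union type for the compiler.
  const url = "url" in input ? input.url : undefined;
  const html = "html" in input ? input.html : undefined;
  return { url, html };
}

const onlyHtml = normalizeInput({ html: "<p>Hello</p>" });
const urlAndHtml = normalizeInput({ url: "https://example.com/a", html: "<p>Hello</p>" });
const legacyInput = normalizeInput("https://example.com/a");
console.log(onlyHtml, urlAndHtml, legacyInput);
```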
@SettingDust that's a good idea, except that changing the API means we may need to move to a new major version.
@SettingDust if we start a new major version, I suggest:
export function extract(input: {
  url: string,
  html: string,
  parserOptions: ParserOptions,
  fetchOptions: FetchOptions
})
So it would be more flexible for future updates.
How about:
export function extract(options: {
  url: string,
  html: string,
  parser: ParserOptions,
  fetch: FetchOptions
})
This may also be helpful: https://github.com/SettingDust/article-extractor/tree/main/src. I like TypeScript. Good point.
@SettingDust yes, I'm moving to Deno.
URL imports are experimental in Node. Maybe it will work on both platforms? I prefer Deno.
@SettingDust still learning, I see many libs are being built for both platforms.
@SettingDust you are developing a completely new variant. Why don't we simply create an organization together and move this lib there? With more than 700 stars, this lib should not belong to an individual developer. It should be @organization/article-parser.
It's fine
@SettingDust any idea about organization name?
article-parser/article-parser? lul
@SettingDust it should cover more tools. I would like to move all its siblings there too: feed-reader and oembed-parser, plus a few more extractor tools in the future: meta extractor, images extractor, price extractor...
In my own lib, I make it accept custom extractors to extract different metadata, and images and prices should be extractable with a custom extractor too. So, maybe "meta-extractor/meta-extractor"?
@SettingDust how about something strange, meaningless, Latin-style, like extractorius or extractus?
This organization will contain repos named x-extractor, and people can install them as @extractus/x-extractor.
@SettingDust I've added you as an owner, and I'm testing the transfer function.
Regarding "if we start a new major version": is there any workaround in the current version? Like setting entry.url somewhere?
@nick008a if the page HTML does not contain URL meta tags, you may need to add them manually (using DOM manipulation).
It expects one of these lines:
<meta property="og:url" content="https://orginal-url.com/category/article/slug">
<meta property="twitter:url" content="https://orginal-url.com/category/article/slug">
Regarding "(using DOM manipulation)": does it matter where I insert them? Can I simply concatenate the string?
@nick008a of course, they must be placed inside <head></head>. They are regular meta tags, as you often see on SEO-friendly websites.
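If you only have the HTML as a string, the workaround could look roughly like this. It is a sketch under assumptions: injectOgUrl and escapeAttribute are hypothetical helpers, and the fallback for documents without a <head> tag is a guess that the parser will tolerate it.

```typescript
// Hypothetical helper: escape characters that would break a quoted attribute.
function escapeAttribute(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Hypothetical helper: insert an og:url meta tag right after the opening <head>.
function injectOgUrl(html: string, pageUrl: string): string {
  const meta = `<meta property="og:url" content="${escapeAttribute(pageUrl)}">`;
  // Matches <head> or <head ...attributes>, but not <header>.
  const headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(html)) {
    // A replacer function avoids special handling of "$" in the inserted text.
    return html.replace(headOpen, (open: string) => open + meta);
  }
  // Assumption: with no <head> at all, prepend a minimal one.
  return `<head>${meta}</head>${html}`;
}

const rawHtml = "<html><head><title>Post</title></head><body><img src=\"a.png\"></body></html>";
const patchedHtml = injectOgUrl(rawHtml, "https://example.com/blog/post");
console.log(patchedHtml);
```

The patched string can then be passed to extract in place of the original HTML.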
Regarding "Is it possible to add something like Puppeteer to handle YouTube/Twitter content that requires JavaScript?":
FYI: https://gist.github.com/MrOrz/fb48f27f0f21846d0df521728fda19ce