extractus / article-extractor

To extract main article from given URL with Node.js
https://extractor-demos.pages.dev/article-extractor
MIT License

Rewrite relative image URLs to absolute ones #321

Closed by ghost 1 year ago

ghost commented 1 year ago

Is it possible to add something like puppeteer to handle Youtube/Twitter contents that requires Javascript?

FYI: https://gist.github.com/MrOrz/fb48f27f0f21846d0df521728fda19ce

SettingDust commented 1 year ago

> Is it possible to add something like puppeteer to handle Youtube/Twitter contents that requires Javascript?
>
> FYI: gist.github.com/MrOrz/fb48f27f0f21846d0df521728fda19ce

No. It's slow and unsafe. You can fetch the HTML yourself and pass it to extract as the argument.

ghost commented 1 year ago

Does it run in the browser like Readability({}, document).parse();?

SettingDust commented 1 year ago

> Does it run in the browser like Readability({}, document).parse();?

There are examples for you: https://github.com/ndaidong/article-parser#browser and https://github.com/ndaidong/article-parser#extract

ghost commented 1 year ago

Thanks!!

Just one more question. Can it rewrite relative image URLs to absolute ones if I pass HTML directly to extract instead of a page URL?

If it can, do I need to somehow configure the original URL of the page?

SettingDust commented 1 year ago

article-parser absolutifies the URLs in all <a> and <img> tags: https://github.com/ndaidong/article-parser/blob/main/src/utils/linker.js#L109-L128

article-parser will try to work out the best URL to use as the base: https://github.com/ndaidong/article-parser/blob/main/src/utils/linker.js#L16-L19
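
As a rough illustration of what "absolutify" means here, relative href/src values can be resolved against the page URL with the standard `URL` constructor. This is a minimal sketch and not the code in linker.js: the name `absolutifyUrls` and the regex-based attribute matching are assumptions made to keep it short, and a real implementation would more likely walk a parsed DOM.

```javascript
// Hypothetical helper: resolve relative href/src attributes against a base URL.
// The regex only handles double-quoted attributes on <a> and <img> tags.
function absolutifyUrls(html, baseUrl) {
  const pattern = /(<(?:a|img)\b[^>]*?\s(?:href|src)=")([^"]*)(")/gi;
  return html.replace(pattern, (match, before, value, after) => {
    try {
      // new URL(relative, base) performs standard URL resolution.
      return before + new URL(value, baseUrl).href + after;
    } catch {
      // Leave values that cannot be resolved untouched.
      return match;
    }
  });
}

const examplePage = 'https://example.com/blog/post/';
const exampleHtml = '<p><a href="../about">About</a><img src="/img/a.png"></p>';
console.log(absolutifyUrls(exampleHtml, examplePage));
```

Values that are already absolute pass through `new URL` unchanged, so running the helper on mixed content should be harmless.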

@ndaidong Why isn't extract(html, url) exported?

ndaidong commented 1 year ago

@nick008a if you call extract(html), it will automatically find the original URL from the meta tags.

As your issue shows, that's not perfect, and developers should be able to pass the original URL into the extract method manually.

ndaidong commented 1 year ago

@SettingDust we just posted at the same time. Yes, that's the problem we need to resolve!

ndaidong commented 1 year ago

@SettingDust I have some ideas:

  1. Expose extractFromHTML as a public method

  2. In the current extract method, check the input:

if input is URL:
  extract with URL as normal
else:
  treat first argument as HTML, third parameter (`fetchOptions`) as original URL

What do you think?
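
The check in idea 2 could look roughly like the sketch below. `isHttpUrl` and `planExtraction` are hypothetical names, and the rule "a URL is anything that parses with an http or https scheme" is an assumption, not necessarily how the library decides. The function only reports which branch would run.

```javascript
// Hypothetical check: accept the input as a URL only if it parses as http(s).
function isHttpUrl(input) {
  try {
    const { protocol } = new URL(input);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false; // new URL() throws on strings that are not absolute URLs
  }
}

// Sketch of idea 2. It only describes the chosen branch; it does not
// fetch or parse anything.
function planExtraction(input, parserOptions = {}, third = {}) {
  if (isHttpUrl(input)) {
    return { mode: 'url', url: input, parserOptions, fetchOptions: third };
  }
  // Not a URL: treat the first argument as HTML and the third as the original URL.
  const originalUrl = typeof third === 'string' ? third : undefined;
  return { mode: 'html', html: input, parserOptions, originalUrl };
}
```

One cost of this design is that the third parameter changes meaning depending on the first, which is easy to misuse.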

SettingDust commented 1 year ago

How about overloads? These would go in index.d.ts. The JavaScript would still have to check the input with a condition like 'url' in input. And I think maybe this should be a new issue?

export function extract(input: { url: string }, options: ParserOptions)
export function extract(input: { html: string }, options: ParserOptions)
export function extract(input: { url: string, html: string }, options: ParserOptions)
// @deprecated: the old logic
export function extract(input: string, options: ParserOptions)
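
On the JavaScript side, the `'url' in input` check mentioned above might be sketched as follows. `normalizeInput` is a hypothetical helper, and the returned shape (`url`, `html`, `legacy`) is an assumption for illustration only.

```javascript
// Sketch of the runtime side of the proposed overloads.
// normalizeInput is a hypothetical helper, not part of the library.
function normalizeInput(input) {
  if (typeof input === 'string') {
    // Deprecated form: a bare string, left for the old logic to interpret.
    return { url: undefined, html: undefined, legacy: input };
  }
  if (input === null || typeof input !== 'object') {
    throw new TypeError('input must be a string or an object');
  }
  const hasUrl = 'url' in input;
  const hasHtml = 'html' in input;
  if (!hasUrl && !hasHtml) {
    throw new TypeError('input needs a "url" or an "html" property');
  }
  return {
    url: hasUrl ? input.url : undefined,
    html: hasHtml ? input.html : undefined,
    legacy: undefined,
  };
}
```

With both `url` and `html` present, a caller would presumably parse the given HTML and use `url` as the base for relative links.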

ndaidong commented 1 year ago

@SettingDust that's a good idea, except that changing the API means we may need to bump to a new major version.

ndaidong commented 1 year ago

@SettingDust if we start a new major version, I suggest:

export function extract(input: { 
  url: string, 
  html: string, 
  parserOptions: ParserOptions, 
  fetchOptions: FetchOptions
})

So it would be more flexible for future updates.

SettingDust commented 1 year ago

How about

export function extract(options: { 
  url: string, 
  html: string, 
  parser: ParserOptions, 
  fetch: FetchOptions
})

And this may be helpful: https://github.com/SettingDust/article-extractor/tree/main/src. I like TypeScript. Good point.

ndaidong commented 1 year ago

@SettingDust yes, I'm moving to Deno.

SettingDust commented 1 year ago

URL imports are experimental in Node. Maybe it will work on both platforms? I prefer Deno.

ndaidong commented 1 year ago

@SettingDust still learning, I see many libs are being built for both platforms.

ndaidong commented 1 year ago

@SettingDust you are developing a completely new variant. Why don't we simply create an organization together and move this lib there? With more than 700 stars, this lib should not belong to an individual developer. It should be @organization/article-parser.

SettingDust commented 1 year ago

It's fine

ndaidong commented 1 year ago

@SettingDust any idea about organization name?

SettingDust commented 1 year ago

> @SettingDust any idea about organization name?

article-parser/article-parser? lul

ndaidong commented 1 year ago

@SettingDust it should cover more tools. I would like to move all its siblings there too: feed-reader, oembed-parser. And a few more extractor tools in the future: meta extractor, images extractor, price extractor...

SettingDust commented 1 year ago

In my own lib, I make it accept custom extractors to extract different metas. Images and prices should be extractable with a custom extractor too. So, maybe "meta-extractor/meta-extractor"?

ndaidong commented 1 year ago

@SettingDust how about something strange, meaningless, Latin-style like extractorius or extractus? The organization will contain repos named x-extractor, and people can install them as @extractus/x-extractor.

ndaidong commented 1 year ago

@SettingDust I've added you as an owner, and I'm testing the transfer function.

https://github.com/extractus

ghost commented 1 year ago

> if we start a new major version

Is there any workaround in the current version? Like, setting entry.url somewhere?

ndaidong commented 1 year ago

@nick008a if the page HTML does not contain URL meta tags, you may need to add them manually (using DOM manipulation).

It expects one of these lines:

<meta property="og:url" content="https://orginal-url.com/category/article/slug">
<meta property="twitter:url" content="https://orginal-url.com/category/article/slug">

ghost commented 1 year ago

> (using DOM manipulation)

Does it matter where I insert them? Can I simply concatenate the string?

ndaidong commented 1 year ago

@nick008a of course, they must be placed inside <head></head>. They are regular meta tags, as you often see on SEO-friendly websites.
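
For the string-concatenation route, one possible sketch is to insert the tag right after the opening <head>. `injectOgUrl` is a hypothetical helper, and matching <head> with a regex is a simplification that assumes reasonably well-formed HTML.

```javascript
// Hypothetical helper: add an og:url meta tag right after the opening <head>.
function injectOgUrl(html, originalUrl) {
  // Escape characters that would break the attribute value.
  const safeUrl = originalUrl.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  const meta = `<meta property="og:url" content="${safeUrl}">`;
  const headOpen = /<head\b[^>]*>/i;
  if (!headOpen.test(html)) {
    // No <head> found: return the input unchanged rather than guess.
    return html;
  }
  // A replacer function avoids special handling of "$" in the URL.
  return html.replace(headOpen, (tag) => tag + meta);
}
```

The resulting string can then be passed to extract in place of the original HTML, so that the URL lookup from meta tags described above has something to find.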
