Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

ExternalTransformer #845

Closed giannisni closed 1 year ago

giannisni commented 1 year ago

Hello! I want to extract the text from news articles. By default, Norconex does this and puts the result in the content field, but it also picks up other HTML elements such as categories, time, or related news. My thought was to use an external library like newspaper3k from a Python script that I call through ExternalTransformer, which would then store the text in a new field. Is that good practice? Please help me with the ExternalTransformer configuration, as I am really confused. I have already implemented the Python script, which takes a URL as an argument and extracts the text.

ohtwadi commented 1 year ago

You don't need to use an external library for this. The built-in KeepOnlyTagger will allow you to keep only the metadata you want, discarding everything else.

giannisni commented 1 year ago

Thank you, but I meant strictly the "content" field. I need to extract only the text of an article. The content, though, contains many more elements. Libraries such as newspaper3k or boilerpipe, for example, extract solely the article text.

ohtwadi commented 1 year ago

You can force the crawler to extract content from specific parts of the DOM with DOMDeleteTransformer, DOMPreserveTransformer, or a combination of both.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

giannisni commented 11 months ago

Can you please help me use ExternalTransformer instead? I want the crawler to run a Python script each time, with the newspaper library in it. Norconex will pass just the document reference, and the script will extract the correct content from the article. It will then return the content to the crawler to be indexed. I am really confused by the documentation.

ohtwadi commented 11 months ago

This blog post might help:

https://norconex.com/a-no-code-solution-for-extending-norconex-file-system-crawler/
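
For readers following the workflow described earlier in the thread, here is a minimal sketch of what the Python side might look like. It is not taken from the blog post or the Norconex documentation. It assumes ExternalTransformer is configured to call the script with three arguments in this order: a file holding the fetched HTML, a file where the replacement content should be written, and the document reference (URL). The file name extract_article.py, the argument order, and the helper names are all hypothetical. Rather than downloading the URL a second time, the sketch feeds the HTML the crawler already fetched to newspaper3k; that is a substitution for the URL-only approach mentioned above, and it assumes newspaper3k is installed.

```python
import sys
from pathlib import Path


def extract_article_text(html: str, url: str) -> str:
    """Return the main article text using newspaper3k."""
    # Imported lazily so the file-handling code below can be exercised
    # even on a machine where newspaper3k is not installed.
    from newspaper import Article

    article = Article(url)
    # Reuse the HTML the crawler already fetched instead of re-downloading.
    article.download(input_html=html)
    article.parse()
    return article.text


def transform(input_path, output_path, url, extractor=extract_article_text):
    """Read HTML from input_path, write the extracted text to output_path.

    Returns the number of characters written.
    """
    html = Path(input_path).read_text(encoding="utf-8", errors="replace")
    text = extractor(html, url)
    Path(output_path).write_text(text, encoding="utf-8")
    return len(text)


def main(argv):
    # Hypothetical calling convention: INPUT OUTPUT REFERENCE.
    if len(argv) != 4:
        print("usage: extract_article.py INPUT OUTPUT REFERENCE", file=sys.stderr)
        return 2
    transform(argv[1], argv[2], argv[3])
    return 0


# Run only when arguments are supplied (for example, by the crawler).
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The extractor is passed in as a parameter so that a different library (boilerpipe bindings, for instance) can be swapped in without touching the file handling. How the three arguments map onto the ExternalTransformer command is covered by the Norconex documentation and the blog post linked above.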