Closed giannisni closed 1 year ago
You don't need to use an external library for this. The built-in KeepOnlyTagger lets you keep only the metadata you want and discard everything else.
Thank you, but I meant strictly the "content" field. I need to extract only the text of an article. The content field, though, contains many other elements. For example, newspaper3k or boilerpipe extract solely the article text.
You can force the crawler to extract content from specific parts of the DOM with DOMDeleteTransformer, DOMPreserveTransformer, or a combination of both.
Can you please help me use ExternalTransformer instead? I want the crawler to run a Python script each time, with the newspaper library in it. Norconex would pass just the document reference, and the script would extract the correct content from the article and return it to the crawler to be indexed. I am really confused by the documentation.
This blog post might help
https://norconex.com/a-no-code-solution-for-extending-norconex-file-system-crawler/
Hello! I want to extract the text from news articles. By default, Norconex does this and puts the result in the content field, but it also picks up other HTML elements such as categories, timestamps, or related news. My idea is to use an external library like newspaper3k from a Python script, called through ExternalTransformer, which then stores the text in a new field. Is that good practice? Please help me with the ExternalTransformer configuration, as I am really confused. I have already implemented the Python script, which takes a URL as a command-line argument and extracts the text.
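For illustration, here is a minimal sketch of what such a script could look like. It is not taken from the Norconex documentation and rests on some assumptions: the crawler is configured to call the script with two arguments, an input file holding the fetched HTML and an output file for the extracted text (check the ExternalTransformer documentation for how these are actually passed), and newspaper3k is installed so that its `fulltext` helper is available. The names `ParagraphText`, `fallback_text`, and `extract_text` are hypothetical. If newspaper3k is missing, the sketch falls back to a much cruder standard-library approach that keeps only the text inside `<p>` elements.

```python
import sys
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Crude fallback: collect only the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while reading inside a <p>
        self.chunks = []      # text pieces of the current paragraph
        self.paragraphs = []  # finished, whitespace-normalized paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()      # an unclosed <p> ends where the next one starts
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

    def flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self.chunks).split())
        if text:
            self.paragraphs.append(text)
        self.chunks = []


def fallback_text(html):
    """Return paragraph text only, with paragraphs separated by blank lines."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.paragraphs)


def extract_text(html):
    """Use newspaper3k when available, otherwise the paragraph fallback."""
    try:
        from newspaper import fulltext  # assumption: newspaper3k is installed
    except ImportError:
        return fallback_text(html)
    return fulltext(html)


# Assumed invocation: python extract_article.py <input_html> <output_text>
if __name__ == "__main__" and len(sys.argv) == 3:
    in_path, out_path = sys.argv[1], sys.argv[2]
    with open(in_path, encoding="utf-8", errors="replace") as src:
        html = src.read()
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(extract_text(html))
```

Working on the HTML the crawler has already fetched, rather than on the URL, avoids downloading each page a second time. If the script must keep taking a URL, newspaper3k's `Article(url)` with `download()` and `parse()` would be used instead of `fulltext`.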