Norconex / committer-core

Norconex Committer is a Java library and command-line application used to route content to local or remote target repositories, such as a search engine index.
http://www.norconex.com/collectors/committer-core
Apache License 2.0

How to remove navigation menus from HTTP source #21

Closed butchersoft1 closed 4 years ago

butchersoft1 commented 4 years ago

Hello,

I have added a document pre-parser to remove any content that appears between

but cannot get the function to work as a regex. If I add "Skip" "content" then it will "Skip this menu between the content" fine.

Using the latest version of HTTP Collector.

essiembre commented 4 years ago

By default, the regex asterisk is "greedy": it matches as much text as it can. Try with <nav.*?> instead. I recommend you use an online regular expression tool for testing. If you suspect the problem is something else, feel free to reopen.
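To illustrate the greedy versus lazy difference, here is a minimal sketch in plain Java using `java.util.regex`. The full patterns, the sample HTML, and the class name `GreedyVsLazy` are assumptions made up for this example; they are not taken from the original configuration, and the sketch does not use any Norconex API.

```java
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // DOTALL lets '.' match line breaks, since menus usually span several lines.
    // Greedy: '.*' grabs as much as possible, so it runs to the LAST </nav>.
    static final Pattern GREEDY = Pattern.compile("<nav.*>.*</nav>", Pattern.DOTALL);
    // Lazy: '.*?' grabs as little as possible, so each match stops at the nearest </nav>.
    static final Pattern LAZY = Pattern.compile("<nav.*?>.*?</nav>", Pattern.DOTALL);

    // Removes every match of the given pattern from the input.
    static String strip(Pattern pattern, String html) {
        return pattern.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical page with two menus around the real content.
        String html = "<nav>Top menu</nav><p>Body text</p><nav>Footer menu</nav>";
        System.out.println("greedy: [" + strip(GREEDY, html) + "]");
        System.out.println("lazy:   [" + strip(LAZY, html) + "]");
    }
}
```

With the greedy pattern, the single match spans from the first `<nav` to the last `</nav>`, so the body text between the two menus is removed as well. The lazy pattern removes each menu separately and leaves the body in place.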

butchersoft1 commented 4 years ago

Thanks, I'll give it a go, but I have tried many different regex patterns and none have worked (where in the past they used to). I reverted to adding text tags before and after every section I need removed and just going with a flat text extract between <!--ignore-start> and <!--ignore-end>.

I'm trying to remove repeated header, footer and menu sections. If you have some advice on best practice, the help would be appreciated.
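The marker-based workaround described above can be sketched in plain Java without any regex. The class name `IgnoreMarkers`, the method `stripIgnored`, and the handling of an unclosed marker are assumptions for illustration only; this is not how the collector implements it.

```java
public class IgnoreMarkers {
    // Marker strings as written in the comment above.
    static final String START = "<!--ignore-start>";
    static final String END = "<!--ignore-end>";

    // Returns the text with every START...END section (markers included) removed.
    static String stripIgnored(String text) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = text.indexOf(START, pos);
            if (start < 0) {
                // No more markers: keep the rest of the text.
                out.append(text, pos, text.length());
                break;
            }
            // Keep everything before the start marker.
            out.append(text, pos, start);
            int end = text.indexOf(END, start + START.length());
            if (end < 0) {
                // Unclosed marker: this sketch drops the remainder (an assumption).
                break;
            }
            pos = end + END.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "Header<!--ignore-start>menu<!--ignore-end>Body";
        System.out.println(stripIgnored(page));
    }
}
```

Plain string search avoids the greedy-matching pitfall altogether, at the cost of having to add the markers to every page template.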

essiembre commented 4 years ago

You can check how adding ? matches just what you want to strip here: https://regex101.com/r/94oXsw/2

If you suspect an issue, please re-open with an HTML sample to reproduce.