Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Config for parsing email addresses #306

Closed BluflameSec closed 7 years ago

BluflameSec commented 8 years ago

Hey guys, could you provide me with a config for filtering out all content except email addresses? I can't seem to figure this one out. Thanks!

essiembre commented 8 years ago

Your best option is to use an Importer handler. For instance, the TextPatternTagger lets you extract text matching a regular expression and store the matches in a field of your choice. Like this:

<importer>
    <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
            <pattern field="myEmailField">
                (email regex pattern here)
            </pattern>
        </tagger>
    </postParseHandlers>
</importer>

If it is really the content you want to modify, you have to rely on transformers, like ReplaceTransformer.

BluflameSec commented 8 years ago

Will this output the emails to a file? I'm not using an importer and am just storing data on the local system.

BluflameSec commented 8 years ago

And which regex pattern should I use? I tried using one, but it's giving me an error.

essiembre commented 8 years ago

What pattern have you tried? There are many regex patterns you can use to match emails, depending on how precise you want the matching to be. A quick web search will list a few. A simple one is \S+@\S+\.\S+, but it is a bit too permissive and may also match invalid emails.
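As a rough sketch, the simple pattern could be dropped into the TextPatternTagger configuration from the earlier example like this. The field name myEmailField is carried over from that example, and the stricter pattern shown in the comment is only an illustrative assumption, not a complete email validator:

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
    <!-- Permissive pattern from this thread; may also match invalid emails. -->
    <pattern field="myEmailField">\S+@\S+\.\S+</pattern>
    <!-- Hypothetical stricter alternative (an assumption, adjust to your needs):
         [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,} -->
</tagger>
```

Neither pattern contains characters that need XML escaping. A pattern containing < or & would have to be escaped or wrapped in a CDATA section.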

You have to use a Committer to store crawled documents where and how you want. There is a list here. Out of the box, you can use the FileSystemCommitter, which will store everything as files. Each document will have a few files created, one containing the text content and another containing all extracted fields (in Java Properties file format). This may not suit your needs perfectly. If so, you are encouraged to write your own Committer that does exactly what you want.
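For illustration, a FileSystemCommitter entry in the crawler configuration might look roughly like the following. The directory path is a made-up placeholder, and the exact options can differ between Committer versions, so treat this as a sketch rather than a definitive setup:

```xml
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
    <!-- Hypothetical output location; replace with a directory of your choice. -->
    <directory>/path/to/committed-files</directory>
</committer>
```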

BluflameSec commented 8 years ago

All I need it to do is grab the emails and dump them into the crawled files section. Would that be too hard to do?

essiembre commented 8 years ago

If you do not mind having to go through many crawled files, no, that should not be too hard. You can use the configuration approach I mentioned earlier (TextPatternTagger) with the FileSystemCommitter. Then look at all generated files ending with .meta. They will contain the field "myEmailField" from my previous example, holding the matched emails.

If you want the .meta file to contain only that "myEmailField" field, you can add this right after the TextPatternTagger configuration:

  <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>myEmailField</fields>
  </tagger>
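
Put together, the importer section might end up looking something like this. It is a sketch assembled from the snippets in this thread, with the simple permissive regex filled in as an assumption. The KeepOnlyTagger is placed after the TextPatternTagger so that the field exists by the time the other fields are dropped:

```xml
<importer>
    <postParseHandlers>
        <!-- Step 1: copy every regex match into "myEmailField". -->
        <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
            <!-- Assumed pattern: the permissive one suggested earlier. -->
            <pattern field="myEmailField">\S+@\S+\.\S+</pattern>
        </tagger>
        <!-- Step 2: discard all metadata fields except "myEmailField". -->
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>myEmailField</fields>
        </tagger>
    </postParseHandlers>
</importer>
```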