BluflameSec closed this issue 7 years ago
Your best option is to use an Importer handler. For instance, the TextPatternTagger lets you extract text matching a regular expression and store the matches in a field of your choice, like this:
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <pattern field="myEmailField">
        (email regex pattern here)
      </pattern>
    </tagger>
  </postParseHandlers>
</importer>
If it is really the content you want to modify, you have to rely on transformers, like ReplaceTransformer.
Will this output the emails to a file? I'm not using an importer; I'm just storing data on the local system.
And which regex pattern should I use? I tried one, but it's giving me an error.
What pattern have you tried? There are many regex patterns you can use to match emails, depending on how precise you want the matching to be. A quick web search will list a few. A simple one is \S+@\S+\.\S+ but it is a bit too permissive and may also match invalid emails.
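As a rough sketch of a somewhat stricter alternative, the tagger below restricts the characters allowed on each side of the @ sign. The pattern itself is an assumption chosen for illustration and is not a complete email validator; it will still reject some valid addresses and accept some invalid ones. The regex is kept on a single line so that no surrounding whitespace ends up in the expression.

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
  <!-- Assumption: illustrative pattern only, stricter than \S+@\S+\.\S+
       but not an RFC-complete email matcher. -->
  <pattern field="myEmailField">[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}</pattern>
</tagger>
```

If a pattern triggers a configuration error, it is worth checking first that it contains no unescaped XML characters such as < or &.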
You have to use a Committer to store crawled documents where and how you want. Several Committer implementations are available. Out of the box, you can use the FileSystemCommitter, which will store everything as files. Each document will have a few files created: one containing the text content and another containing all extracted fields (in Java Properties file format). This may not suit your needs perfectly. If so, you are encouraged to write your own Committer that does exactly what you want.
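A minimal sketch of a FileSystemCommitter entry follows. It assumes the class name and the <directory> option from the Committer Core version in use at the time of this thread, so check them against the documentation for your version. The output path is a hypothetical placeholder.

```xml
<!-- Goes in the crawler configuration, alongside the <importer> section. -->
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <!-- Hypothetical output location; use any writable directory. -->
  <directory>./crawled-files</directory>
</committer>
```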
All I need it to do is grab the emails and dump them into the crawled files section. Would that be too hard to do?
If you do not mind having to go through many crawled files, no, that should not be too hard. You can use the configuration approach I mentioned earlier (TextPatternTagger) with the FileSystemCommitter. Then look at all generated files ending with .meta. They will contain the field "myEmailField" from my previous example, with the email.
If you want only that "myEmailField" in the .meta file, you can add this right after the TextPatternTagger configuration:
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
  <fields>myEmailField</fields>
</tagger>
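Putting the pieces from this thread together, the importer section might look like the sketch below. It uses the simple, permissive regex suggested earlier, which you may want to replace with a stricter one. The order matters: the KeepOnlyTagger comes after the TextPatternTagger so that "myEmailField" already exists when the other fields are dropped.

```xml
<importer>
  <postParseHandlers>
    <!-- Step 1: copy every email-like match into myEmailField. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
      <pattern field="myEmailField">\S+@\S+\.\S+</pattern>
    </tagger>
    <!-- Step 2: discard every other field so the .meta file holds only the emails. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>myEmailField</fields>
    </tagger>
  </postParseHandlers>
</importer>
```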
Hey guys, could you provide me with a config for filtering out all content except email addresses? I can't seem to figure this one out. Thanks!