Closed akshaybijawe closed 7 years ago
The separator you see is part of the internal storage format and shows that all values were extracted properly in multi-value fields.
Is your goal to create fields with their name matching the value of info-title and their value matching the next info-text (a different field for each pair)? I am afraid the DOMTagger can't do that right now.
What you can do is use or create a specific Committer suited to what you want to do with the data. The values would come as arrays and you could rely on the position of each item to match fields and values.
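To illustrate the position-based matching described above, here is a minimal sketch in plain Java. It assumes the extracted titles and texts arrive as two lists in the same order; the class name `PairByPosition`, the method `pair`, and the sample values are hypothetical and not part of the Norconex API. How the two lists are obtained inside an actual Committer or tagger is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PairByPosition {

    // Pairs titles.get(i) with texts.get(i). Items without a counterpart
    // on the other side are ignored; a repeated title keeps its last value.
    public static Map<String, String> pair(List<String> titles, List<String> texts) {
        Map<String, String> fields = new LinkedHashMap<>();
        int n = Math.min(titles.size(), texts.size());
        for (int i = 0; i < n; i++) {
            fields.put(titles.get(i).trim(), texts.get(i).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical multi-value fields, as they might come out of the DOMTagger.
        List<String> titles = Arrays.asList("Color", "Weight");
        List<String> texts = Arrays.asList("Red", "2 kg");
        System.out.println(pair(titles, texts));
    }
}
```

This only works if both multi-value fields keep the page order and every title has exactly one text; a missing value on the page would shift all later pairs.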
Since you are already using the HTTP Collector with Java, it may be easier to write your own ICommitter or your own IDocumentTagger to extract values and add fields exactly how you want them.
If you have to do it through configuration, you can look at using the ScriptTagger for more flexibility.
We can also turn this ticket into a feature request if you want to have the DOMTagger (or new tagger) handle cases like yours.
Alternatively, if the info-title values are always the same on each page, then you can use something like TextPatternTagger with multiple patterns, one for each type of pair you want, hardcoding the target field names.
Make sense?
Hi Pascal, thank you for the detailed explanation. Yes, I would like to create fields with their name matching the value of info-title and their value matching info-text. Apart from these, there are other fields on that page (some in the form of table tags, div, etc.) which I would also like to extract. Also, I see that the .cntnt file for crawled pages includes all of the content from the page, including the ones from the tag. I don't know if it would make more sense to just parse that .cntnt file or to write my own ICommitter or IDocumentTagger as you mentioned above. It would be great if this could be turned into a feature. In the meantime, I will explore the options that you suggested. Thank you again.
Marking this as a feature request to be able to extract both field names and values from DOM and/or patterns.
Thank you, Pascal.
Since this feature request belongs to the Importer module, I am closing this in favor of one I created there: https://github.com/Norconex/importer/issues/52
meta.txt
config.txt
Hi Pascal, I have a question about this DOMTagger implementation. This is the link that I want to crawl. This is a snippet from the HTML:
This is my config:
In the .meta file, I get the following output. Below is a snippet of the .meta file.
Now my question is, how do I extract these fields separately? Also, may I know how to implement this in Java? I have tried this:
Do we need to configure this dom in HTTPCollectorConfig or HTTPCrawlerConfig? Thank you.