Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Feature request for com.norconex.importer.handler.tagger.impl.DOMTagger #26

Closed liar666 closed 8 years ago

liar666 commented 8 years ago

Hi again,

Description of the problem

I'm trying to extract the people from pages like http://finder.startupnationcentral.org/c/polymertal. For this purpose, I'm using the following configuration:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="h2[class~=section__title]:containsOwn(The Team)+div[class~=section__content]>div[class~=company-team] div[class~=company-team__photo-wrapper]>img" toField="MEMBER-IMAGE"
         overwrite="true"
         extract="attr(src)" />
    <dom selector="h2[class~=section__title]:containsOwn(The Team)+div[class~=section__content]>div[class~=company-team]>*[class~=company-team__member]>div[class~=company-team__info]>div[class~=company-team__name]" toField="MEMBER-NAME"
         overwrite="true"
         extract="text" />
    <dom selector="h2[class~=section__title]:containsOwn(The Team)+div[class~=section__content] div[class~=company-team]>*[class~=company-team__member]>div[class~=company-team__info]>div[class~=company-team__position]" toField="MEMBER-POSITION"
         overwrite="true"
         extract="text" />
</tagger>

Then, I developed a committer that reconstructs a list of 3 "Person" objects (each one having 3 fields: name, position, image) based on the IMAGE, NAME, and POSITION lists, which each have length 3. The problem is that on some pages, the IMAGE or POSITION might be missing. As a consequence, the IMAGE or POSITION lists might contain fewer values than the NAME list. In such cases, I'm unable to reconstruct the "Person" objects, since I don't know which value in the NAME list corresponds to which value in the IMAGE and POSITION lists.

Feature request

As a consequence, I would love to be able to tell some of the DOMTagger entries to return an empty string when their selector does not match. This is the simplest solution I can think of to ensure that I always have the correct number of values in each list.

PS: If you have any idea on how I can better solve this problem, I'm all ears.

BTW, I just realized I'm using "overwrite=true" everywhere and that I still get multiple values, so there might be a bug here :)

essiembre commented 8 years ago

This is not possible given the way DOM elements are matched. The tagger does not first match a parent tag, apply your 3 DOM directives to it (where we would insert the blanks you want), and then move on to the next parent. Each <dom> entry is independent of the others, so each one matches what it can, without knowing how many matches you think there should be in total based on some other criteria. If one entry has only 3 matches when it should have 4, it has no way of knowing there should have been 4.
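To illustrate the difference described above, here is a minimal sketch in plain Java (standard library only). It is not how DOMTagger is implemented: the markup in SAMPLE, the class PerMemberExtractionSketch, and its helper methods are all hypothetical, and the snippet is parsed as well-formed XML rather than real-world HTML. It contrasts matching each selector independently with matching the parent "member" element first and filling in a blank when a child is missing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PerMemberExtractionSketch {

    // Hypothetical, simplified stand-in for the team markup (well-formed XML).
    // The second member has no position element.
    static final String SAMPLE =
          "<div class='company-team'>"
        + "<div class='company-team__member'>"
        + "<div class='company-team__name'>Alice</div>"
        + "<div class='company-team__position'>CEO</div>"
        + "</div>"
        + "<div class='company-team__member'>"
        + "<div class='company-team__name'>Bob</div>"
        + "</div>"
        + "<div class='company-team__member'>"
        + "<div class='company-team__name'>Carol</div>"
        + "<div class='company-team__position'>CTO</div>"
        + "</div>"
        + "</div>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // All descendant <div> elements of root whose class attribute equals cssClass.
    static List<Element> divsByClass(Element root, String cssClass) {
        List<Element> found = new ArrayList<>();
        NodeList divs = root.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element e = (Element) divs.item(i);
            if (cssClass.equals(e.getAttribute("class"))) {
                found.add(e);
            }
        }
        return found;
    }

    // Independent matching: one flat list per selector, no knowledge of members.
    static List<String> independent(Document doc, String cssClass) {
        List<String> values = new ArrayList<>();
        for (Element e : divsByClass(doc.getDocumentElement(), cssClass)) {
            values.add(e.getTextContent());
        }
        return values;
    }

    // Per-parent matching: visit each member, add "" when the child is missing.
    static List<String> perMember(Document doc, String cssClass) {
        List<String> values = new ArrayList<>();
        for (Element member : divsByClass(doc.getDocumentElement(), "company-team__member")) {
            List<Element> hits = divsByClass(member, cssClass);
            values.add(hits.isEmpty() ? "" : hits.get(0).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(independent(doc, "company-team__name"));     // [Alice, Bob, Carol]
        System.out.println(independent(doc, "company-team__position")); // [CEO, CTO]
        System.out.println(perMember(doc, "company-team__position"));   // [CEO, , CTO]
    }
}
```

With independent matching, the position list has only 2 values and nothing says which name they belong to. With the per-member loop, every list has one entry per member, so index i in each list refers to the same person.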

For a similar reason, there is no bug when you get multiple values even though overwrite is "true". A single DOM directive can match multiple entries, and those are the result of a single operation (1 execution matching 3 values, as opposed to 3 executions matching 1 value each). If you have another DOMTagger that matches something different and stores it in the same field, then overwrite applies and the new set of values replaces the previous set.
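The overwrite behavior described above can be modeled with a small sketch in plain Java. This is only a toy model of the semantics as explained in this comment, not the Importer's actual code: the Metadata class and its store method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OverwriteSemanticsSketch {

    // Hypothetical stand-in for document metadata: field name -> list of values.
    static final class Metadata {
        private final Map<String, List<String>> fields = new HashMap<>();

        // Stores the full result set of ONE directive execution.
        // overwrite=true replaces whatever earlier executions stored;
        // overwrite=false appends to it.
        void store(String field, List<String> matches, boolean overwrite) {
            if (overwrite) {
                fields.put(field, new ArrayList<>(matches));
            } else {
                fields.computeIfAbsent(field, k -> new ArrayList<>()).addAll(matches);
            }
        }

        List<String> get(String field) {
            return fields.getOrDefault(field, List.of());
        }
    }

    public static void main(String[] args) {
        Metadata meta = new Metadata();

        // One execution matching 3 values: all 3 are kept, even with overwrite.
        meta.store("MEMBER-NAME", List.of("Alice", "Bob", "Carol"), true);
        System.out.println(meta.get("MEMBER-NAME")); // [Alice, Bob, Carol]

        // A later execution targeting the same field replaces the previous set.
        meta.store("MEMBER-NAME", List.of("Dave"), true);
        System.out.println(meta.get("MEMBER-NAME")); // [Dave]
    }
}
```

In this model, overwrite only decides what happens between executions; it never trims the set of values produced by a single execution.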

So you are probably better off using another approach. I'm not sure which one is best or easiest for you, but here are a few ideas:

Hopefully one of these solutions will work for you. Let me know what you ended up choosing.

liar666 commented 8 years ago

Hi, thanks for the quick, clear, and detailed answers!

I was indeed thinking about doing something like your 3rd proposal. I was even wondering if I could directly apply a DOMTagger to the restricted "MEMBER-HTML" field instead of using regexes (with a ReplaceTagger)(*)?

Indeed, my committer rebuilds a "Company" object that is composed of a set of "members" (each having a "name", a "picture", and a "position") and an (optional) "parent" company. It would therefore be great to have some kind of "recursive"/divide-and-conquer approach, where I would write an "organization" tagger that splits the original HTML into pieces: one for each "member" (on which I could apply a "member" tagger) and one for the "parent" organization (on which I could apply the same "organization" tagger as for the original HTML page).

(*) I imagine that could be done by adding a "fromField" to the DOMTagger, in the same way as for the ReplaceTagger?
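To make the recursive idea concrete, here is a rough sketch in plain Java (standard library only). Everything in it is an assumption for illustration: the simplified XML in SAMPLE, the element names, and the Member and Organization classes are hypothetical, and nothing here uses or mirrors the Importer API. An "organization" step splits its piece into "member" pieces and an optional nested "organization" piece (the parent), then applies itself again to that nested piece.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class RecursiveOrganizationSketch {

    static final class Member {
        final String name;
        final String position; // "" when the piece has no position element
        Member(String name, String position) { this.name = name; this.position = position; }
    }

    static final class Organization {
        final String name;
        final List<Member> members = new ArrayList<>();
        Organization parent; // null when there is no parent organization
        Organization(String name) { this.name = name; }
    }

    // Hypothetical input: an organization with 2 members and a parent organization.
    static final String SAMPLE =
          "<organization><name>Acme Labs</name>"
        + "<member><name>Alice</name><position>CEO</position></member>"
        + "<member><name>Bob</name></member>"
        + "<organization><name>Acme Holding</name></organization>"
        + "</organization>";

    // Direct child elements of parent with the given tag name (no deeper descendants).
    static List<Element> children(Element parent, String tag) {
        List<Element> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && tag.equals(n.getNodeName())) {
                out.add((Element) n);
            }
        }
        return out;
    }

    // Text of the first direct child with the given tag, or "" if there is none.
    static String childText(Element parent, String tag) {
        List<Element> hits = children(parent, tag);
        return hits.isEmpty() ? "" : hits.get(0).getTextContent();
    }

    // "Member" step: works only on one member piece.
    static Member parseMember(Element piece) {
        return new Member(childText(piece, "name"), childText(piece, "position"));
    }

    // "Organization" step: split into member pieces plus an optional parent
    // piece, and apply the same step recursively to the parent piece.
    static Organization parseOrganization(Element piece) {
        Organization org = new Organization(childText(piece, "name"));
        for (Element m : children(piece, "member")) {
            org.members.add(parseMember(m));
        }
        List<Element> parents = children(piece, "organization");
        if (!parents.isEmpty()) {
            org.parent = parseOrganization(parents.get(0));
        }
        return org;
    }

    static Organization parse(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        return parseOrganization(root);
    }

    public static void main(String[] args) throws Exception {
        Organization org = parse(SAMPLE);
        System.out.println(org.name + " has " + org.members.size()
                + " members, parent: " + org.parent.name);
    }
}
```

Because each member is parsed from its own piece, a missing position simply becomes an empty string for that member, and the alignment problem from the original request does not arise.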

essiembre commented 8 years ago

I like that idea, but since fields/metadata properties are "flat" (non-hierarchical), how would you represent that safely? In any case, I like the idea of having a fromField for DOMTagger. Since you closed this one, I encourage you to open a new ticket about this new feature you are suggesting. I'm taking note of it regardless. :-)