Feature Request - api to return character offsets of non-boilerplate text

GoogleCodeExporter commented 9 years ago

The Highlighter returns the non-boilerplate text. Is there a way to return the 
character offsets of the non-boilerplate text in the original HTML? That would 
be very useful for me.

Currently, the tool is quite useful as a pre-processor that you pass HTML into 
and get back clean plaintext, which you can then pass to an indexing pipeline. 

I need to take this a step further and be able to mark up an HTML page with 
"interesting terms", i.e. terms that I find in my controlled vocabulary. So I 
figured that I could use boilerpipe in this manner:

1) pass boilerpipe to the HTML highlighter
2) find non-boilerplate text in the HTML (i.e. the begin and end character 
offsets of each block).
3) pass each of these blocks into my application that finds matches in my 
controlled vocabulary and record character offsets.
4) return the original HTML page decorated with the annotations from my 
controlled vocabulary (using offsets found in 2 and 3 to compute the positions 
to decorate).
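
Steps 3 and 4 could look roughly like the following Python sketch. It assumes the block offsets from step 2 are already available, which is exactly what this request asks for. Nothing here is boilerpipe API: `annotate`, the `<span class="term">` markup and the sample data are hypothetical, and the sketch ignores tags nested inside a block.

```python
import re

def annotate(html, blocks, vocabulary):
    """Wrap vocabulary terms found inside the given (start, end) blocks.

    blocks holds 0-based offsets into html, end exclusive (hypothetical
    output of step 2).
    """
    if not vocabulary:
        return html
    # Longest terms first, so a longer term wins over a prefix of it.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    # Step 3: record absolute offsets of every match inside each block.
    matches = []
    for start, end in blocks:
        for m in pattern.finditer(html, start, end):
            matches.append((m.start(), m.end()))

    # Step 4: splice markup in from the end so earlier offsets stay valid.
    out = html
    for s, e in sorted(matches, reverse=True):
        out = out[:s] + '<span class="term">' + out[s:e] + "</span>" + out[e:]
    return out

# Sample page: only the <p> content is treated as non-boilerplate.
html = "<div>NAV text</div><p>Boilerpipe extracts text.</p>"
block = (html.index("Boilerpipe"), html.index("</p>"))
print(annotate(html, [block], ["boilerpipe", "text"]))
```

Inserting from the last match backwards avoids recomputing offsets after each insertion; the "text" inside the navigation `<div>` is left alone because it lies outside the block.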

Currently the closest I can get to this is via the highlighter. But I don't see 
a way to get the character positions from the highlighted text.

Any pointers, suggestions, or a new API to do this would be greatly appreciated.

I am using boilerpipe-1.1.0.

Thanks very much,
Sujit

Original issue reported on code.google.com by sujitatg...@gmail.com on 19 Jun 2011 at 9:25

GoogleCodeExporter commented 9 years ago

Sorry, in step (1) above, it should read:
1) pass HTML to boilerpipe's HTML highlighter

Original comment by sujitatg...@gmail.com on 19 Jun 2011 at 9:28

GoogleCodeExporter commented 9 years ago

Matching at term-level is out of scope for boilerpipe.

See Lucene's Highlighter class for a starting point:
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/highlight/Highlighter.html

Original comment by ckkohl79 on 21 Mar 2012 at 9:30

Changed state: WontFix
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

Hi, if this is not a complete reject... :-)

I am not asking for matching at term level for boilerpipe. I am asking for 
the character offsets (with respect to the input text) of the non-boilerplate 
output returned by boilerpipe (i.e. step 2 in the original post).

So assuming an input text:
# 0         1         2         3         4         5
# 012345678901234567890123456789012345678901234567890123456789
  BOILERPLATEsome good textMORE BOILERPLATE....

the output of boilerpipe is:
some good text

I am asking for a way to say that "some good text" starts at position 11 and 
ends at position 24 in the original text. The rest I can do in my application.
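
For this toy string the offsets can be recovered with a plain substring search, as in the Python sketch below (0-based, end exclusive). This is only an illustration, not boilerpipe's API, and `find_span` is a hypothetical helper. On real HTML the extracted text is interrupted by tags and entities, so a substring search would fail, which is why offsets reported by the extractor itself would be needed.

```python
def find_span(original, extracted):
    """Return (start, end) of extracted within original, or None.

    Offsets are 0-based and end is exclusive.
    """
    start = original.find(extracted)
    if start == -1:
        return None
    return start, start + len(extracted)

original = "BOILERPLATEsome good textMORE BOILERPLATE...."
span = find_span(original, "some good text")
print(span)  # (11, 25)
```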

I ask because I think this is already being done by the boilerpipe highlighter, 
so the information exists, but I couldn't figure out a way to get to it.

Thanks again,
Sujit

Original comment by sujitatg...@gmail.com on 21 Mar 2012 at 11:09

aschaeffer / boilerpipe

Feature Request - api to return character offsets of non-boilerplate text #25