eccsup / jwpl

Automatically exported from code.google.com/p/jwpl

de.tudarmstadt.ukp.wikipedia.parser.Link.getText may return empty string #96

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I noticed that when a page has a category link like the first one below, 
getText() returns an empty string. Take, for example, the 'Anarchism' page. It 
has six categories defined in its wikitext:
[[Category:Anarchism| ]]
[[Category:Political culture]]
[[Category:Political ideologies]]
[[Category:Social theories]]
[[Category:Anti-fascism]]
[[Category:Greek loanwords]]

The following code 
for (Link link : page.getCategories()) {
  System.out.println(">" + link.getText() + "<");
}

will print:
><
>Category:Political culture<
>Category:Political ideologies<
>Category:Social theories<
>Category:Anti-fascism<
>Category:Greek loanwords<

Note the first line. The text is empty because the string after the | 
character is blank.

I suggest that in such a case getText() return the category "target" itself, 
or the target without the "Category:" prefix.
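
The suggested fallback could look roughly like the sketch below. It is only an illustration, not the parser's actual code: the class CategoryLinkText and the method displayText are hypothetical names, and the sketch assumes the caller already has the link's text and target as plain strings. It treats a null or blank text as "empty" and, in that case, returns the target with a leading "Category:" prefix removed.

```java
// Hypothetical helper, not part of JWPL: picks a display text for a category link.
final class CategoryLinkText {
    private static final String CATEGORY_PREFIX = "Category:";

    private CategoryLinkText() {
        // static helper only
    }

    // Returns the link text if it is non-blank; otherwise falls back to the
    // target, stripped of a leading "Category:" prefix (case-insensitive).
    static String displayText(String text, String target) {
        if (text != null && !text.trim().isEmpty()) {
            return text;
        }
        if (target == null) {
            return "";
        }
        String trimmed = target.trim();
        boolean hasPrefix = trimmed.regionMatches(
                true, 0, CATEGORY_PREFIX, 0, CATEGORY_PREFIX.length());
        if (hasPrefix) {
            return trimmed.substring(CATEGORY_PREFIX.length()).trim();
        }
        return trimmed;
    }
}
```

With this sketch, CategoryLinkText.displayText("", "Category:Anarchism") yields "Anarchism", while a link that already has a non-blank text keeps that text unchanged. Whether the prefix should be stripped or kept is a design choice; the sketch strips it only in the fallback case.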

What version of the product are you using? On what operating system?
Running latest release (0.9.1) on Linux.

Original issue reported on code.google.com by jbab...@gmail.com on 20 May 2012 at 12:28

GoogleCodeExporter commented 9 years ago
This is more of an enhancement request than a bug report.

Original comment by jbab...@gmail.com on 20 May 2012 at 12:28

GoogleCodeExporter commented 9 years ago
Thanks for the report. I will look into it and make the suggested change.

However, be aware that as of the next release of JWPL, the parser will no 
longer be supported. It has been moved into its own module.
We will still apply patches provided by the community, but we will not develop 
the parser any further.
We now use the Sweble parser (www.sweble.org), which we also integrated into 
JWPL Core.

Original comment by oliver.ferschke on 29 May 2012 at 10:17

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 29 May 2012 at 10:23