Parser includes some wiki artefacts in page.getPlainText() method

GoogleCodeExporter commented 9 years ago

Hi,
I stumbled across some strange results while iterating over the dewiki-20070206.sql 
database with JWPL, using the page.getPlainText() method and counting all words.

The most common words in Wikipedia included:
count:    Word

842554: TEMPLATE
629957: Kategorie
438822: nbsp
256422: thumb
253580: Weblinks

Later on there were also some all-uppercase words that were probably part of a 
template:
http://de.wikipedia.org/wiki/Vorlage:Personendaten

152323: NAME
131362: KURZBESCHREIBUNG
131333: GEBURTSDATUM
131300: GEBURTSORT
131237: ALTERNATIVNAMEN

and so on.

What is the expected output? What do you see instead?
I expected just normal German words to be the most common words like: 
6424690: der; 5262887: und; 4316465: die; 3693242: in; 2713375: von; 1888157: 
den; 1806153: des; 1578301: mit; 1509434: im; 1466509: ist; 1348733: Die; 
1254947: zu; 1219600: das; 1218469: dem; 1110328: als; 1083261: für; 1077734: 
auf; 1075940: eine; 1046970: ein; 1011403: wurde; 1009821: sich; 910366: er; 
881106: auch; 842554: TEMPLATE; 814815: an; 714727: aus; 701011: war; 675874: 
Der; 654112: nach; 629957: Kategorie; 616429: bei; 589324: wird; 581816: einer; 
573699: werden; 547424: bis; 530476: sind; 529210: nicht; 525816: durch; 
520091: oder; 518637: am; 503813: 1; 503254: zum; 481658: sie; 466585: es; 
446827: Das; 438822: nbsp;

and so on. As you can see, the above-mentioned words got mixed into my results.

What version of the product are you using? On what operating system?
jwpl_v0.5, Ubuntu 10.04 LTS

Steps to reproduce:

Iterate over the whole database.
Tokenize every article with the Lucene standard tokenizer.
Count all tokens with a TreeMap<String, Integer>.
When finished, put all values in a MultiValueMap and use the count as key 
(basically: <Integer, ArrayList<String>>).
Get a copy of the keySet, sort it, and use the top x values as keys to retrieve 
words from the map.
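
For anyone who wants to reproduce the counting without my full setup, the steps above can be sketched roughly like this. This is not my original code: the class and method names are made up for illustration, a simple regex split stands in for the Lucene standard tokenizer (so token boundaries will differ), a TreeMap of lists stands in for the MultiValueMap, and the input strings stand in for what page.getPlainText() returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the counting steps; not part of JWPL.
class WordCountSketch {

    // Stand-in tokenizer (assumption): split on anything that is not a
    // letter or digit. Lucene's standard tokenizer behaves differently.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count every token over all texts.
    static Map<String, Integer> countTokens(Iterable<String> texts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            for (String token : tokenize(text)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Invert to count -> words, with the highest count first.
    static TreeMap<Integer, List<String>> invert(Map<String, Integer> counts) {
        TreeMap<Integer, List<String>> byCount =
                new TreeMap<>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                   .add(e.getKey());
        }
        return byCount;
    }

    // Use the top x counts as keys and format them like the report.
    static List<String> top(TreeMap<Integer, List<String>> byCount, int x) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            if (lines.size() == x) {
                break;
            }
            lines.add(e.getKey() + ": " + String.join(", ", e.getValue()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample input containing the kind of artefacts reported.
        List<String> texts = Arrays.asList(
                "TEMPLATE der Hund",
                "der TEMPLATE&nbsp;der");
        for (String line : top(invert(countTokens(texts)), 3)) {
            System.out.println(line);
        }
    }
}
```

With real data, the texts would come from calling page.getPlainText() on every page in the database. If artefacts such as TEMPLATE or nbsp are present in that text, they end up in the top counts exactly as in the list above.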

Original issue reported on code.google.com by SamyAt...@googlemail.com on 12 Jan 2011 at 1:25

GoogleCodeExporter commented 9 years ago

Hi

First of all, this is an "API user comment".

I am not sure whether this is a bug (the comment related to page.getPlainText() 
says "return article WITHOUT all wiki markup", but the description says "WITH 
all wiki markup").

In the meantime, you can work around the problem by following the process described 
in the tutorial "T5_CleaningTemplateImage.java", or take a look at 
de.tudarmstadt.ukp.wikipedia.parser.mediawiki 

Regards,
Alberto

Original comment by albar...@gmail.com on 10 Feb 2011 at 5:19

GoogleCodeExporter commented 9 years ago

Original comment by torsten....@gmail.com on 1 Jun 2011 at 6:20

Added labels: Priority-High
Removed labels: Priority-Medium
