attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

"code" tags changed to "codice_N"? #116

Open BrenBarn opened 7 years ago

BrenBarn commented 7 years ago

What is up with the behavior whereby any article text inside <code> tags is replaced by a single "word" of the form "codice_N", where N is the number of the code tag? That is, the first <code> tag is replaced by "codice_1", the second by "codice_2", etc. Looking at the code, it appears the same is done for <math> tags (replaced by "formula_N").
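The behavior described above can be sketched roughly as follows. This is a minimal illustration only, not wikiextractor's actual implementation: the function name `replace_with_placeholders` and the regular expression are assumptions made for the example.

```python
import re


def replace_with_placeholders(text, tag, prefix):
    """Replace each <tag>...</tag> span with "prefix_N", numbering from 1.

    Hypothetical helper written for illustration; it does not come from
    the wikiextractor source.
    """
    pattern = re.compile(
        r"<{0}\b[^>]*>.*?</{0}>".format(tag),
        re.DOTALL | re.IGNORECASE,
    )
    counter = 0

    def substitute(match):
        # Each matched element gets the next number in sequence.
        nonlocal counter
        counter += 1
        return "{}_{}".format(prefix, counter)

    return pattern.sub(substitute, text)


sample = "The <code>if</code> statement, along with <code>else</code> and <code>elif</code>."
print(replace_with_placeholders(sample, "code", "codice"))
# prints: The codice_1 statement, along with codice_2 and codice_3.
```

Calling the same helper with `"math"` and `"formula"` would give the "formula_N" tokens in the same way.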

This doesn't appear to be documented and it's unclear what the purpose of it is. It makes the extracted output quite useless for some articles (like those about programming languages). At the least, there should be an option to turn this behavior off.

attardi commented 7 years ago

The reason is that math formulas do not read well as text, so they are turned into an atomic token. The same is done for code, which also cannot be read as text. Remember that the original purpose of the extractor was to obtain text that could be used to learn language, hence it tries to produce linguistically correct sentences.

BrenBarn commented 7 years ago

But in many cases <code> tags can be read as text. For instance, the Wikipedia article on Python has stuff like this:

The <code>if</code> statement, which conditionally executes a block of code, along with <code>else</code> and <code>elif</code> (a contraction of else-if).

It makes sense to keep <code> in such situations, so it would be good to have an option to do so.

attardi commented 7 years ago

I doubt that "the if statement" sounds like correct English, unless you quote 'if'.

BrenBarn commented 7 years ago

Well, let me back up a bit. You said that the "original purpose" of the extractor is to obtain text to learn language. Is that (still) the only purpose? What I'm using it for is to extract article text that will be meaningful to humans. In that context "The if statement" is much more meaningful than "The codice_1 statement." I'm not suggesting dropping the old behavior completely, just adding an option to retain these elements.
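For concreteness, the kind of option I have in mind might look like the sketch below. The function `extract_code` and its `keep_code` parameter are hypothetical names chosen for this example; they are not an existing wikiextractor function or flag. The default keeps the current placeholder behavior.

```python
import re

# Matches a <code>...</code> element and captures its inner text.
CODE_TAG = re.compile(r"<code\b[^>]*>(.*?)</code>", re.DOTALL | re.IGNORECASE)


def extract_code(text, keep_code=False):
    """Handle <code> elements in article text.

    keep_code=False: replace each element with "codice_N" (old behavior).
    keep_code=True:  drop the tags but keep the text inside them.
    Hypothetical sketch, not taken from the wikiextractor source.
    """
    counter = 0

    def substitute(match):
        nonlocal counter
        if keep_code:
            return match.group(1)
        counter += 1
        return "codice_{}".format(counter)

    return CODE_TAG.sub(substitute, text)


line = "The <code>if</code> statement."
print(extract_code(line))                  # prints: The codice_1 statement.
print(extract_code(line, keep_code=True))  # prints: The if statement.
```

Making the placeholder the default means existing users who rely on the atomic tokens would see no change.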

attardi commented 7 years ago

Sure, an option makes sense.