Open BrenBarn opened 7 years ago
The reason was that math formulas are not good as text, so they are turned into an atomic token. The same is done for code, which also cannot be read as text. Remember that the original purpose of the extractor was to obtain text that could be used to learn language, hence it tries to produce linguistically correct sentences.
But in many cases <code> tags can be read as text. For instance, the Wikipedia article on Python has stuff like this:
The if statement, which conditionally executes a block of code, along with else and elif (a contraction of else-if).
It makes sense to keep the contents of <code> tags in such situations, so it would be good to have an option to do so.
I doubt that "the if statement" reads as correct English unless you quote 'if'.
Well, let me back up a bit. You said that the "original purpose" of the extractor is to obtain text to learn language. Is that (still) the only purpose? What I'm using it for is to extract article text that will be meaningful to humans. In that context "The if statement" is much more meaningful than "The codice_1 statement." I'm not suggesting to completely drop the old behavior, but just have an option to retain these elements.
Sure, an option makes sense.
What is up with the behavior whereby any article text inside <code> tags is replaced by a single "word" of the form "codice_N", where N is the number of the code tag? That is, the first <code> tag is replaced by "codice_1", the second by "codice_2", etc. Looking at the code, it appears the same will be done for <math> tags (replaced by "formula_N"). This doesn't appear to be documented, and it's unclear what the purpose of it is. It makes the extracted output quite useless for some articles (like those about programming languages). At the least, there should be an option to turn this behavior off.
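To make the behavior concrete, here is a minimal Python sketch of the placeholder substitution as described above, together with the kind of opt-out being requested. It is not the extractor's actual code: the function name replace_tags and the keep_code flag are hypothetical, and the sketch assumes that <code> and <math> tags are numbered independently and are not nested.

```python
import re

# Matches <code>...</code> and <math>...</math>, with optional attributes.
# Assumption: tags are not nested inside each other.
TAG_PATTERN = re.compile(
    r"<(code|math)\b[^>]*>(.*?)</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)

# Placeholder prefixes mentioned in this issue.
PREFIXES = {"code": "codice_", "math": "formula_"}


def replace_tags(text, keep_code=False):
    """Replace <code>/<math> elements with numbered placeholder tokens.

    keep_code is a hypothetical option: when True, the text inside
    <code> tags is kept as-is instead of becoming "codice_N".
    """
    counters = {"code": 0, "math": 0}

    def substitute(match):
        tag = match.group(1).lower()
        body = match.group(2)
        if keep_code and tag == "code":
            return body  # keep the readable text, drop only the tags
        counters[tag] += 1
        return PREFIXES[tag] + str(counters[tag])

    return TAG_PATTERN.sub(substitute, text)


sample = (
    "The <code>if</code> statement, along with <code>else</code>; "
    "the area is <math>\\pi r^2</math>."
)
print(replace_tags(sample))
print(replace_tags(sample, keep_code=True))
```

With the default settings the sample sentence comes out as "The codice_1 statement, along with codice_2; the area is formula_1.", while keep_code=True yields "The if statement, along with else; the area is formula_1.", which is the output I would like to be able to get.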