attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Malformed XML/HTML and invalid links #6

Open cifkao opened 9 years ago

cifkao commented 9 years ago

Extracted text is not being escaped, which in some cases results in malformed XML.

For example, the third sentence of Inequality (mathematics) is rendered as:

For the use of the "<" and ">" signs as punctuation, see <a href="Bracket">Bracket</a>.

The correct output would be:

For the use of the "&lt;" and "&gt;" signs as punctuation, see <a href="Bracket">Bracket</a>.

Similarly, the extracted text of Brian Kernighan contains:

The first documented <a href=""Hello, world!" program">"Hello, world!" program</a>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)

which should instead be:

The first documented <a href="&quot;Hello, world!&quot; program">"Hello, world!" program</a>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)

The same applies to page titles in the <doc> elements.

Another issue with links is that most of them are not really hypertext links, but wikilinks. My opinion is that wikilinks should be represented using a different element, e.g. <wikilink>, so the sentence above would become:

The first documented <wikilink page="&quot;Hello, world!&quot; program">"Hello, world!" program</wikilink>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)
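
For illustration, the escaping being asked for here is essentially what Python's xml.sax.saxutils already provides (a minimal sketch, not wikiextractor's own code):

from xml.sax.saxutils import escape, quoteattr

text = 'For the use of the "<" and ">" signs as punctuation, see Bracket.'
title = '"Hello, world!" program'

# escape() turns &, < and > into entities so the element content stays well-formed
print(escape(text))
# quoteattr() also handles quotes and returns a ready-to-use attribute value
print('<a href=%s>%s</a>' % (quoteattr(title), escape(title)))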
attardi commented 9 years ago

The output should be text, not HTML, hence it is correct that HTML entities are converted to characters, exactly as they appear when reading the page. In the case of entities within URLs, they should be URL-encoded, I suppose.

Blemicek commented 9 years ago

Actually, the output is XML, not plain text. It should contain the XML entities for ", &, ', < and > so that it can be parsed later.

attardi commented 9 years ago

The input is XML, the output is plain text. That is the intended use. I use it for extracting a text corpus for performing linguistic analysis: parsing, QA, creating word embeddings, etc. If the content were not converted, you would get a lot of crap in the output, including comments, etc.

I guess I could add an option to avoid conversion, if that helps.


attardi commented 9 years ago

Links are now urlencoded.
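
For illustration, URL-encoding a link target with Python's standard library looks roughly like this (a sketch of the idea, not the actual change):

from urllib.parse import quote

target = '"Hello, world!" program'
# quote() percent-encodes characters such as '"', ',', '!' and spaces
print('<a href="%s">%s</a>' % (quote(target), target))
# -> <a href="%22Hello%2C%20world%21%22%20program">"Hello, world!" program</a>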

cifkao commented 9 years ago

If the output is supposed to be plain text, then it does not make sense to represent links, lists and headings using HTML tags (<a>, <h1>, <li>) and it's impossible to parse such output (what if the actual text of the article contains some of these tags, or worse, the <doc> tag, which is unlikely, but possible?).
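
A quick illustration of the problem: a standard XML parser rejects the first unescaped < it meets (a minimal sketch, independent of wikiextractor):

import xml.etree.ElementTree as ET

snippet = '<doc id="1" title="Inequality (mathematics)">the "<" and ">" signs</doc>'
try:
    ET.fromstring(snippet)
except ET.ParseError as err:
    print(err)  # not well-formed (invalid token) at the stray '<'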

attardi commented 9 years ago

No tags will be present in the output: they all get stripped out, even if one appears in the article text itself. The anchors are only present if you ask for them using the option to preserve links. Use at your own discretion.

cifkao commented 9 years ago

The <a> tags are not the only issue. If the --sections option is used, <li> and <h1>, <h2> etc. are inserted. If the option is not used, section headings and list items are completely removed (which breaks disambiguation pages, for example, where all the interesting information is present as list items).

attardi commented 9 years ago

Same reason: they are inserted if you ask for them. All tables and lists are removed, because they do not form linguistic sentences. If you want to preserve the structure, you need a different tool.

Blemicek commented 9 years ago

It seems that some HTML/XML tags coming from template output are not removed. E.g. in the article HTML element:

<doc id="274393" url="http://en.wikipedia.org/wiki?curid=274393" title="HTML element">
HTML element

An <abbr title="Hyper Text Markup Language">HTML</abbr> element is an individual component of an <a href="HTML">HTML</a> document or <a href="web page">web page</a>, once this has been parsed into the <a href="HTML Document Object Model">Document Object Model</a>. HTML is composed of a <a href="Tree structure">tree</a> of HTML elements and other <a href="Node (computer science)">nodes</a>, such as text nodes. Each element can have <a href="HTML attribute">HTML attributes</a> specified. Elements can also have content, including other elements and text. Many HTML elements represent <a href="semantics">semantics</a>, or meaning. For example, the codice_1 element represents the title of the document.

...

</doc>

(Anyway, it is a bit confusing to use XML/HTML tags in plain text.)

attardi commented 9 years ago

I added abbr to the list of ignoredTags. The case of the article HTML element is a little peculiar, since it is about HTML, hence the text extracted from the page should contain tags. That page however is written using the SyntaxHighlight extension. So now the content of SyntaxHighlight blocks is not converted.
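
For readers unfamiliar with the mechanism: tags on the ignored list have their markup dropped while the enclosed text is kept. A rough illustration of the idea (not the project's actual code; the tag list here is an arbitrary subset):

import re

ignored_tags = ['abbr', 'b', 'big', 'center', 'cite', 'em', 'small', 'span', 'sub', 'sup']

def drop_ignored_tags(text):
    for tag in ignored_tags:
        text = re.sub(r'<%s\b[^>]*>' % tag, '', text)   # opening tag, with any attributes
        text = re.sub(r'</%s>' % tag, '', text)          # matching closing tag
    return text

print(drop_ignored_tags('An <abbr title="Hyper Text Markup Language">HTML</abbr> element'))
# -> An HTML element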

psibre commented 9 years ago

FWIW, the <doc id="..." url="..." title="...">...</doc> output format implies a certain XML affinity. However, the lack of a common single root element makes many XML parsers barf. IMHO, it would make sense to wrap the entire output text file in some top level element.
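
A minimal sketch of that workaround, wrapping the concatenated output in a synthetic root before parsing (it assumes the document text has also been escaped, which is the other half of this issue):

import xml.etree.ElementTree as ET

with open('wiki_00', encoding='utf-8') as f:      # one of the extractor's output files
    wrapped = '<docs>%s</docs>' % f.read()        # the single top-level element suggested above

root = ET.fromstring(wrapped)
for doc in root.iter('doc'):
    print(doc.get('id'), doc.get('title'))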

attardi commented 9 years ago

I agree that it might be confusing. But the format is not meant to be an XML format. If it were to be XML, then all sorts of escaping would have to be done, for instance to handle character entities, etc. But this would defeat the purpose of a text extractor. The output is just text, with tags used to separate the documents. It is meant for easy processing: you can just drop the tags with a one-liner sed script. You are not supposed to use an XML parser, since there is no need for it. Actually the use of an XML parser is definitely discouraged, for the reasons mentioned above.
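
For illustration, a Python equivalent of that sed one-liner, assuming the default layout where each <doc ...> and </doc> sits on its own line:

with open('wiki_00', encoding='utf-8') as f:
    for line in f:
        # drop the <doc ...> and </doc> separator lines, keep the plain text
        if not line.startswith('<doc') and not line.startswith('</doc>'):
            print(line, end='')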

psibre commented 9 years ago

I agree that everything between the <doc...> and </doc> is, and should be, plain text. But the fact that the (sparse) metadata is still encoded in an XML-like way with attributes that do use character entities undermines the effort to avoid XML...

For example, the page for "Weird Al" Yankovic produces something like <doc ... title="&quot;Weird Al&quot; Yankovic">. It seems a bit odd to output XML-like elements with attributes, but to discourage XML parsing to extract the attribute and convert the entities. Why not produce something like JSON instead?
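
A JSON record along those lines might look like the following (purely illustrative; the id and url are placeholders):

import json

record = {
    "id": "0",
    "url": "http://en.wikipedia.org/wiki?curid=0",
    "title": '"Weird Al" Yankovic',
    "text": "first paragraph...\nsecond paragraph...",
}
# json.dumps escapes the embedded quotes itself, so no entity handling is needed downstream
print(json.dumps(record, ensure_ascii=False))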

psibre commented 9 years ago

I realize that my comments are going a bit off-topic and have opened #30 .

nathj07 commented 8 years ago

Hi, First up this is a good tool and I'm generally finding it very useful.

I may be late to the party here but this is a big issue. The presence of < as plain text within the <doc>...</doc> tags causes decoding to break. So when decoding the XML using tokenization it breaks on the presence of < inside the tags, typically with something like XML syntax error on line xx: expected element name after <

It was mentioned above that there could be a flag introduced to handle this so that those characters in that position get escaped. Has any progress been made on this? If need be I'd be happy to help out with that - given a pointer in the right direction.

Thanks

attardi commented 8 years ago

Would it help just enclosing the text within

<![CDATA[ .... ]]>

-- Beppe
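
A sketch of that idea; note that CDATA only helps if any literal ]]> inside the article text is split first, otherwise the section would terminate early:

def cdata_wrap(text):
    # split any ']]>' in the text so it cannot close the CDATA section prematurely
    return '<![CDATA[%s]]>' % text.replace(']]>', ']]]]><![CDATA[>')

print('<doc id="1" title="Bracket">%s</doc>' % cdata_wrap('the "<" and ">" signs'))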


nathj07 commented 8 years ago

An interesting idea; I did think that would work in my use case. However, when I ran some simple tests I ended up with an unexpected EOF error.

I think perhaps a command-line flag to enable escaping of special characters within the <doc>...</doc> content would be the way to go. How does that sound?

nathj07 commented 8 years ago

How does that PR look for this?