eldur / jwbf

Java Wiki Bot Framework is a library to maintain Wikis like Wikipedia based on MediaWiki.
http://jwbf.sourceforge.net/
Apache License 2.0

Invalid XML #60

Closed (retrek closed 8 years ago)

retrek commented 8 years ago

JWBF does not support Wikia. When requesting a page from a Wikia wiki, the exception "java.lang.IllegalArgumentException: Invalid XML" is thrown.

TreffnonX commented 8 years ago

I have the same issue in my project, which uses a plain MediaWiki installation. I traced the problem further: it originates in XmlConverter.getRootElementWithErrorOpt(...) at line 71 (version 3.1.0), where

try{
        Document doc = builder.build(new ByteArrayInputStream(xml.getBytes(Charsets.UTF_8)));

        root = doc.getRootElement();
}...

fails, but the underlying exception is swallowed and neither logged nor reported. The XML string passed to getRootElementWithErrorOpt is never trimmed, yet it starts with several line breaks before the first non-empty line. Maybe the SAXBuilder reads the empty lines as #text nodes and then fails to find the root. Because of the leading line breaks, <?xml version="1.0" encoding="utf-8"?> is also not at the very start of the document, which is, AFAIK, a violation of XML syntax. I suspect that the SAXBuilder fails to deduce a meaningful root object from the string.

I have confirmed this for 3.1.0 (from the 3.x branch). Calling .trim() on the XML string before passing it in fixes the conversion. I won't be able to spend time on working out where the string should actually be trimmed (to do it right) for a couple of days. If anyone solves this in the meantime, I'd appreciate it. Edit: Yep, trim fixes it for me.
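For anyone who wants to reproduce the effect without a wiki at hand, here is a minimal sketch. It uses the JDK's built-in DOM parser as a stand-in for the JDOM SAXBuilder that JWBF uses, so it only illustrates the general XML rule (nothing may precede the XML declaration), not the library's exact code path. The class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of JWBF.
final class LeadingWhitespaceDemo {

  /** Returns the root element name, or null if the parser rejects the input. */
  static String rootNameOrNull(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return doc.getDocumentElement().getNodeName();
    } catch (SAXException | IOException | ParserConfigurationException e) {
      // Unlike the code discussed above, report the cause instead of swallowing it.
      System.err.println("Parse failed: " + e.getMessage());
      return null;
    }
  }

  public static void main(String[] args) {
    // Same shape as the response described above: blank lines, then the declaration.
    String response = "\n\n<?xml version=\"1.0\" encoding=\"utf-8\"?><api><query/></api>";
    System.out.println("untrimmed: " + rootNameOrNull(response));
    System.out.println("trimmed:   " + rootNameOrNull(response.trim()));
  }
}
```

The untrimmed input is rejected because the XML declaration is only allowed at the very start of the document; after trim() the same string parses normally.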

eldur commented 8 years ago

@TreffnonX do you have a sample of the invalid XML from your MediaWiki installation that demonstrates the problem in XmlConverter?

TreffnonX commented 8 years ago

Sadly, it's a company internal wiki. I cannot copy any information or give an example. What I can do is give you a pruned representation of the logged exception:

Exception in thread "main" java.lang.IllegalArgumentException: Invalid XML: 

<?xml version="1.0" encoding="utf-8"?><api><query><general mainpage="Hauptseite" base="http://wiki/mediawiki/index.php/Hauptseite" sitename="xxxxxxxx" generator="MediaWiki 1.12.0" case="first-letter" rights="" lang="de" /></query></api>

    at net.sourceforge.jwbf.core.Optionals.getOrThrow(Optionals.java:29)
    at net.sourceforge.jwbf.mapper.XmlConverter.getRootElementWithError(XmlConverter.java:70)
    at net.sourceforge.jwbf.mediawiki.actions.meta.GetVersion.parse(GetVersion.java:72)
    at net.sourceforge.jwbf.mediawiki.actions.meta.GetVersion.processAllReturningText(GetVersion.java:82)
    at net.sourceforge.jwbf.mediawiki.actions.util.MWAction.processReturningText(MWAction.java:76)
    at net.sourceforge.jwbf.core.actions.HttpActionClient.executeAndProcess(HttpActionClient.java:247)
    at net.sourceforge.jwbf.core.actions.HttpActionClient.get(HttpActionClient.java:224)
    at net.sourceforge.jwbf.core.actions.HttpActionClient.processAction(HttpActionClient.java:160)
    at net.sourceforge.jwbf.core.actions.HttpActionClient.performAction(HttpActionClient.java:139)
    at net.sourceforge.jwbf.core.bots.HttpBot.performAction(HttpBot.java:57)
    at net.sourceforge.jwbf.mediawiki.bots.MediaWikiBot.performAction(MediaWikiBot.java:265)
    at net.sourceforge.jwbf.mediawiki.bots.MediaWikiBot.getPerformedAction(MediaWikiBot.java:270)
    at net.sourceforge.jwbf.mediawiki.bots.MediaWikiBot.getPerformedAction(MediaWikiBot.java:278)
    at net.sourceforge.jwbf.mediawiki.bots.MediaWikiBot.getVersion(MediaWikiBot.java:299)
    at net.sourceforge.jwbf.mediawiki.actions.queries.AllPageTitles.generateRequest(AllPageTitles.java:105)
    at net.sourceforge.jwbf.mediawiki.actions.queries.AllPageTitles.prepareNextRequest(AllPageTitles.java:171)
    at net.sourceforge.jwbf.mediawiki.actions.queries.BaseQuery.doCollection(BaseQuery.java:134)
    at net.sourceforge.jwbf.mediawiki.actions.queries.BaseQuery.hasNext(BaseQuery.java:85)
    at com.inni.rpwikicleanup.App.toFiles(App.java:94)
    at com.inni.rpwikicleanup.App.main(App.java:86)

As is apparent, the newlines after the "Invalid XML: " are the problem. I solved the entire problem by inserting ".trim()" in four places in the XmlConverter:

  @Nonnull
  public static XmlElement getRootElementWithError(final String xml)
  {
    return Optionals.getOrThrow(getRootElementWithErrorOpt(xml.trim()),
        "Invalid XML: " + xml);
  }

  static Optional<XmlElement> getRootElementWithErrorOpt(String xml)
  {
    Optional<String> xmlStringOpt = Optionals.absentIfEmpty(xml.trim());
    if (xmlStringOpt.isPresent())
    {
      SAXBuilder builder = new SAXBuilder();
      org.jdom2.Element root;
      try
      {
        Document doc = builder
            .build(
                new ByteArrayInputStream(xml.trim().getBytes(Charsets.UTF_8)));
        root = doc.getRootElement();
      }
      catch (JDOMException e)
      {
        System.err.println(e.getMessage());
        log.error(xml.trim().replaceAll("\\>\\<", ">\n<"));
        return Optional.absent();
      }

Apparently older MediaWiki installations return an untrimmed response to certain queries, which results in the exception shown above. A simpler solution might be to assign xml = xml.trim(); once, before the first operation.
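A rough sketch of that "trim once" variant, written against the JDK's DOM parser rather than JDOM so that it runs without extra dependencies. TrimOnceXmlParser and parseRootOrThrow are hypothetical names, not JWBF API, and the real XmlConverter returns its own XmlElement type. The sketch also attaches the parser's exception as the cause, so the reason for the failure is no longer lost.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; not the actual JWBF XmlConverter.
final class TrimOnceXmlParser {

  /** Parses the response and returns its root element, or throws IllegalArgumentException. */
  static Element parseRootOrThrow(String rawXml) {
    // Normalise once at the entry point, so every later use sees the same string.
    String xml = rawXml == null ? "" : rawXml.trim();
    if (xml.isEmpty()) {
      throw new IllegalArgumentException("Invalid XML: empty response");
    }
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (SAXException | IOException | ParserConfigurationException e) {
      // Keep the parser's exception as the cause so it appears in the stack trace.
      throw new IllegalArgumentException("Invalid XML: " + xml, e);
    }
  }
}
```

With a single normalisation step there is only one place to maintain, and the error message and the parsed input can no longer drift apart.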

eldur commented 8 years ago

I remember having this problem some years ago; it was caused by a customised template with leading newlines.

But I'll add a patch to repair invalid XML.

eldur commented 8 years ago

I'm inclined to close this issue with that change. OK?

TreffnonX commented 8 years ago

Sure. My issue is already solved. I just reported the issue, so it would not obstruct others ;)