DLA invalid HTML character problem

HughP / olac

Automatically exported from code.google.com/p/olac

DLA invalid HTML character problem #205

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

Records like the following contain invalid HTML characters [1].

  http://www.language-archives.org/item/oai:dfki.de:Malaga

DLA cannot display such records. Instead, it produces a cocoon error page.

[1] http://en.wikipedia.org/wiki/HTML_decimal_character_rendering

Original issue reported on code.google.com by haepal on 17 Dec 2010 at 6:30

GoogleCodeExporter commented 9 years ago

It is possible to detect a record containing invalid characters, but 
deleting or substituting individual characters or fields would have 
unpredictable results. How about we simply omit the offending record and 
report the problem on the integrity checks page for that archive?

Original comment by StevenBird1 on 20 Dec 2010 at 7:55

GoogleCodeExporter commented 9 years ago

Implementation details:

The integrity checker was extended to detect invalid HTML characters. For 
this, a warning was added to the problem code table: IHC (Invalid HTML 
Character).

OLACA can use this information to exclude items containing invalid HTML 
characters.

Original comment by haepal on 3 Feb 2011 at 10:26

Changed state: Started
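
The thread does not show the detection code itself, so the following is only a minimal sketch of one way such a check could work. It assumes that "invalid HTML character" means a code point outside the XML 1.0 `Char` production, appearing either literally or as a numeric character reference; the function names `is_xml_char` and `find_invalid_chars` are hypothetical and not taken from the OLAC code base.

```python
import re
from typing import List

# Matches decimal (&#1;) and hexadecimal (&#x1;) numeric character references.
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_chars(text: str) -> List[int]:
    """Return the disallowed code points found in text.

    Literal characters are reported first, then code points named by
    numeric character references. (Assumed definition of an IHC problem.)
    """
    bad = [ord(ch) for ch in text if not is_xml_char(ord(ch))]
    for match in _CHAR_REF.finditer(text):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_xml_char(cp):
            bad.append(cp)
    return bad
```

Under these assumptions, a metadata value would be flagged with IHC whenever `find_invalid_chars` returns a non-empty list.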

GoogleCodeExporter commented 9 years ago

Fixed (see revision 1569)

As explained in comment #2, an IHC (Invalid HTML Character) warning was added 
to the integrity check. OLACA then looks up the INTEGRITY_CHECK table to 
filter out any records whose metadata elements are marked with IHC.

Original comment by haepal on 4 Feb 2011 at 9:42

Changed state: Fixed
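
The filtering step described above might look roughly like the sketch below. Only the INTEGRITY_CHECK table name and the IHC code come from this thread; the ITEM table, every column name, the `XYZ` problem code, and the use of an in-memory SQLite database as a stand-in for the real database are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: ITEM and all column names are assumptions.
SCHEMA = """
CREATE TABLE ITEM (item_id TEXT PRIMARY KEY);
CREATE TABLE INTEGRITY_CHECK (item_id TEXT, problem_code TEXT);
"""


def displayable_items(conn):
    """Return ids of items that have no IHC row in INTEGRITY_CHECK."""
    rows = conn.execute(
        """
        SELECT i.item_id
        FROM ITEM AS i
        WHERE NOT EXISTS (
            SELECT 1
            FROM INTEGRITY_CHECK AS c
            WHERE c.item_id = i.item_id AND c.problem_code = 'IHC'
        )
        ORDER BY i.item_id
        """
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db():
    """Create an in-memory database with three made-up items."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO ITEM VALUES (?)",
        [("oai:example.org:1",), ("oai:example.org:2",), ("oai:example.org:3",)],
    )
    conn.executemany(
        "INSERT INTO INTEGRITY_CHECK VALUES (?, ?)",
        # Item 2 carries IHC; item 3 carries an unrelated, made-up code.
        [("oai:example.org:2", "IHC"), ("oai:example.org:3", "XYZ")],
    )
    return conn
```

With the demo data, `displayable_items(build_demo_db())` keeps items 1 and 3 and drops item 2, since only IHC rows exclude a record in this sketch.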

GoogleCodeExporter commented 9 years ago

IHC should be marked as an error, not a warning.  The XML we harvest is 
actually ill-formed.  This should count against the archive's overall rating, 
to encourage the archive to fix the problem.

Original comment by StevenBird1 on 10 Feb 2011 at 8:28

Changed state: Accepted
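
To illustrate what "ill-formed" can mean here, the sketch below checks well-formedness with Python's standard `xml.etree.ElementTree` parser. Whether the harvested records fail for exactly this reason is an assumption; the helper name `is_well_formed` and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

For example, `is_well_formed("<r>&#65;</r>")` is true, while `is_well_formed("<r>&#1;</r>")` is false, because U+0001 is not a legal XML 1.0 character even when written as a character reference.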

GoogleCodeExporter commented 9 years ago

Fixed, i.e. made IHC an error (revision 1574).

You can see it at http://www.language-archives.org/checks.py/anla.uaf.edu. It 
will take a while until the fix is propagated to static pages.

Original comment by haepal on 10 Feb 2011 at 8:38

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I'm still a bit mystified by this.  What makes something an invalid HTML 
character? Is the original XML data that we harvested valid according to its 
encoding declaration?  If not, then the problem is prior to HTML. If so, then 
we should be able to make it valid HTML.

Original comment by garyfsim...@gmail.com on 24 Feb 2011 at 10:14