Open GoogleCodeExporter opened 9 years ago
It is possible to detect a record containing invalid characters, but
deleting or substituting individual characters or fields would have an
unpredictable result. How about we simply omit the offending record and
report the problem on the integrity checks page for that archive?
Original comment by StevenBird1
on 20 Dec 2010 at 7:55
Implementation details:
The integrity checker was extended to detect invalid HTML characters. For
this, a warning was added to the problem code table: IHC (Invalid HTML
Character).
OLACA can use this information to exclude items containing invalid HTML
characters.
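
A minimal sketch of what such a detection could look like. This is not the
code from the revision. It assumes that "invalid" means any character outside
the XML 1.0 Char production, and the function names are hypothetical:

```python
import re

# Assumption: a character is "invalid" if it falls outside the XML 1.0
# Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_chars(text):
    """Return (offset, code point) pairs for every disallowed character."""
    return [(m.start(), ord(m.group())) for m in _INVALID_CHAR.finditer(text)]


def problem_code(text):
    """Return "IHC" if the text has a disallowed character, else None."""
    return "IHC" if _INVALID_CHAR.search(text) else None
```

For example, a vertical tab (U+000B) inside a metadata value would be
flagged, while tabs and newlines would pass.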
Original comment by haepal
on 3 Feb 2011 at 10:26
Fixed (see revision 1569)
As explained in comment #2, an IHC (Invalid HTML Character) warning was added
to the integrity check. OLACA then looks up the INTEGRITY_CHECK table to
filter out any records whose metadata elements are marked with IHC.
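
A rough sketch of that lookup, for illustration only. The INTEGRITY_CHECK
table name and the IHC code are as described above, but the column names
(record_id, element, problem_code), the use of SQLite, and the function name
are assumptions rather than the actual schema or code:

```python
import sqlite3


def records_without_ihc(conn, record_ids):
    """Drop every record that has at least one element marked IHC."""
    # Hypothetical columns: record_id, problem_code.
    rows = conn.execute(
        "SELECT DISTINCT record_id FROM INTEGRITY_CHECK "
        "WHERE problem_code = ?",
        ("IHC",),
    )
    flagged = {row[0] for row in rows}
    return [rid for rid in record_ids if rid not in flagged]


# Small in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE INTEGRITY_CHECK "
    "(record_id TEXT, element TEXT, problem_code TEXT)"
)
conn.executemany(
    "INSERT INTO INTEGRITY_CHECK VALUES (?, ?, ?)",
    [("r2", "title", "IHC"), ("r3", "date", "OTHER")],
)
kept = records_without_ihc(conn, ["r1", "r2", "r3"])
```

The whole record is dropped as soon as one of its elements is flagged, which
matches the proposal to omit the record rather than repair single fields.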
Original comment by haepal
on 4 Feb 2011 at 9:42
IHC should be marked as an error, not a warning. The XML we harvest is
actually ill-formed. This should count against the archive's overall rating,
to encourage the archive to fix the problem.
Original comment by StevenBird1
on 10 Feb 2011 at 8:28
Fixed, i.e. made IHC an error (revision 1574).
You can see it at http://www.language-archives.org/checks.py/anla.uaf.edu. It
will take a while until the fix is propagated to static pages.
Original comment by haepal
on 10 Feb 2011 at 8:38
I'm still a bit mystified by this. What makes something an invalid HTML
character? Is the original XML data that we harvested valid according to its
encoding declaration? If not, then the problem is prior to HTML. If so, then
we should be able to make it valid HTML.
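
One way to tell the two cases apart, sketched under assumptions: decode the
raw harvested bytes using the encoding named in the XML declaration and see
whether that succeeds. The function names are hypothetical, and the sketch
ignores byte order marks and UTF-16, falling back to UTF-8 when no encoding
is declared:

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(raw):
    """Return the declared encoding name, or "utf-8" if none is declared."""
    m = _DECL.match(raw)
    return m.group(1).decode("ascii") if m else "utf-8"


def decodes_cleanly(raw):
    """True if the bytes decode without error under the declared encoding."""
    try:
        raw.decode(declared_encoding(raw))
    except (UnicodeDecodeError, LookupError):
        # LookupError covers encoding names Python does not recognise.
        return False
    return True
```

If this returns False, the problem is prior to HTML. A record can pass this
check and still contain characters that XML does not allow, such as most
control characters, so it answers only the encoding half of the question.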
Original comment by garyfsim...@gmail.com
on 24 Feb 2011 at 10:14
Original issue reported on code.google.com by
haepal
on 17 Dec 2010 at 6:30