benwbrum / fromthepage

FromThePage is a wiki-like application for crowdsourcing transcription of handwritten documents.
http://fromthepage.com
GNU Affero General Public License v3.0

Handle angle brackets more gracefully #1357

Open benwbrum opened 5 years ago

benwbrum commented 5 years ago

The Howitt and Fison papers seem to use angle brackets in ways that make our XML encoder error out. We should investigate the workflow for these real-world texts, and make sure that projects already using HTML tags can continue to do so.

Sample erroring texts from the error logs:

TRANSCRIPTION   User    ID: 214673      Email: redacted   Display Name: Margaret T. Newman
TRANSCRIPTION   Collection      ID: 148 Title:Howitt and Fison Papers   Owner Email: redacted
TRANSCRIPTION   Work    ID: 13991       Title: hw0182 Howitt to Cameron  29/11/1899
TRANSCRIPTION   Page    ID: 452426      Position: 2     Title:2
TRANSCRIPTION   Source Text:
BEGIN_SOURCE_TEXT
Letter to A L Cameron
Murrumbong [?  ?] I take the Dieri Tribe (Central Australia) as my example with Diagram I and Diagram II shins [sic] the analagous case of the Wiradjuri Tribe. The former is drawn up from the marriages, [?defunts?] and relationships of actual individuals. The latter is drawn up from the marriages & [?desunts?] [?   ?] fr the law of the subclasses:
Diagram I
(4) Grandfather           Grandfather (8)
Grandmother               Grandmother 
(3) (mother mother)    (mother moth (7) [sic] 
(2) mother                      mother (6) (6) [sic]
(1) man  <[?110 ?] 🔙nana> woman 5 [sic] 
Diagram II
(4) I [?sr?]ai       Ruth (8) 
(3) Rubbretia     Ifall (7) Ip — mui -  Rbo — Kula - 
(2) Rubbretia (crossed out) Matha    Brother (6)
(1) Kulbi   —      I[?p?]atter (5) 
What I desire to learn is  whether the Diagram II represents
the same rule a [sic] Diagram I. That is whether the man (1) is permitted to mary [sic] or [?unflits?] proper (all crossed out from and including the word ‘man’) the woman (5) is the usul [sic] or proper if wife of her mother’s (6) - mother’s (7) - brother’s ((4) daughter’s (2) son (1).[sic] Assuming [?let. that?] this is not them I should expect to find that the father of
(5) promised her to (1) — unless the Kumilarin practice is [?not?]of the 
END_SOURCE_TEXT

E, [2019-06-09T06:20:11.421807 #7679] ERROR -- : TRANSCRIPTION  2019-06-09 06:20:11 +0000       ERROR   EXCEPTION       malformed XML: missing tag start
Line: 4
Position: 1468
Last 80 unconsumed characters:
<[?110 ?] 🔙nana> woman 5 [sic] <lb/>Diagram II<lb/>(4) I [?sr?]ai       Ruth (8) 
E, [2019-06-09T06:20:11.422020 #7679] ERROR -- : /usr/local/rvm/rubies/ruby-2.3.7/lib/ruby/2.3.0/rexml/parsers/baseparser.rb:375:in `pull_event'
/usr/local/rvm/rubies/ruby-2.3.7/lib/ruby/2.3.0/rexml/parsers/baseparser.rb:185:in `pull'
/usr/local/rvm/rubies/ruby-2.3.7/lib/ruby/2.3.0/rexml/parsers/treeparser.rb:23:in `parse'
/usr/local/rvm/rubies/ruby-2.3.7/lib/ruby/2.3.0/rexml/document.rb:288:in `build'
/usr/local/rvm/rubies/ruby-2.3.7/lib/ruby/2.3.0/rexml/document.rb:45:in `initialize'
/home/fromthepage/deployment/releases/20190516185510/app/models/xml_source_processor.rb:337:in `new'
/home/fromthepage/deployment/releases/20190516185510/app/models/xml_source_processor.rb:337:in `update_links_and_xml'
/home/fromthepage/deployment/releases/20190516185510/app/models/xml_source_processor.rb:102:in `wiki_to_xml'
/home/fromthepage/deployment/releases/20190516185510/app/models/xml_source_processor.rb:86:in `process_source'
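
The failure in the trace above can probably be reproduced outside the application. The following is a minimal sketch, not FromThePage code: the `<page>`/`<p>` wrapper and the `parse_error_for` helper are assumptions made for illustration, and the sample string is a shortened stand-in for the failing line. It only shows that REXML rejects a `<` that is not followed by a valid tag name.

```ruby
require "rexml/document"

# Shortened stand-in for the failing line. The <page>/<p> wrapper is an
# assumption, not the markup FromThePage actually generates.
xml = "<page><p>(1) man <[?110 ?] nana> woman 5 [sic] <lb/>Diagram II</p></page>"

# Hypothetical helper: returns the parse exception, or nil if the XML is well formed.
def parse_error_for(xml)
  REXML::Document.new(xml)
  nil
rescue REXML::ParseException => e
  e
end

error = parse_error_for(xml)
puts error.class if error
```

With the stray brackets written as `&lt;` and `&gt;`, the same string parses without error, which suggests the fix belongs before the text reaches `REXML::Document.new`.
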

bencomp commented 4 years ago

Since we fixed our email setup, I have been getting emails about pages failing to save because of angle brackets (and lines of = being interpreted as wiki markup). Part of the solution must be better instructions, but I wouldn't mind more graceful handling of unintended markup.
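
One possible shape for that more graceful handling is sketched below: escape any angle bracket that is not part of a recognized tag before the text is converted to XML. This is an illustration under stated assumptions, not what FromThePage does. `ALLOWED_TAGS`, `ANGLE_TOKEN`, and `escape_stray_angle_brackets` are hypothetical names, and the allow-list is a placeholder because the set of tags the application accepts is not stated in this thread.

```ruby
# Hypothetical allow-list: the tags FromThePage actually accepts are not
# listed in this issue.
ALLOWED_TAGS = %w[b i u s sub sup lb p].freeze

# Matches either something shaped like a tag (opening, closing, or
# self-closing, with optional quoted attributes) or any single angle bracket.
ANGLE_TOKEN = %r{
  </?(?<name>[A-Za-z][A-Za-z0-9]*)
  (?:\s+[A-Za-z_][\w.:-]*\s*=\s*(?:"[^"<>]*"|'[^'<>]*'))*
  \s*/?>
  |
  [<>]
}x

# Keeps allow-listed tags untouched and escapes every other angle bracket.
def escape_stray_angle_brackets(text)
  text.gsub(ANGLE_TOKEN) do |token|
    name = Regexp.last_match[:name]
    if name && ALLOWED_TAGS.include?(name.downcase)
      token
    else
      token.gsub("<", "&lt;").gsub(">", "&gt;")
    end
  end
end

puts escape_stray_angle_brackets("(1) man <[?110 ?] nana> woman 5 [sic] <lb/>Diagram II")
# prints: (1) man &lt;[?110 ?] nana&gt; woman 5 [sic] <lb/>Diagram II
```

An allow-list keeps projects that deliberately use HTML tags working, at the cost of escaping any tag nobody thought to list. It does not address the lines of = that are read as wiki markup.
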

Maybe at some point a rich text editor could help by parsing the text before it is saved?

benwbrum commented 3 years ago

The most recent release of FromThePage includes a code editor with syntax highlighting. I'm still not sure it would rescue the Howitt and Fison example, but we should test it again.