Closed saracarl closed 1 month ago
I don't really see why the validations are failing at all -- it looks like we validate length, but I'm not sure exactly what validates :intro_block, :html does.
Illegal character "&" in raw string "This is a collection of reports about illicit gold buying (aka I.G.B.) belonging to R. W. Schumacher, senior personnel for H. Eckstein & Co. (later known as Corner House), one of Johannesburg, South Africa’s most influential gold mining companies. Most of the contents are letters from Mr. Clancy to the Transvaal Town Police about amalgam (unwrought gold) thefts at Crown Mines. These documents are shared for community transcription with permission from the "
Line: 1
Position: 684
Last 80 unconsumed characters:
<a href='https://researcharchives.wits.ac.za/barlow-world-rand-mines-archive?_gl=
Initially, the issue that prompted us to introduce this HTML validator was that some exports were failing due to invalid HTML syntax (missing closing tags when copy was cut off due to length, etc.). The offending line is: doc = REXML::Document.new(preprocessed)
So although we are trying to validate HTML, the validator is mostly geared towards preventing errors in export. With that in mind, I used the same line to validate in html_validator.rb, only wrapping it in a begin/rescue block, so that if validation fails, their export would certainly fail too.
My implementation missed one thing, though: I did not see the part where you preprocess the text before passing it through REXML, which is why characters such as & in the example above make the validation fail.
I have now added that missing preprocessing step to the HTML validation.
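To make the approach above concrete, here is a minimal sketch of a validator that preprocesses the text and then wraps the same REXML::Document.new call in a rescue. The helper names escape_bare_ampersands and parseable_html? are hypothetical, and the ampersand escaping is only an assumed stand-in for the real export preprocessing, which is not shown in this thread.

```ruby
require "rexml/document"

# Hypothetical helper (an assumption, not the project's actual preprocessing):
# escape ampersands that do not already start an entity reference, so REXML
# does not reject text such as "H. Eckstein & Co."
def escape_bare_ampersands(text)
  text.gsub(/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+|#x[0-9a-fA-F]+);)/, "&amp;")
end

# Hypothetical validator: true when the preprocessed text parses with REXML,
# false when REXML raises a parse error (for example, a missing closing tag).
def parseable_html?(text)
  REXML::Document.new("<html>#{escape_bare_ampersands(text)}</html>")
  true
rescue REXML::ParseException
  false
end
```

With this sketch, a description containing a bare & should pass, while markup that was cut off before its closing tag should still be rejected, which is the case the export actually trips over.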
How did you get the error about the ampersand? Is there a way to expose that to users when they are manually entering invalid HTML?
I tried it locally and ran the code manually in rails c. This block in particular:
require "rexml/document" # already loaded in rails c
text = "H. Eckstein & Co." # excerpt of the text Sarah gave
REXML::Document.new("<html>#{text}</html>")
It raises an error with a traceback; I don't think we can display the result in the frontend.
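For what it's worth, the exception can be rescued and reduced to a single line, which might be one way to get something user-facing out of it. This is only a sketch: html_parse_error is a hypothetical helper, and it assumes the text is wrapped in <html> tags as in the snippet above. It relies on REXML::ParseException exposing the wrapped error through continued_exception.

```ruby
require "rexml/document"

# Hypothetical helper: returns nil when the text parses, otherwise a one-line
# description of the first problem REXML reported.
def html_parse_error(text)
  REXML::Document.new("<html>#{text}</html>")
  nil
rescue REXML::ParseException => e
  # When REXML wraps another error (as with the illegal "&"), the wrapped
  # exception carries the short message; otherwise use the first line of the
  # ParseException message, before the Line/Position details.
  cause = e.continued_exception
  (cause ? cause.message : e.message).lines.first.to_s.strip
end
```

Calling html_parse_error on a description with a bare & should return a string mentioning the illegal character. Note that REXML quotes the whole offending text in that message, so it would probably need truncating before being shown in a form error.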
Pity - it really seems like it would be useful. I think we do catch something like this when people save a transcription; let me check...
Trying to make a collection private is leading to a 422 error:
The logs show:
So I ran their description through an online HTML validator:
Full description, fyi:
So there are two problems here:
1) We fail, rather than giving the user a useful message. 2) Their HTML was "good enough". So maybe we need a different validation approach?