benwbrum / fromthepage

FromThePage is a wiki-like application for crowdsourcing transcription of handwritten documents.
http://fromthepage.com
GNU Affero General Public License v3.0

422 error making collection private #4255

Closed saracarl closed 1 month ago

saracarl commented 1 month ago

Trying to make a collection private is leading to a 422 error:

image

The logs show:

I, [2024-07-30T14:47:25.164512 #502328] INFO -- : Started GET "/collection/restrict_collection?collection_id=32001027" for 136.62.254.224 at 2024-07-30 14:47:25 +0000
I, [2024-07-30T14:47:25.166849 #502328] INFO -- : Processing by CollectionController#restrict_collection as HTML
I, [2024-07-30T14:47:25.167053 #502328] INFO -- : Parameters: {"collection_id"=>"32001027"}
I, [2024-07-30T14:47:25.180853 #502328] INFO -- : Completed 500 in 14ms (ActiveRecord: 2.3ms | Allocations: 6996)
F, [2024-07-30T14:47:25.184730 #502328] FATAL -- :
ActiveRecord::RecordInvalid (Validation failed: Description invalid html syntax):

app/controllers/collection_controller.rb:395:in `restrict_collection'
app/controllers/application_controller.rb:64:in `switch_locale'

So I ran their description through an online HTML validator:

image

Full description, fyi:

<p>This is a collection of reports about illicit gold buying (aka I.G.B.) belonging to R. W. Schumacher, senior personnel for H. Eckstein & Co. (later known as Corner House), one of Johannesburg, South Africa’s most influential gold mining companies. Most of the contents are letters from Mr. Clancy to the Transvaal Town Police about amalgam (unwrought gold) thefts at Crown Mines. These documents are shared for community transcription with permission from the <a href="https://researcharchives.wits.ac.za/barlow-world-rand-mines-archive?_gl=1*1suq9w*_ga*NDMzNTkxNjA1LjE3MTc2MTY1OTE.*_ga_JPCF6M80CQ*MTcxNzYxNjU5MS4xLjEuMTcxNzYxNjU5Mi41OS4wLjI5Mzk2ODQzNg.."  target="_blank">Barlow World Rand Mines Archives</a>, in Johannesburg, South Africa, where this collection is held.</p>

<p>Please help with transcribing and proof-reading transcription! <b>Note: Page 5 is a bad scan.</b></p>

<p>This collection contains content that may be offensive. Some users may find it difficult to read and transcribe. Please view <a href="https://www.library.dartmouth.edu/digital/policies/content"  target="_blank">Dartmouth Library's Statement on Potentially Harmful Content.</a></p>

<p><a href="https://docs.google.com/presentation/d/1deGxa9q9fcPF0InZNxJiBI6UgwUb-1U3dPqSVajvcNc/edit#slide=id.p"  target="_blank"> Transcription Tutorial</a></p>

So there are two problems here:

1) We fail rather than giving the user a useful message.
2) Their HTML was "good enough", so maybe we need a different validation approach?

benwbrum commented 1 month ago

I don't really see why the validations are failing at all -- it looks like we validate length, but I'm not sure what validates :intro_block, :html exactly does.

WillNigel23 commented 1 month ago
Illegal character "&" in raw string "This is a collection of reports about illicit gold buying (aka I.G.B.) belonging to R. W. Schumacher, senior personnel for H. Eckstein & Co. (later known as Corner House), one of Johannesburg, South Africa’s most influential gold mining companies. Most of the contents are letters from Mr. Clancy to the Transvaal Town Police about amalgam (unwrought gold) thefts at Crown Mines. These documents are shared for community transcription with permission from the "
Line: 1
Position: 684
Last 80 unconsumed characters:
<a href='https://researcharchives.wits.ac.za/barlow-world-rand-mines-archive?_gl=

Initially, the issue that prompted us to introduce this HTML validator was that exports were failing due to invalid HTML syntax (missing closing tags when copy was cut off due to length, etc.). The offending line is: doc = REXML::Document.new(preprocessed)

So although we are trying to validate HTML, the validation is mostly geared towards preventing errors in export. With that in mind, I used the same line to validate in html_validator.rb, only wrapped in a rescue block, so that if validation fails, their export would certainly fail too.

However, my implementation missed one thing: I did not see the part where the text is pre-processed before being passed to REXML. That is why a bare & character, as in the example above, makes the validation fail.

I implemented that missing part of preprocessing in this html validation.
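
For readers following along, here is a minimal sketch of the idea: escape bare ampersands, then attempt the same REXML parse that the export uses, and treat a parse error as invalid. The helper names (escape_bare_ampersands, parseable_html?) and the regular expression are assumptions for illustration only; the actual preprocessing in FromThePage is not shown in this thread and may differ.

```ruby
require "rexml/document"

# Hypothetical pattern: an "&" that does not start a named or numeric
# entity reference such as &amp;, &#38; or &#x26;.
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#\d+|#x\h+);)/

# Hypothetical preprocessing step: turn bare ampersands into &amp;
# so that REXML does not reject text like "H. Eckstein & Co.".
def escape_bare_ampersands(text)
  text.gsub(BARE_AMPERSAND, "&amp;")
end

# Returns true when the preprocessed text parses as XML inside a wrapper
# element, false when REXML raises a parse error (for example, a missing
# closing tag).
def parseable_html?(text)
  REXML::Document.new("<html>#{escape_bare_ampersands(text)}</html>")
  true
rescue REXML::ParseException
  false
end
```

With this sketch, a description containing "H. Eckstein & Co." would pass, while markup that was cut off before its closing tag would still be rejected.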

benwbrum commented 1 month ago

How did you get the error about the ampersand? Is there a way to expose that to users when they are manually entering invalid HTML?

WillNigel23 commented 1 month ago

I tried it locally, running the code manually in rails c. This block in particular:

text = "..." # the description text Sara gave above

REXML::Document.new("<html>#{text}</html>")

It will throw an error with a traceback; I don't think we can display the result in the frontend.
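
As a possible way to surface the parser's complaint, the exception can be rescued and reduced to a one-line message instead of a full traceback. This is only a sketch: html_parse_error is a hypothetical helper, not something in the FromThePage codebase, and how the message would reach the form (for example via a validation error) is left open.

```ruby
require "rexml/document"

# Hypothetical helper: returns nil when the text parses, otherwise a
# short one-line description of the first problem REXML reported.
def html_parse_error(text)
  REXML::Document.new("<html>#{text}</html>")
  nil
rescue REXML::ParseException => e
  # REXML may wrap a lower-level error; prefer its message when present.
  # The full ParseException message also carries line/position details,
  # so only the first line is kept here.
  cause = e.continued_exception
  (cause ? cause.message : e.message).lines.first.to_s.strip
end

error = html_parse_error("<p>H. Eckstein & Co.</p>")
puts error if error
```

The returned string could then be attached to the record's errors rather than raised, so the user sees what to fix.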

benwbrum commented 1 month ago

Pity - it really seems like it would be useful. I think we do catch something like this when people save a transcription; let me check...

benwbrum commented 1 month ago

Found it: https://github.com/benwbrum/fromthepage/blob/development/app/controllers/transcribe_controller.rb#L252-L266