MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

broken HTML structure #346

Closed sneumann closed 2 years ago

sneumann commented 2 years ago

We have some broken HTML structure, which prevents some clients from scraping content.

I think the <meta charset="UTF-8"> should be <meta charset="UTF-8"/> (closing /).

Yours, Steffen

wget -q -O- https://msbi.ipb-halle.de/MassBank/RecordDisplay?id=PB000123 | xmllint -
-:86: parser error : Opening and ending tag mismatch: link line 43 and head
</head>
       ^
-:123: parser error : Opening and ending tag mismatch: img line 43 and a
    </a>
        ^
-:130: parser error : Opening and ending tag mismatch: input line 43 and form
    </form>
           ^
-:139: parser error : Opening and ending tag mismatch: input line 43 and div
</div>
      ^
-:141: parser error : Opening and ending tag mismatch: form line 43 and div
    </div>
          ^
-:163: parser error : XML declaration allowed only at the start of the document
                    <?xml version="1.0" encoding="UTF-8" standalone="no"?><svg xmlns="http://ww
                         ^
-:237: parser error : Opening and ending tag mismatch: br line 43 and div
                </div>
                      ^
-:243: parser error : AttValue: " or ' expected
                        target=”_blank”>metabolomics-usi visualisation</a>
                               ^
-:243: parser error : attributes construct error
                        target=”_blank”>metabolomics-usi visualisation</a>
                               ^
-:243: parser error : Couldn't find end of Start Tag a line 241
                        target=”_blank”>metabolomics-usi visualisation</a>
                               ^
-:243: parser error : Opening and ending tag mismatch: div line 37 and a
                        target=”_blank”>metabolomics-usi visualisation</a>
                                                                              ^
-:283: parser error : EntityRef: expecting ';'
&nbsp&nbsp119.051&nbsp467.616&nbsp45<br>
     ^
-:283: parser error : EntityRef: expecting ';'
&nbsp&nbsp119.051&nbsp467.616&nbsp45<br>
                 ^
-:283: parser error : EntityRef: expecting ';'
&nbsp&nbsp119.051&nbsp467.616&nbsp45<br>
                             ^
-:283: parser error : EntityRef: expecting ';'
&nbsp&nbsp119.051&nbsp467.616&nbsp45<br>
                                    ^
-:284: parser error : EntityRef: expecting ';'
&nbsp&nbsp123.044&nbsp370.662&nbsp36<br>
     ^
-:284: parser error : EntityRef: expecting ';'
&nbsp&nbsp123.044&nbsp370.662&nbsp36<br>
                 ^
-:284: parser error : EntityRef: expecting ';'
&nbsp&nbsp123.044&nbsp370.662&nbsp36<br>
                             ^
-:284: parser error : EntityRef: expecting ';'
&nbsp&nbsp123.044&nbsp370.662&nbsp36<br>
                                    ^
-:285: parser error : EntityRef: expecting ';'
&nbsp&nbsp147.044&nbsp6078.145&nbsp606<br>
     ^
-:285: parser error : EntityRef: expecting ';'
&nbsp&nbsp147.044&nbsp6078.145&nbsp606<br>
                 ^
-:285: parser error : EntityRef: expecting ';'
&nbsp&nbsp147.044&nbsp6078.145&nbsp606<br>
                              ^
-:285: parser error : EntityRef: expecting ';'
&nbsp&nbsp147.044&nbsp6078.145&nbsp606<br>
                                      ^
-:286: parser error : EntityRef: expecting ';'
&nbsp&nbsp148.048&nbsp113.113&nbsp10<br>
     ^
-:286: parser error : EntityRef: expecting ';'
&nbsp&nbsp148.048&nbsp113.113&nbsp10<br>
                 ^
-:286: parser error : EntityRef: expecting ';'
&nbsp&nbsp148.048&nbsp113.113&nbsp10<br>
                             ^
-:286: parser error : EntityRef: expecting ';'
&nbsp&nbsp148.048&nbsp113.113&nbsp10<br>
                                    ^
-:287: parser error : EntityRef: expecting ';'
&nbsp&nbsp151.039&nbsp125.695&nbsp11<br>
     ^
-:287: parser error : EntityRef: expecting ';'
&nbsp&nbsp151.039&nbsp125.695&nbsp11<br>
                 ^
-:287: parser error : EntityRef: expecting ';'
&nbsp&nbsp151.039&nbsp125.695&nbsp11<br>
                             ^
-:287: parser error : EntityRef: expecting ';'
&nbsp&nbsp151.039&nbsp125.695&nbsp11<br>
                                    ^
-:288: parser error : EntityRef: expecting ';'
&nbsp&nbsp153.018&nbsp10000.000&nbsp999<br>
     ^
-:288: parser error : EntityRef: expecting ';'
&nbsp&nbsp153.018&nbsp10000.000&nbsp999<br>
                 ^
-:288: parser error : EntityRef: expecting ';'
&nbsp&nbsp153.018&nbsp10000.000&nbsp999<br>
                               ^
-:288: parser error : EntityRef: expecting ';'
&nbsp&nbsp153.018&nbsp10000.000&nbsp999<br>
                                       ^
-:289: parser error : EntityRef: expecting ';'
&nbsp&nbsp154.023&nbsp270.265&nbsp26<br>
     ^
-:289: parser error : EntityRef: expecting ';'
&nbsp&nbsp154.023&nbsp270.265&nbsp26<br>
                 ^
-:289: parser error : EntityRef: expecting ';'
&nbsp&nbsp154.023&nbsp270.265&nbsp26<br>
                             ^
-:289: parser error : EntityRef: expecting ';'
&nbsp&nbsp154.023&nbsp270.265&nbsp26<br>
                                    ^
-:290: parser error : EntityRef: expecting ';'
&nbsp&nbsp179.036&nbsp141.192&nbsp13<br>
     ^
-:290: parser error : EntityRef: expecting ';'
&nbsp&nbsp179.036&nbsp141.192&nbsp13<br>
                 ^
-:290: parser error : EntityRef: expecting ';'
&nbsp&nbsp179.036&nbsp141.192&nbsp13<br>
                             ^
-:290: parser error : EntityRef: expecting ';'
&nbsp&nbsp179.036&nbsp141.192&nbsp13<br>
                                    ^
-:291: parser error : EntityRef: expecting ';'
&nbsp&nbsp189.058&nbsp176.358&nbsp16<br>
     ^
-:291: parser error : EntityRef: expecting ';'
&nbsp&nbsp189.058&nbsp176.358&nbsp16<br>
                 ^
-:291: parser error : EntityRef: expecting ';'
&nbsp&nbsp189.058&nbsp176.358&nbsp16<br>
                             ^
-:291: parser error : EntityRef: expecting ';'
&nbsp&nbsp189.058&nbsp176.358&nbsp16<br>
                                    ^
-:292: parser error : EntityRef: expecting ';'
&nbsp&nbsp255.067&nbsp169.007&nbsp15<br>
     ^
-:292: parser error : EntityRef: expecting ';'
&nbsp&nbsp255.067&nbsp169.007&nbsp15<br>
                 ^
-:292: parser error : EntityRef: expecting ';'
&nbsp&nbsp255.067&nbsp169.007&nbsp15<br>
                             ^
-:292: parser error : EntityRef: expecting ';'
&nbsp&nbsp255.067&nbsp169.007&nbsp15<br>
                                    ^
-:293: parser error : EntityRef: expecting ';'
&nbsp&nbsp273.076&nbsp5286.093&nbsp527<br>
     ^
-:293: parser error : EntityRef: expecting ';'
&nbsp&nbsp273.076&nbsp5286.093&nbsp527<br>
                 ^
-:293: parser error : EntityRef: expecting ';'
&nbsp&nbsp273.076&nbsp5286.093&nbsp527<br>
                              ^
-:293: parser error : EntityRef: expecting ';'
&nbsp&nbsp273.076&nbsp5286.093&nbsp527<br>
                                      ^
-:294: parser error : EntityRef: expecting ';'
&nbsp&nbsp274.081&nbsp246.689&nbsp23<br>
     ^
-:294: parser error : EntityRef: expecting ';'
&nbsp&nbsp274.081&nbsp246.689&nbsp23<br>
                 ^
-:294: parser error : EntityRef: expecting ';'
&nbsp&nbsp274.081&nbsp246.689&nbsp23<br>
                             ^
-:294: parser error : EntityRef: expecting ';'
&nbsp&nbsp274.081&nbsp246.689&nbsp23<br>
                                    ^
-:296: parser error : Opening and ending tag mismatch: br line 37 and div
    </div>
          ^
-:331: parser error : Entity 'copy' not defined
        Copyright &copy; 2006 MassBank Project; 2011 <a href="https://www.norman-netwo
                        ^
-:332: parser error : Opening and ending tag mismatch: br line 37 and div
    </div>  
          ^
-:338: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:342: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:346: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:350: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:354: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:358: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:362: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:366: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:370: parser error : Opening and ending tag mismatch: img line 37 and div
        </div>
              ^
-:376: parser error : Opening and ending tag mismatch: div line 37 and body
</body>
       ^
-:377: parser error : Opening and ending tag mismatch: div line 37 and html
</html>
       ^
-:377: parser error : EndTag: '</' not found
</html>

meier-rene commented 2 years ago

Thanks for spotting this. It is impressive how tolerant popular browsers are of broken HTML syntax. I fixed it in dev and rolled it out on the IPB MassBank. I checked the result with https://validator.w3.org. Please note: xmllint is not exactly made for HTML. If you really want to scrape HTML with xmllint, use -html. Even in HTML mode, xmllint complains a bit about some HTML5 tags and heavily about inline SVG, but I know of no better solution. My suggestion: use -html and redirect stderr to /dev/null.

wget -q -O- 'https://msbi.ipb-halle.de/MassBank/RecordDisplay?id=PB000123' | xmllint -html --xpath '//html/body/header' - 2> /dev/null
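
To see the difference between the two modes without hitting the server, here is a minimal sketch that runs xmllint on a small local file. The sample page is hypothetical and only mimics the kinds of problems reported above (an unclosed meta tag, an entity without a trailing semicolon, a bare br); it is not the real MassBank markup. The sketch assumes xmllint is installed and uses the documented double-dash options --noout, --html and --xpath.

```shell
#!/bin/sh
# Hypothetical sample page (NOT the real MassBank markup): an unclosed <meta>,
# an entity without ';' and a bare <br>, similar to the errors in the report.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!DOCTYPE html>
<html>
<head><meta charset="UTF-8"><title>demo</title></head>
<body>
<header><h1>Record PB000123</h1></header>
<p>119.051&nbsp467.616<br></p>
</body>
</html>
EOF

# Default (XML) mode: the page is not well-formed XML, so xmllint exits non-zero.
if xmllint --noout "$sample" 2>/dev/null; then
    xml_ok=yes
else
    xml_ok=no
fi

# HTML mode: the lenient parser recovers; its complaints are sent to /dev/null.
title=$(xmllint --html --xpath 'string(//header/h1)' "$sample" 2>/dev/null)

echo "well-formed XML: $xml_ok"
echo "header text: $title"
rm -f "$sample"
```

If xmllint behaves as described in this thread, the first check should report that the sample is not well-formed XML, while the HTML-mode query should still return the header text. Using string(...) in the XPath expression returns plain text rather than serialized markup, which is usually more convenient for scraping.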