EcoNet-NZ / inaturalist-to-cams

Synchronises observations from iNaturalist to the CAMS Weed App
Apache License 2.0

Unable to synchronise invalid HTML content #86

Open nigelcharman opened 7 months ago

nigelcharman commented 7 months ago

https://github.com/EcoNet-NZ/inaturalist-to-cams/actions/runs/7937935337 fails with the error "Field NotesAndDetails has invalid html content." when synchronising https://www.inaturalist.org/observations/2763972.

NotesAndDetails is shown as:

'NotesAndDetails': '<em>Locality: NEW ZEALAND AK, suburb of Glen Innes, Paddington Reserve (W Tamaki Rd entrance).\r\n\r\n<em>Habitat: One large plant, 3-4 m high. The plant is visible on <a href="https://www.google.co.nz/maps/@-36.870013,174.8628706,3a,21.3y,172.22h,83.33t/data=!3m6!1e1!3m4!1sWNeFx4TNIV6kKLgalvaGuA!2e0!7i13312!8i6656!6m1!1e1">Street View - Nov 2015</a> (above the green transformer). I have uploaded a screen shot, but note that the screen shot shows the plant in Nov 2015.\r\n\r\n<em>Identification: </em><a href="http://naturewatch.org.nz/listed_taxa/5251492">Solanum mauritianum</a><em> Scop., 1788.'

This needs fixing to allow weeds to be synchronised nationally. For now I have reverted to pulling in only Woolly Nightshade observations from Kaipatiki.

nigelcharman commented 7 months ago

Potential fix - https://community.esri.com/t5/arcgis-data-interoperability-blog/writing-html-and-other-amp-lt-and-amp-gt-tagged/ba-p/1116925

nigelcharman commented 6 months ago

https://enterprise.arcgis.com/en/server/10.9.1/administer/windows/best-practices-for-configuring-a-secure-environment.htm states that:

xssInputRule specifies the response when code is detected. The options are rejectInvalid or sanitizeInvalid. The rejectInvalid value is the default and is recommended.

nigelcharman commented 6 months ago

Other potential fixes are:

  1. Always HTML encode the iNaturalist description before writing to CAMS
  2. Check whether the iNaturalist description contains HTML and encode it if it does
  3. Catch the error and encode the HTML if we get this error
  4. Ignore the description if it contains HTML content

This assumes that the iNaturalist description is the only one that we use that can contain HTML. This post states that Markdown is supported on comments, identifications, journal posts, and mostly on user profiles and project descriptions. Presumably HTML is the same?
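A minimal sketch of options 1 and 2, using Python's standard `html.escape`. The function names are hypothetical and not taken from the codebase, and the tag check is only a rough regex heuristic. Whether CAMS accepts the encoded text without raising the same error is an assumption that would need testing.

```python
import html
import re

# Hypothetical helpers for illustration; names are not from inaturalist-to-cams.
# Rough heuristic: "<" followed by a letter or "/" and later a ">".
TAG_PATTERN = re.compile(r"<[a-zA-Z/][^>]*>")


def contains_html(description: str) -> bool:
    """Return True if the text contains something that looks like an HTML tag."""
    return bool(TAG_PATTERN.search(description))


def encode_always(description: str) -> str:
    """Option 1: HTML encode every description unconditionally."""
    return html.escape(description)


def encode_if_html(description: str) -> str:
    """Option 2: HTML encode only when a tag-like pattern is present."""
    if contains_html(description):
        return html.escape(description)
    return description
```

Option 1 is simpler but would also encode a plain `&` or `"` in descriptions that contain no markup at all. Option 2 leaves those descriptions untouched.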

nigelcharman commented 2 months ago

@amazing-will I'm thinking you could reproduce by copying the NotesAndDetails value:

'NotesAndDetails': '<em>Locality: NEW ZEALAND AK, suburb of Glen Innes, Paddington Reserve (W Tamaki Rd entrance).\r\n\r\n<em>Habitat: One large plant, 3-4 m high. The plant is visible on <a href="https://www.google.co.nz/maps/@-36.870013,174.8628706,3a,21.3y,172.22h,83.33t/data=!3m6!1e1!3m4!1sWNeFx4TNIV6kKLgalvaGuA!2e0!7i13312!8i6656!6m1!1e1">Street View - Nov 2015</a> (above the green transformer). I have uploaded a screen shot, but note that the screen shot shows the plant in Nov 2015.\r\n\r\n<em>Identification: </em><a href="http://naturewatch.org.nz/listed_taxa/5251492">Solanum mauritianum</a><em> Scop., 1788.'

nigelcharman commented 2 months ago

For end-to-end testing, you can modify the config/sync_configuration.json file to pull through all Woolly Nightshade observations. Change:

    "Woolly nightshade - Kaipatiki": {
        "file_prefix": "woolly_nightshade_kaipatiki",
        "taxon_ids": ["133287"],
        "place_ids": ["123353"]
    },

to:

    "Woolly nightshade - NZ": {
        "file_prefix": "woolly_nightshade_nz",
        "taxon_ids": ["133287"],
        "place_ids": ["6803"]
    },

amazing-will commented 2 months ago

Reproduced in a feature test using the whole 'NotesAndDetails' field as above.

It seems to be the map reference: <a href="https://www.google.co.nz/maps/@-36.870013,174.8628706,3a,21.3y,172.22h,83.33t/data=!3m6!1e1!3m4!1sWNeFx4TNIV6kKLgalvaGuA!2e0!7i13312!8i6656!6m1!1e1">Street View - Nov 2015</a> that causes the issue. If I remove the end of the URL from /maps/ onwards, it syncs fine.

This suggests a fix would be to check for any map references and remove them.

nigelcharman commented 2 months ago

That seems a bit specific? I wonder if it might fail on other hrefs or other HTML content? Are you able to try some other HTML strings? We could possibly replace the invalid HTML content with a message like INVALID CONTENT DETECTED - see iNaturalist link for full notes

amazing-will commented 2 months ago

Looking into this a bit more, it's the "=" symbol in the URL. href="https://www.otherplace.co.nz/@-aw,wor.d/data,1!23" is okay, but href="https://www.otherplace.co.nz/@-aw,wor.d/data=,1!23" fails.

On that basis, we should search any URL for an = and output an INVALID CONTENT message.
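A rough sketch of that check, assuming URLs only appear in quoted href attributes. The names here are hypothetical, and replacing the whole notes field (rather than just the offending link) is an assumption. The message wording is the one suggested earlier in this thread.

```python
import re

# Message wording as suggested earlier in the thread.
INVALID_CONTENT_MESSAGE = (
    "INVALID CONTENT DETECTED - see iNaturalist link for full notes"
)

# Matches href="..." or href='...' and captures the URL in group 2.
HREF_PATTERN = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def has_equals_in_url(notes: str) -> bool:
    """Return True if any href URL in the notes contains an '=' character."""
    return any("=" in match.group(2) for match in HREF_PATTERN.finditer(notes))


def replace_if_invalid(notes: str) -> str:
    """Swap the whole notes value for the message when a URL contains '='.

    Assumption: the entire field is replaced, not just the offending link.
    """
    return INVALID_CONTENT_MESSAGE if has_equals_in_url(notes) else notes
```

With the two test strings above, the URL ending in /data,1!23 passes through unchanged, while the one ending in /data=,1!23 is replaced by the message.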

nigelcharman commented 2 months ago

Wow. That's kind of like an edge case of an edge case :) Yep, please implement that!

amazing-will commented 2 months ago

Question: are there any other HTML fields we should perhaps check?

nigelcharman commented 2 months ago

> Question: are there any other HTML fields we should perhaps check?

I think it's unlikely. My understanding is that it's only supported on comments, descriptions and posts. https://forum.inaturalist.org/t/useful-html-tags-for-inaturalist-comments-and-other-text-wiki/6198/43

amazing-will commented 2 months ago

<div attrib='word'> </div> will also cause an invalid HTML error in the CAMS write, but <div> </div> is okay. I've added a sanitiseHTML method so we can add anything else as we find it. I just hope the list of cases doesn't keep growing. We could instead remove the HTML altogether.
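If we did go down the route of removing the HTML, a minimal sketch using the standard library's `html.parser` could look like the following. `strip_html` and `_TextExtractor` are hypothetical names, not the existing sanitiseHTML method, and this keeps only the text between tags, so link targets would be lost.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of a document and discards all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags and attributes are ignored.
        self.parts.append(data)


def strip_html(notes: str) -> str:
    """Return the notes with all HTML tags (and their attributes) removed."""
    extractor = _TextExtractor()
    extractor.feed(notes)
    extractor.close()  # flush any buffered trailing text
    return "".join(extractor.parts)
```

This would cover both cases found so far (attributes on tags, and "=" inside an href), since neither survives once the tags are gone.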