NCEAS / metacatui

MetacatUI: A client-side web interface for DataONE data repositories
https://nceas.github.io/metacatui
Apache License 2.0

Invalid Characters Allowed in Metadata Saved by Metacat UI Editor cause catastrophic dataset error #2481

Open vchendrix opened 4 months ago

vchendrix commented 4 months ago

Description

The Metacat UI Editor allowed invalid characters to be saved in metadata. When the Metacat indexer tried to process the metadata file, the following error was encountered:

metacat-index 20240630-23:50:14: [ERROR]: SolrIndex.update - could not update the solr index for the object ess-dive-3619bd077a60b7c-20240624T120319367 since Invalid byte 2 of 4-byte UTF-8 sequence. [edu.ucsb.nceas.metacat.index.SolrIndex:update:656]
org.apache.solr.client.solrj.SolrServerException: Invalid byte 2 of 4-byte UTF-8 sequence.
        at edu.ucsb.nceas.metacat.index.SolrIndex.process(SolrIndex.java:237) ~[classes/:?]
        at edu.ucsb.nceas.metacat.index.SolrIndex.insert(SolrIndex.java:396) ~[classes/:?]
        at edu.ucsb.nceas.metacat.index.SolrIndex.update(SolrIndex.java:697) ~[classes/:?]
        at edu.ucsb.nceas.metacat.index.SolrIndex.update(SolrIndex.java:620) [classes/:?]
        at edu.ucsb.nceas.metacat.index.SystemMetadataEventListener$1.run(SystemMetadataEventListener.java:187) [classes/:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_402]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_402]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_402]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_402]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_402].
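For anyone trying to diagnose a file that triggers this message, here is a minimal Python sketch of how a malformed 4-byte UTF-8 sequence can be located. It is not taken from Metacat or MetacatUI; the function name `first_invalid_utf8` and the sample bytes are made up for illustration, and the actual bytes in the affected EML file may differ.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, reason) for the first malformed UTF-8 sequence, or None.

    Hypothetical helper for illustration only; not part of Metacat.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as err:
        return err.start, err.reason
    return None


# A 4-byte UTF-8 sequence starts with a lead byte in 0xF0..0xF4 and must be
# followed by three continuation bytes in 0x80..0xBF.
good = "\U0001F600".encode("utf-8")          # well-formed 4-byte sequence
bad = b"Step 7: " + b"\xf0" + b"?" + b"\x98\x80"  # byte 2 is not a continuation byte

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

Running a check like this against the raw bytes of the metadata object (rather than a copy that an editor has already re-saved) should point at the offset where parsing would fail.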

The result was that the dataset metadata was not indexed in Solr. However, the resource map was created successfully, rendering the dataset uneditable. The metadata in Solr looked as follows:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"id:ess-dive-3619bd077a60b7c-20240624T120319367",
      "wt":"javabin",
      "version":"2"}},
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "read_count_i":44,
        "id":"ess-dive-3619bd077a60b7c-20240624T120319367",
        "identifier":"ess-dive-3619bd077a60b7c-20240624T120319367",
        "sku":"ess-dive-3619bd077a60b7c-20240624T120319367",
        "_version_":1803150904808964096,
        "serviceCoupling":"false",
        "isService":false,
        "isDocumentedBy":["ess-dive-3619bd077a60b7c-20240624T120319367"],
        "documents":["ess-dive-9725a595229ffc6-20240520T181650760",
          "ess-dive-a947e57390f1fad-20240613T203820095",
          "ess-dive-babae844b274bf2-20240613T212651812",
          "ess-dive-f718cd02247b6b7-20240520T181650806",
          "ess-dive-03a811f10de6c4a-20240613T204125926",
          "ess-dive-3619bd077a60b7c-20240624T120319367",
          "ess-dive-6c73eb2d4ac33cb-20240624T115801116",
          "ess-dive-e87e6b2bb4d0b0d-20240624T115801104",
          "ess-dive-8775aeed8499ba7-20240613T203820082",
          "ess-dive-a2b05a328913511-20240613T203820068",
          "ess-dive-645a4c9d54aacec-20240624T115754244",
          "ess-dive-8641172de4e1937-20240613T210540301",
          "ess-dive-3ac7448d1be1e0f-20240613T210540311",
          "ess-dive-a29fa7c825dea22-20240613T203820108",
          "ess-dive-aac74b2ca73dbee-20240613T203820102",
          "ess-dive-323f59eaa468ca0-20240520T181650795",
          "ess-dive-047dc22f57f82d8-20240624T115801110",
          "ess-dive-f9fd47d9e4c8c34-20240613T203820077",
          "ess-dive-cf5ba5193c8d2ef-20240621T121606390",
          "ess-dive-8742ead85f7c535-20240613T203820088",
          "ess-dive-0d69c0b5a6f7e45-20240613T203820055",
          "ess-dive-35eccae477fcaaa-20240613T203820115",
          "ess-dive-d3ccee76444e6d9-20240624T115801123"],
        "resourceMap":["ess-dive-2c4cdf7a877c0f4-20240624T120319346"],
        "language":""}]
  }
}

Steps to Reproduce

  1. Use Metacat UI Editor to save metadata with invalid characters.
  2. Attempt to index the metadata with Metacat indexer.
  3. Observe the error in the logs as shown above.

Expected behavior

The metadata should be properly encoded as UTF-8 before being saved, ensuring that it can be indexed without errors.
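As a rough illustration of the kind of pre-save check meant here, the Python sketch below flags text that cannot be encoded as strict UTF-8 (for example, a lone surrogate) or that contains characters outside the XML 1.0 `Char` range. The function name `check_text` is hypothetical, MetacatUI itself is JavaScript, and the real root cause of this issue has not been confirmed, so treat this as one possible shape for the validation rather than the fix.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_text(text: str) -> list:
    """Return a list of human-readable problems; empty if the text looks safe.

    Hypothetical helper for illustration only.
    """
    problems = []
    try:
        text.encode("utf-8")  # strict: lone surrogates are rejected
    except UnicodeEncodeError as err:
        problems.append(
            f"not encodable as UTF-8 at index {err.start}: {err.reason}"
        )
    match = _XML_ILLEGAL.search(text)
    if match:
        problems.append(
            f"character U+{ord(match.group()):04X} at index {match.start()} "
            "is not allowed in XML 1.0"
        )
    return problems


print(check_text("Soil moisture at 10 \u00b5m \U0001F600"))
print(check_text("vertical tab \x0b in the methods"))
```

An editor could refuse to submit (or at least warn) when the returned list is non-empty, instead of sending the document to the repository and failing later at indexing time.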

Screenshots

[Image: Screenshot 2024-07-08 at 3 08 58 PM]

Additional context

We recovered from this by using the API directly to upload a new metadata file that the Metacat indexer can parse, and then manually creating the resource map. This fixed the issue enough to allow the dataset to be edited and published. However, the previous version is in a state where it will never be properly indexed. The Metacat UI metadata editor should ensure that the metadata is encoded properly as UTF-8.

mbjones commented 4 months ago

Thanks for the report, @vchendrix. This could be related to #2167 and certainly seems to be in the same category of character encoding problems. Like that bug, our error handling pipeline in MetacatUI seems to miss that Metacat produces an error and silently moves on. This has been a common thread and involves data loss, so I am going to label this as critical. I will discuss this with @robyngit and @rushirajnenuji to try to figure out a path forward. Thanks.

vchendrix commented 4 months ago

No problem. The solution will probably be the same in MetacatUI. The only noticeable difference is that in this case Metacat accepts the update but fails to parse the EML for the Solr index, which was very difficult to remedy. In #2167, Metacat rejects the update, which makes it easier to recover.

mbjones commented 4 months ago

@vchendrix could you attach to this ticket the original EML document that triggers this Solr indexing error? It would be very helpful to be able to reproduce what you mean by "invalid characters" with a concrete reproducible example.

vchendrix commented 4 months ago

Here is the URL: https://data.ess-dive.lbl.gov/catalog/d1/mn/v2/object/ess-dive-3619bd077a60b7c-20240624T120319367

The_importance_of_accounting_for_landscape.xml

vchendrix commented 4 months ago

@mbjones Note that once the file (The_importance_of_accounting_for_landscape.xml) is opened in an editor, the characters are automatically encoded, and I was then able to upload it and have it parse successfully. The characters were garbage, but this sidestepped the error. I suspect the invalid characters are in Step 7 of the Methods.
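One plausible explanation for that behavior, sketched below in Python, is that the editor decoded the file leniently and replaced each undecodable byte with U+FFFD before saving. This is an assumption about what the editor did, not something verified against the attached file; `reencode_lossy` and the sample bytes are hypothetical.

```python
def reencode_lossy(data: bytes) -> bytes:
    """Decode leniently, then re-encode as well-formed UTF-8.

    Undecodable bytes become U+FFFD (the replacement character), so the
    output always parses as UTF-8 but the original characters are lost.
    Hypothetical helper for illustration only.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")


# Made-up sample with a broken 4-byte sequence, not the real EML content.
broken = b"<para>Step 7: \xf0?\x98\x80</para>"
repaired = reencode_lossy(broken)

# Strict decoding now succeeds, but the text contains replacement characters.
print(repaired.decode("utf-8"))
```

This would account for both observations: the re-saved file parses and indexes, and the affected characters show up as garbage.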