NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

Harvest skips XML label with bad prolog #107

Closed jordanpadams closed 1 year ago

jordanpadams commented 1 year ago

🐛 Describe the bug

Per user, product is not currently in the registry, and harvest is skipping the product.

After further investigation, it looks like the XML prolog (specifically the schematypens value) is invalid. If we can't find a way to force XML readers to read this label, we may need another means of parsing it so that at least something is ingested.
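The issue does not say what exactly is wrong with the schematypens value, so a first diagnostic step could be to list the values a label declares. The following is a minimal sketch, not part of Harvest: the function name check_schematypens is hypothetical, it assumes double-quoted pseudo-attributes in the xml-model processing instructions, and the two accepted namespaces (Schematron and XML Schema) are an assumption about what a reader would expect.

```shell
# Hypothetical helper: print every schematypens value found in a label and
# flag those that are not one of the namespaces assumed to be valid here.
# Returns non-zero if any value is unrecognized.
check_schematypens() {
  grep -o 'schematypens="[^"]*"' "$1" |
    sed -e 's/^schematypens="//' -e 's/"$//' |
    (
      status=0
      while IFS= read -r ns; do
        case "$ns" in
          http://purl.oclc.org/dsdl/schematron | http://www.w3.org/2001/XMLSchema)
            echo "ok: $ns" ;;
          *)
            echo "unrecognized: $ns"
            status=1 ;;
        esac
      done
      exit "$status"
    )
}

# Example (assumes the label was downloaded to the current directory):
#   check_schematypens nms_pds_sis.xml
```

This works on the raw text rather than through an XML parser, so it still reports something when a parser rejects the label outright.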

📜 To Reproduce

  1. Try to ingest https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/nms_bundle/document/nms_pds_sis.xml (or the parent collection and all products)
  2. Note that it skips the product
  3. Confirm that the product is not in the registry:
    $ curl -u $registry_user:$registry_pass 'https://search-atm-prod-mkvgzojag2ta65bnotqdpopzju.us-west-2.es.amazonaws.com/registry/_search?q=lidvid:"urn:nasa:pds:ladee_nms:document:nms_pds_sis::1.9"&pretty'
    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 0,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      }
    }

But the same query does work for other products:

$ curl -u $registry_user:$registry_pass 'https://search-atm-prod-mkvgzojag2ta65bnotqdpopzju.us-west-2.es.amazonaws.com/registry/_search?q=lidvid:"urn:nasa:pds:mars2020_meda:data_derived_env::2.0"&pretty'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 12.2451,
    "hits" : [
      {
        "_index" : "registry",
        "_type" : "_doc",
        "_id" : "urn:nasa:pds:mars2020_meda:data_derived_env::2.0",
        "_score" : 12.2451,
        "_source" : {
          "pds:Citation_Information/pds:description" : "Mars 2020 Mars Environmental Dynamics Analyzer (MEDA) Environmental Derived Data Collection",
          "pds:Collection/pds:collection_type" : "Data",
          "pds:Field_Delimited/pds:name" : [
            "Member Status",
            "LIDVID_LID"
          ],
          "lid" : "urn:nasa:pds:mars2020_meda:data_derived_env",
          "pds:Primary_Result_Summary/pds:purpose" : "Science",
          "pds:Field_Delimited/pds:description" : [
            "P indicates primary member of the collection S indicates secondary member of the collection",
            "This column specifies the LID of the files that comprise the collection."
          ],
          "pds:Inventory/pds:parsing_standard_id" : "PDS DSV 1",
          "pds:Record_Delimited/pds:groups" : "0",
          "ref_lid_instrument" : "urn:nasa:pds:context:instrument:mars2020.meda",
          "ops:Harvest_Info/ops:node_name" : "PDS_ATM",
          "pds:Inventory/pds:record_delimiter" : "Carriage-Return Line-Feed",
          "pds:Time_Coordinates/pds:stop_date_time" : "2021-08-21T18:44:56.528Z",
          "vid" : "2.0",
          "product_class" : "Product_Collection"
...
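
The two queries above differ only in the LIDVID, and hand-building the URL makes quoting mistakes easy. A small wrapper, sketched here under assumptions, lets curl do the URL encoding. The function names lidvid_query and registry_search and the REGISTRY_URL variable are hypothetical; registry_user and registry_pass are the same variables used in the commands above.

```shell
# Build the query-string expression for one LIDVID.
lidvid_query() {
  printf 'lidvid:"%s"' "$1"
}

# Hypothetical wrapper: search the registry index for one LIDVID.
# Assumes REGISTRY_URL, registry_user and registry_pass are set by the caller.
registry_search() {
  curl -s -G -u "$registry_user:$registry_pass" \
    --data-urlencode "q=$(lidvid_query "$1")" \
    --data-urlencode "pretty=true" \
    "$REGISTRY_URL/registry/_search"
}

# Example:
#   registry_search 'urn:nasa:pds:ladee_nms:document:nms_pds_sis::1.9'
```

A product is missing when hits.total.value in the response is 0, as in the first query above.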

🕵️ Expected behavior

??? Not sure why it is skipping.

📚 Version of Software Used

3.6.0

🩺 Test Data / Additional context

https://atmos.nmsu.edu/PDS/data/PDS4/LADEE/nms_bundle/document/nms_pds_sis.xml


🦄 Related requirements

⚙️ Engineering Details

jordanpadams commented 1 year ago

After discussions with ATM, closing as wontfix. This data is being fixed.