NASA-PDS / doi-service

Service and tools for generating DOIs for PDS bundles, collections, and data sets
https://nasa-pds.github.io/doi-service

Unable to generate / export json report of DOI metadata #397

Closed rsjoyner closed 1 year ago

rsjoyner commented 1 year ago

🐛 Describe the bug

When attempting to generate a report of all DOIs between two dates, no transactions are listed in the output file:

pds-doi-cmd list -start 1990-01-01 -end 2022-12-27 -f label > DataCite_dump_results_20221227.json

This generates a zero-length file.

📜 To Reproduce

Steps to reproduce the behavior:

  1. Enter command: pds-doi-cmd list -start 1990-01-01 -end 2022-12-27 -f label > DataCite_dump_results_20221227.json
  2. Service generates ERROR
  3. See error

Traceback (most recent call last):
  File "/home/pds4/pds-doi-service/bin/pds-doi-cmd", line 8, in <module>
    sys.exit(main())
  File "/home/pds4/pds-doi-service/lib/python3.9/site-packages/pds_doi_service/core/cmd/pds_doi_cmd.py", line 42, in main
    output = action.run(**kwargs)
  File "/home/pds4/pds-doi-service/lib/python3.9/site-packages/pds_doi_service/core/actions/list.py", line 340, in run
    dois, _ = self._web_parser.parse_dois_from_label(label_contents)
  File "/home/pds4/pds-doi-service/lib/python3.9/site-packages/pds_doi_service/core/outputs/datacite/datacite_web_parser.py", line 354, in parse_dois_from_label
    datacite_records = json.loads(label_text)["data"]
  File "/usr/local/python-3.9.5/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/local/python-3.9.5/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/python-3.9.5/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 103 column 199 (char 4429)

🕵️ Expected behavior

The expected behavior is that a valid set of DOI metadata is written to the output file.

📚 Version of Software Used

pds-doi-service==2.1.3

alexdunnjpl commented 1 year ago

Based on the stack trace and confirmation that the DOI db in question has data, I suspect bad data.

Will need access to prod box or a copy of prod's path/to/pds-doi-service/transaction_history to perform further troubleshooting, which is likely worthwhile both for user and for project as bad labels should (presumably) fail during reserve/release.
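For reference, a minimal sketch of how that troubleshooting pass could look once a copy of transaction_history is available. It assumes each transaction's label is stored as a file named output.json somewhere below that directory (consistent with the path reported later in this thread); find_malformed_labels is a hypothetical helper name, not part of pds-doi-service.

```python
import json
from pathlib import Path


def find_malformed_labels(history_root):
    """Return (path, error) pairs for every output.json that fails to parse.

    Assumption: labels are stored as files named output.json somewhere
    below history_root. This helper is illustrative only.
    """
    failures = []
    for label_path in sorted(Path(history_root).rglob("output.json")):
        try:
            json.loads(label_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as err:
            # err.lineno / err.colno point at the first offending character,
            # matching the "line N column M" text in the decoder's message.
            failures.append((label_path, err))
    return failures
```

Calling find_malformed_labels("path/to/pds-doi-service/transaction_history") and printing each path with err.msg, err.lineno, and err.colno should narrow the failure down to individual labels without running the full list action.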

tloubrieu-jpl commented 1 year ago

@rsjoyner when you have a chance, can you share the SQLite .db file that was used when you saw the error?

Thanks

alexdunnjpl commented 1 year ago

@tloubrieu-jpl I'll LFT you a copy of the sqlite db which Ron provided, but my guess is that we'll need the transactions' json to get any further on this

alexdunnjpl commented 1 year ago

Unable to reproduce on pds@pdscloud-prod1 using the instance at /home/pds4/pds-doi-service.

Awaiting confirmation from @rsjoyner wrt whether

alexdunnjpl commented 1 year ago

The offending DOI has id 10.26033/3k3c-5713 and refers to product urn:nasa:pds:satellite-phoebe.cassini.shape-models-maps::1.0. The label file is located at pds-prod1:/home/pds4/pds-doi-service/transaction_history/unk/10.26033/3k3c-5713/2022-12-14T20:06:48+00:00/output.json and contains the following malformed JSON attribute (N.B. the non-escaped double quotes inside the description string):

"descriptions": [
                    {
                        "description": "This bundle contains a shape model for the Saturnian moon Phoebe, along with quality assessment data. The global model is similar to the previously archived "Gaskell Phoebe Shape Model", but is provided in multiple formats.",
                        "descriptionType": "Abstract",
                        "lang": "en"
                    }
                ]

@jordanpadams this should unblock @rsjoyner for the time being, but I should loop back to this later and ensure that we aren't parsing values from XML labels without escaping them for JSON write. If that's not the case, and the problem is in the source label data (unlikely), then we'll need to make a decision about how doi-service should handle that.

@rsjoyner are you able to provide a copy of the input label for urn:nasa:pds:satellite-phoebe.cassini.shape-models-maps::1.0?
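To illustrate the failure mode, here is a minimal sketch using only the Python standard library. It is not a claim about how doi-service actually writes labels: the naive string interpolation below is a hypothetical stand-in for any code path that places a raw string into JSON text without escaping, and the parses helper is introduced here only for the demonstration.

```python
import json

# The description text from the offending label, with its inner double quotes.
description = (
    "The global model is similar to the previously archived "
    '"Gaskell Phoebe Shape Model", but is provided in multiple formats.'
)

# Hypothetical unsafe write: the inner quotes end the JSON string early.
naive_label = '{"description": "%s"}' % description


def parses(text):
    """Return True if text is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# json.dumps escapes the inner quotes as \" so the value round-trips intact.
safe_label = json.dumps({"description": description})

print(parses(naive_label))  # the unescaped variant fails to parse
print(parses(safe_label))   # the serialized variant parses
```

Serializing through json.dumps (or an equivalent serializer) rather than assembling JSON text by hand is the usual way to avoid this class of error, whatever the source of the string.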

rsjoyner commented 1 year ago

Based on the DOI ("doi": "10.26033/ehkj-xj95"), given that the base DOI for the DOI service is "10.17189", that the Node is SBN, and that I can't locate the original XML file, I suspect that this DOI was NOT minted by the EN DOI service.

Was there a bulk "merge" of SBN DOIs on "updated": "2022-02-08T18:07:35.000000Z"? Or am I just confused once again?

rsjoyner commented 1 year ago

Note that the DOI value and the description in the errant DOI (above) do not match the metadata in my "dump results".
The URN also differs from the errant one: "identifier": "urn:nasa:pds:gaskell.phoebe.shape-model::1.0"

This is very strange to me. Here is the DOI metadata that has: "title": "Gaskell Phoebe Shape Model Bundle V1.0". This is the only record having "Gaskell".

    {
        "id": "10.26033/ehkj-xj95",
        "type": "dois",
        "attributes": {
            "doi": "10.26033/ehkj-xj95",
            "suffix": "ehkj-xj95",
            "identifiers": [
                {
                    "identifier": "urn:nasa:pds:gaskell.phoebe.shape-model::1.0",
                    "identifierType": "PDS4 Bundle LIDVID"
                }
            ],
            "creators": [
                {
                    "nameType": "Personal",
                    "name": "Robert W. Gaskell",
                    "nameIdentifiers": [
                        {
                            "schemeUri": "https://orcid.org",
                            "nameIdentifier": "https://orcid.org/0000-0002-2293-7879",
                            "nameIdentifierScheme": "ORCID"
                        }
                    ]
                }
            ],
            "titles": [
                {
                    "title": "Gaskell Phoebe Shape Model Bundle V1.0",
                    "lang": "en"
                }
            ],
            "publisher": "NASA Planetary Data System",
            "publicationYear": "2020",
            "subjects": [
                { "subject": "Saturnian satellites" }
            ],
            "contributors": [
                {
                    "nameType": "Organizational",
                    "name": "Planetary Data System: PDS Small Bodies Node",
                    "contributorType": "DataCurator"
                }
            ],
            "types": {
                "resourceTypeGeneral": "Dataset",
                "resourceType": "PDS4 Refereed Data Bundle"
            },
            "relatedIdentifiers": [
            ],
            "descriptions": [
                {
                    "description": "The shape model of Phoebe derived by Robert Gaskell from Cassini images. The model is provided in the implicitly connected quadrilateral (ICQ) format. This version of the model was prepared on August 4, 2012. Vertex-facet versions of the models are also provided.",
                    "descriptionType": "Abstract",
                    "lang": "en"
                }
            ],
            "url": "https://sbn.psi.edu/pds/resource/phoebeshape.html",
            "created": "2021-05-21T21:46:41.000000Z",
            "updated": "2022-02-08T18:12:24.000000Z",
            "state": "findable",
            "language": "en",
            "schemaVersion": "http://datacite.org/schema/kernel-4"
        }
    },

jordanpadams commented 1 year ago

Will create a new ticket to better handle escaping of quotes in input data.