dspinellis / alexandria3k

Local relational access to openly-available publication data sets
GNU General Public License v3.0

Exception in DataCite population #51

Closed dspinellis closed 1 month ago

dspinellis commented 1 month ago

When running the following command:

 a3k --debug progress populate datacite.db datacite datacite.tar.gz

late in the process the following error occurs:

Container 28856 10.17031/part_00000.jsonl
Container 28857 10.17031/part_00001.jsonl
Traceback (most recent call last):
  File "/home/dds/src/alexandria3k/examples/datacite/../../bin/a3k", line 35, in <module>
    sys.exit(main())
  File "/home/dds/src/alexandria3k/bin/../src/alexandria3k/__main__.py", line 602, in main
    more helpful message."""
  File "/home/dds/src/alexandria3k/bin/../src/alexandria3k/__main__.py", line 591, in error_raising_main
    args.func(args)
  File "/home/dds/src/alexandria3k/bin/../src/alexandria3k/__main__.py", line 183, in populate
    data_source_instance.populate(
  File "/home/dds/src/alexandria3k/bin/../src/alexandria3k/data_source.py", line 1098, in populate
    populate_table(table, i, condition)
  File "/home/dds/src/alexandria3k/bin/../src/alexandria3k/data_source.py", line 941, in populate_table
    self.vdb.execute(log_sql(statement))
  File "src/vtable.c", line 2466, in VirtualTable.xColumn
  File "/home/dds/src/alexandria3k/bin/../src/alexandria3k/data_sources/datacite.py", line 120, in Column
    return super().Column(col)
  File "/home/dds/src/alexandria3k/bin/../src/alexandria3k/data_source.py", line 259, in Column
    return extract_function(self.current_row_value())
  File "/home/dds/src/alexandria3k/bin/../src/alexandria3k/data_source.py", line 238, in current_row_value
    return self.elements[self.element_index]
KeyError: 0

dspinellis commented 1 month ago

@evgepab is this something obvious that you can easily fix? Otherwise, I'll look at it.

dspinellis commented 1 month ago

The file buggy.tar.gz quickly replicates the problem by running the command bin/a3k populate buggy.db datacite buggy.tar.gz.
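
To find which records in such an archive have the unexpected shape, a scan along the following lines may help. This is only a sketch under stated assumptions: the archive members are taken to be `.jsonl` files with one JSON record per line (as the `part_00000.jsonl` names in the log suggest), `creators` is assumed to sit at the top level of each record, and the function name `find_dict_affiliations` is hypothetical, not part of alexandria3k.

```python
import json
import tarfile


def find_dict_affiliations(archive_path):
    """Yield (member name, line number, DOI) for each record in which
    some creator's affiliation is a dict rather than a list.

    Hypothetical helper; assumes one JSON object per line in each
    .jsonl member and a top-level "creators" list.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".jsonl"):
                continue
            stream = archive.extractfile(member)
            for line_number, line in enumerate(stream, start=1):
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                for creator in record.get("creators") or []:
                    if isinstance(creator, dict) and isinstance(
                        creator.get("affiliation"), dict
                    ):
                        yield member.name, line_number, record.get("doi")
                        break  # report each record once


# Example use (the path is whatever archive is being checked):
# for hit in find_dict_affiliations("buggy.tar.gz"):
#     print(*hit)
```

The generator reports each offending record once, so the output can be used to extract a small reproducing sample from a large dump.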

dspinellis commented 1 month ago

This seems to be the JSON data that causes the problem.

{
  "container": {},
  "reason": null,
  "formats": [],
  "fundingReferences": [],
  "prefix": "10.17031",
  "creators": [
    {
      "nameType": "Personal",
      "affiliation": {
        "name": "Marine Biological Association"
      },
      "givenName": "Clare",
      "familyName": "Ostle",
      "name": "Clare Ostle"
    }
  ],
  "registered": "2022-12-05T17:08:07Z",
  "language": null,
  "source": "api",
  "suffix": "637b5e4a8d3ae",
  "relatedItems": [],
  "descriptions": [
    {
      "descriptionType": "Abstract",
      "description": "CSV file containing CPR data. Taxa are summed within each grouping (given in headings), monthly means have been calculated for a Northeast Pacific region > 1000 m isobath. Plankton abundance counts are recorded according to standard CPR methodology, see Richardson et al. (2006, Prog in Oceanog 68. 27-74). Units are \"Number of cells per sample” for the phytoplankton groupings, and “Number of organisms per sample” for the zooplankton groupings."
    }
  ],
  "schemaVersion": null,
  "sizes": [],
  "metadataVersion": 0,
  "types": {
    "schemaOrg": "Dataset",
    "resourceTypeGeneral": "Dataset",
    "citeproc": "dataset",
    "bibtex": "misc",
    "ris": "DATA",
    "resourceType": "dataset"
  },
  "isActive": true,
  "relatedIdentifiers": [],
  "created": "2022-12-05T17:08:06Z",
  "identifiers": [],
  "subjects": [],
  "dates": [],
  "published": "2022",
  "titles": [
    {
      "title": "Monthly CPR data grouped in Northeast Pacific region (> 1000 m isobath)"
    }
  ],
  "geoLocations": [],
  "url": "https://doi.mba.ac.uk/data/2956",
  "rightsList": [
    {
      "rightsUri": "https://creativecommons.org/licenses/by-nc/4.0/",
      "rights": "Creative Commons NonCommercial 4.0 International"
    }
  ],
  "publicationYear": 2022,
  "publisher": "The Archive for Marine Species and Habitats Data (DASSH)",
  "contentUrl": null,
  "contributors": [],
  "updated": "2022-12-05T17:08:07Z",
  "doi": "10.17031/637b5e4a8d3ae",
  "alternateIdentifiers": [],
  "state": "findable",
  "version": null
}

dspinellis commented 1 month ago

Note that in correctly loaded data "affiliation" is an array:

  "creators": [
    {
      "nameType": "Personal",
      "affiliation": [
        {
          "name": "Marine Biological Association"
        }
      ],
      "givenName": "David",
      "familyName": "Johns",

whereas in the problematic data it is a dictionary.
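
This difference would explain the exception: indexing a dictionary with the integer 0 is a key lookup, so it raises `KeyError: 0`, which matches the last frame of the traceback. The sketch below reproduces that behavior and shows one possible way to tolerate both shapes. The helper name `as_list` is hypothetical, and this is not necessarily the fix adopted in alexandria3k.

```python
def as_list(value):
    """Return value as a list, wrapping a lone item (hypothetical helper)."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Shape found in the problematic record: a single dictionary.
legacy = {"name": "Marine Biological Association"}
# Shape found in correctly loaded records: a list of dictionaries.
current = [{"name": "Marine Biological Association"}]

try:
    legacy[0]  # an integer index on a dict is treated as a key
except KeyError as error:
    print("KeyError:", error)  # prints "KeyError: 0"

# After normalization both shapes can be indexed the same way.
print(as_list(legacy)[0]["name"])
print(as_list(current)[0]["name"])
```

Normalizing at the point where the record is read keeps the rest of the traversal code unchanged, at the cost of silently accepting data that does not match the expected structure.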

evgepab commented 1 month ago

I believe I can look into it! I guess this happens because the record follows an outdated metadata version, in which each creator could have only one affiliation.

dspinellis commented 1 month ago

Thanks! I woke up with a fix in mind, so I'm implementing it.