CatalogueOfLife / data

Repository for COL content

Develop metadata patch from June edition to apply to July edition #150

Closed yroskov closed 4 years ago

yroskov commented 4 years ago

@mdoering Markus, thinking about the re-import & re-sync of all GSDs: a few months ago, I corrected metadata in the Clearinghouse for many GSDs (names, versions, release dates). For example, our first monthly Beta contained a lot of mistakes in that metadata (some of it was incorrectly taken from AC19 GSDs). We need to avoid corrupting the metadata. Could you please preserve the June 2020 metadata and re-apply it to all re-imported GSDs in the Clearinghouse?

gdower commented 4 years ago

@mdoering, I could write an API script to copy the metadata from June monthly release 2140 to draft 3.

mdoering commented 4 years ago

Where exactly did you fix the metadata, Yuri? In the source dataset's metadata (which will get overwritten, and already has for the ones that went through) or in the release itself (in which case the changes are archived)?

We have set up the system to use metadata patches so that projects can override metadata. Please never change the original source metadata: it is designed to stay outside the control of any specific project.

@gdower as there is no UI to manage the metadata patches (@thomasstjerne), should we extract the June metadata from the release and create a patch for it through the API with scripts? @yroskov do you remember which fields needed to be updated? Creating a patch for all fields would mean they stay eternally the same, which I don't think we want. Is it title, alias and authors?

yroskov commented 4 years ago

I have made all metadata changes in the Clearinghouse

gdower commented 4 years ago

I could compare the June metadata against what is currently in the Clearinghouse and develop a metadata patch for just the changed data.

mdoering commented 4 years ago

Sounds like a plan to me. I might need to add a new API method to read the archived metadata for a release.

yroskov commented 4 years ago

From my personal log:

_2020-04-07

Incorrect Release Date in 14 GSDs CatalogueOfLife/backend#120 https://github.com/CatalogueOfLife/data/issues/120

Incorrect Release Date as "2020-04-03" (= CoL conversion date) occurs in 14 GSDs (http://dev4.species.id:9191/col_plus/). @gdower reported GSD IDs: 9, 12, 15, 19, 29, 34, 55, 63, 69, 70, 78, 128, 175. @yroskov: the field Received by CoL is empty in the metadata of these GSDs (why?). @yroskov manually corrected the dates in all 14 GSDs (correct dates were taken from AC19 & AC18)._

However, that is only a portion of the changes. I cannot remember all of them, because I rely on the Clearinghouse as production software with a final view of all data and all necessary technical logs. After monthly releases resumed (2020-02-24), I didn't think that we would need to start all over again from the beginning.

gdower commented 4 years ago

Ah, there is no endpoint for viewing the embedded dataset metadata within a project. I could just take the metadata from the zip export and compare it against the dataset's metadata in the API.
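A minimal sketch of that comparison, assuming the export zip bundles a metadata JSON file (the member name, API base URL and dataset GET endpoint here are assumptions, not the confirmed layout):

```python
# Sketch: compare archived metadata from the June export zip against the
# dataset's current metadata from the API, keeping only the fields that differ.
import json
import zipfile
import urllib.request

API_BASE = "http://api.catalogueoflife.org"  # assumed base URL


def load_archived_metadata(zip_path: str, member: str = "metadata.json") -> dict:
    """Read the dataset metadata JSON bundled inside the export archive."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            return json.load(fh)


def load_current_metadata(dataset_key: int) -> dict:
    """Fetch the dataset's current metadata from the API."""
    with urllib.request.urlopen(f"{API_BASE}/dataset/{dataset_key}") as resp:
        return json.load(resp)


def diff_metadata(archived: dict, current: dict, fields: list) -> dict:
    """Return only the watched fields whose archived value differs."""
    return {k: archived[k] for k in fields
            if k in archived and archived.get(k) != current.get(k)}
```

Restricting the diff to a whitelist of fields (e.g. title, alias, authorsAndEditors) keeps the resulting patch from freezing metadata we still want the source to control.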

gdower commented 4 years ago

I'm going to transfer this to data, so that I can put it on the project board for the July release.

mdoering commented 4 years ago

Grrr, we never checked this. A bug in the release code prevented committing the inserts, so no source metadata has ever been archived! I've fixed it now and tested it on dev, so we will be good for the next release.

But it means I cannot recover the June metadata. @gdower let's try to use the export archive and create patches from it by comparing the archived metadata against each dataset's current metadata.

The API to create patches with is: POST /dataset/3/patch

with a request body more or less like a standard dataset object, e.g. FishBase:

{
  "created": "2019-11-20T11:07:46.255777",
  "createdBy": 0,
  "modified": "2020-07-01T16:22:48.578376",
  "modifiedBy": 103,
  "key": 1010,
  "importAttempt": 7,
  "type": "taxonomic",
  "origin": "external",
  "title": "FishBase",
  "alias": "FishBase",
  "description": "FishBase is a global information system with extensive information on all species and subspecies of fish. In addition to the data contributed to the Catalogue of Life, the original FishBase database also includes descriptive, biological, ecological, physiological and conservation data and more, and onward links to information in many other databases. Data entry and maintenance is done mainly at the FishBase Information and Research Group, Inc. (FIN) in the Philippines since the 1st January 2011 in collaboration with many colleagues and institutions around the world. The team was previously hosted in WorldFish since the start in 1990 (formerly ICLARM then called WorldFish Center between 2001-2013, and now WorldFish since 2013). FishBase is supported by a Consortium of nine institutions around the world that acts as the Scientific Committee for FIN. FishBase is funded mainly by the European Commission, and by several other donors. FishBase uses data and information from the Catalog of Fishes (CofF) developed by W.N. Eschmeyer at the California Academy of Science. In particular, FishBase uses CofF as the most complete and up-to-date fish nomenclator, and as a taxonomic authority list in a collaborative work of synchronization of the two databases.",
  "organisations": [
    "FIN; WorldFish; FAO; IFM-GEOMAR; MNHN; RMCA; NRM; FC-UBC; AUTH; CAFS"
  ],
  "contact": "R Froese (FishBase Consortium Scientific Coordinator)",
  "authorsAndEditors": [
    "Froese R. & Pauly D. (eds)."
  ],
  "version": "Feb 2018",
  "released": "2018-02-15",
  "website": "http://www.fishbase.org",
  "group": "Fishes",
  "confidence": 4,
  "completeness": 99
}

But it should be trimmed down to just the properties that we want to be patched, e.g.:

{
  "key": 1010,
  "title": "FishBase",
  "alias": "FishBase",
  "contact": "R Froese (FishBase Consortium Scientific Coordinator)",
  "authorsAndEditors": [
    "Froese R. & Pauly D. (eds)."
  ]
}

mdoering commented 4 years ago

the key property is required
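A sketch of submitting such a trimmed patch, assuming the POST /dataset/3/patch endpoint above (the base URL is an assumption, and any required authentication is omitted):

```python
# Sketch: build a minimal metadata patch (the required "key" plus only the
# changed fields) and POST it to the project's patch endpoint.
import json
import urllib.request

API_BASE = "http://api.catalogueoflife.org"  # assumed base URL


def build_patch(dataset_key: int, changed: dict) -> dict:
    """The 'key' property is required; include only the fields to override."""
    return {"key": dataset_key, **changed}


def post_patch(project_key: int, patch: dict) -> None:
    """POST the patch to /dataset/{project_key}/patch as JSON."""
    req = urllib.request.Request(
        f"{API_BASE}/dataset/{project_key}/patch",
        data=json.dumps(patch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```

For the FishBase example above, build_patch(1010, {"title": "FishBase", "alias": "FishBase"}) would produce the trimmed body shown, ready to POST to /dataset/3/patch.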

gdower commented 4 years ago

@yroskov, the metadata patches were applied to prod, so we just need to test it on our next conversion.