Recommended identifiers to capture for ingested spectra

@kelle and I had a conversation at AAS241 about the different ways data can be harvested and ingested into the database. This conversation touched on all my experiences as an author, reader, and data editor, as well as some experiences with the future of data sharing.

After thinking about it, I might recommend capturing two different identifiers for each spectrum harvested into the database. The first is the Reference Publication DOI, as even arXiv preprints have DOIs now. The second is the Source PID (persistent identifier), which represents where the spectrum came from. It doesn't have to be a DOI; it could identify a person (e.g., an ORCID iD) or a website (e.g., an archived URL).
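To make the proposal concrete, here is a minimal sketch of what a per-spectrum provenance record could look like. The class name `SpectrumProvenance`, its field names, and the `spectrum_name` value are all hypothetical and are not taken from the existing SIMPLE-db schema; the two identifiers are the ones from the real Zenodo example in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpectrumProvenance:
    """Hypothetical provenance record; not an existing SIMPLE-db table."""

    spectrum_name: str
    # DOI of the paper that describes the spectrum; None if unpublished.
    reference_doi: Optional[str]
    # Where the data were actually obtained: a DOI, an ORCID iD, or an archived URL.
    source_pid: str

    def same_source_as_reference(self) -> bool:
        """True when the data came directly from the reference publication."""
        return self.reference_doi is not None and self.reference_doi == self.source_pid


# Data described in a paper but harvested from a separate repository deposit.
record = SpectrumProvenance(
    spectrum_name="example-spectrum",  # placeholder name, assumed for illustration
    reference_doi="https://doi.org/10.1093/mnras/stac1412",
    source_pid="https://doi.org/10.5281/zenodo.6082001",
)
print(record.same_source_as_reference())
```

Keeping the two fields separate means a spectrum with no reference publication can still carry a Source PID, and a spectrum taken straight from a paper simply has the same value in both fields.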

To illustrate the different ways these two identifiers interact, I wrote down some examples. The first set are real, using real articles and real data. The second set are made-up examples, but ones that I know must exist based on paging through this repo's issue discussions.

I hope this is helpful, and I am interested in the perceived feasibility of capturing these two IDs during ingest. Before you read through all these examples: it may be that your database already captures these two pieces of provenance. In that case, apologies for the issue spam! Finally, I think that archiving copies of the database is a separate (but super important) issue.

Example 1. Data taken from a publication as Data behind the Figure (DbF) or supplemental material (same DOIs)
Reference Publication: https://doi.org/10.3847/1538-3881/aa9d8a
Source PID: https://doi.org/10.3847/1538-3881/aa9d8a

Example 2. Data provided by an author to your database after publication
Reference Publication: https://doi.org/10.1088/0004-6256/137/2/3345
Source PID: https://orcid.org/0000-0002-1821-0650 [submitted]

Example 3. Data published in a Reference Publication but deposited by the author somewhere else (e.g., not as DbFs), then harvested to your archive from that source (using DOIs; this is a real example)
Reference Publication: https://doi.org/10.1093/mnras/stac1412
Source PID: https://doi.org/10.5281/zenodo.6082001

Example 4. Data published in a Reference Publication but deposited on a website and scraped into your database from that source
Reference Publication: https://doi.org/10.1086/324033
Source PID: https://web.archive.org/web/20220130143250/http://web.mit.edu/ajb/www/tdwarf/

The following examples don't have actual working URLs but I'm sure these outcomes exist:

Example 5. Data deposited to an archive and harvested to your database, but with no reference publication
Reference Publication: Author et al. (unpublished)
Source PID: https://doi.org/10.5555/foobar.9999v2

Example 6. Data not published anywhere else but submitted by a person because they felt guilty about never publishing a really useful spectrum and don't understand repositories
Reference Publication: Author et al. (unpublished)
Source PID: https://orcid.org/0000-0003-0666-6367 [submitted]

Example 7. Data not published anywhere else but scraped from a website
Reference Publication: Author et al. (unpublished)
Source PID: https://web.archive.org/web/20630405143250/http://example.com/spectra/KellesBestSpectrumof40Eridani.csv
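The seven examples use only three kinds of Source PID: a DOI, an ORCID iD, and a Wayback Machine URL. As a rough sketch of how an ingest script might tell them apart, the function below classifies a Source PID by its host name. The function name `classify_source_pid` and the category labels are hypothetical, and the sketch assumes that any trailing note such as "[submitted]" is separated from the identifier by whitespace.

```python
from urllib.parse import urlparse

# Host name of each identifier kind used in the examples above.
PID_KINDS = {
    "doi.org": "doi",
    "orcid.org": "orcid",
    "web.archive.org": "archived_url",
}


def classify_source_pid(pid: str) -> str:
    """Return a coarse label for a Source PID, or "other" if unrecognized."""
    parts = pid.split()
    if not parts:
        return "other"
    # Keep only the identifier itself, dropping notes like "[submitted]".
    host = urlparse(parts[0]).netloc.lower()
    return PID_KINDS.get(host, "other")


source_pids = [
    "https://doi.org/10.3847/1538-3881/aa9d8a",
    "https://orcid.org/0000-0002-1821-0650 [submitted]",
    "https://doi.org/10.5281/zenodo.6082001",
    "https://web.archive.org/web/20220130143250/http://web.mit.edu/ajb/www/tdwarf/",
]
for pid in source_pids:
    print(classify_source_pid(pid), pid)
```

A check like this only looks at the form of the identifier; it does not confirm that the DOI resolves or that the archived page still holds the data.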
