Changes based on referenced DOI guidelines also mentioned in #587:
Drop `dx` from the domain portion of DOI links
Use HTTPS instead of HTTP
Update alphanumeric formats (unlinked strings of the form `doi: [doinumber]`) to full URL form, as mentioned in #587
Add `https://doi.org/` in front of DOI numbers (prefixes start with `10.`)
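The normalization rules above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `normalize_doi` and the choice of Python are assumptions, and values that do not reduce to a bare DOI starting with `10.` are returned unchanged so they can be fixed by hand.

```python
import re

DOI_URL_PREFIX = "https://doi.org/"


def normalize_doi(raw):
    """Hypothetical helper: return a DOI in https://doi.org/ form.

    Returns None for missing or empty input. Values that are not
    recognizable as a DOI are returned stripped but otherwise unchanged.
    """
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    # Remove a leading "doi:" label, e.g. "doi: 10.1000/xyz".
    bare = re.sub(r"^doi:\s*", "", text, flags=re.IGNORECASE)
    # Remove an existing resolver prefix, with or without "dx." and over http or https.
    bare = re.sub(r"^https?://(dx\.)?doi\.org/", "", bare, flags=re.IGNORECASE)
    # Only prepend the resolver when what remains looks like a DOI prefix.
    if bare.startswith("10."):
        return DOI_URL_PREFIX + bare
    return text
```

For example, `normalize_doi("http://dx.doi.org/10.1000/xyz")` and `normalize_doi("doi: 10.1000/xyz")` both give `https://doi.org/10.1000/xyz`.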
Additional changes:
Added a warning in the script for DOIs that are not normalized to the guidelines above
The warning currently shows for CAIDA papers that still need to be fixed
Changed DOIs that were empty strings (`""`) to `null` for consistency
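A minimal sketch of what such a check might look like, assuming a Python script; the function name `check_doi`, the `paper_id` parameter, and the use of the `warnings` module are hypothetical and not taken from the actual script:

```python
import warnings

NORMALIZED_PREFIX = "https://doi.org/10."


def check_doi(doi, paper_id):
    """Hypothetical check: map "" to None and warn on non-normalized DOIs."""
    # Treat an empty string the same as a missing DOI (null in JSON).
    if doi is None or doi == "":
        return None
    # Warn, but do not modify, so exceptions can be fixed manually in the data.
    if not doi.startswith(NORMALIZED_PREFIX):
        warnings.warn(f"{paper_id}: DOI is not in normalized form: {doi!r}")
    return doi
```

The check only reports problems and leaves the value as it is, which matches the decision to correct exceptions in the data by hand rather than in the script.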
Decisions I did not proceed with:
There are regexes online for checking valid DOIs (like `/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i`), but they did not work
DOI prefixes match the repository that the corresponding paper/data is stored in, but I did not modify the script to correct DOI exceptions
There is no comprehensive list or pattern for matching a paper with its DOI prefix, since that data is held by various DOI registration platforms (e.g. Crossref, DataCite)
Ultimately, I changed the DOIs in the data manually to fix exceptions: links to publishing sites not in `doi.org` form, incorrect DOIs, typos, and DOI numbers missing their prefix (suffix only)