sphuber closed this pull request 2 years ago
This is now ready for review. There are just the failing links to check, but I am not sure what to do with them. A lot of the errors are from external sites (see below for an overview). What to do in these cases? Also, is there a nicer overview of the linkcheck output? Or do you really have to scan through each line (which includes the successful checks) for the failed ones? Seems like a poor user experience.
- https://www.aiida.net/aiida-tutorial-at-vilnius-university-vilnius-lithuania
- http://www.max-centre.eu/2017/07/18/prize/
- http://www.max-centre.eu/max-hackathon/
- https://psi-k.net/www.materialscloud.org
- https://www.cecam.org/wp-content/uploads/2019/04/2019_03_EPFL_materials_science_researcher_software_engineer.pdf
- https://www.swissuniversities.ch/en/themen/digitalisierung/p-5-wissenschaftliche-information
- https://onlinelibrary.wiley.com/doi/10.1002/adma.201906054
- https://onlinelibrary.wiley.com/doi/10.1002/advs.201901606
- https://onlinelibrary.wiley.com/doi/10.1002/anie.201913024
- https://onlinelibrary.wiley.com/doi/10.1002/cphc.201900283
- https://pubs.acs.org/doi/10.1021/acs.chemmater.9b02047
- https://pubs.acs.org/doi/10.1021/acs.jctc.9b00586
- https://pubs.acs.org/doi/10.1021/acs.jpcc.9b05590
- https://pubs.acs.org/doi/10.1021/acs.nanolett.9b00865
- https://pubs.acs.org/doi/10.1021/acsami.0c01659
- https://pubs.acs.org/doi/10.1021/acscentsci.9b00619
- https://pubs.acs.org/doi/10.1021/acsami.9b13220
- https://pubs.acs.org/doi/10.1021/acsnano.8b07225
- https://pubs.acs.org/doi/10.1021/jacs.8b00587
- https://pubs.acs.org/doi/10.1021/jacs.8b06210
- https://pubs.acs.org/doi/10.1021/jacs.8b10407
- https://pubs.acs.org/doi/10.1021/jacs.9b04718
- https://pubs.acs.org/doi/10.1021/jacs.9b05319
- https://pubs.acs.org/doi/10.1021/jacs.9b05335
- https://pubs.acs.org/doi/10.1021/jacs.9b05501
- https://aip.scitation.org/doi/full/10.1063/5.0005077
- https://pubs.acs.org/doi/10.1021/acscatal.9b04952
- https://pubs.acs.org/doi/10.1021/acscentsci.9b00619
- https://onlinelibrary.wiley.com/doi/full/10.1002/adfm.202001984
- https://onlinelibrary.wiley.com/doi/full/10.1002/advs.201901606
- https://pubs.acs.org/doi/10.1021/acs.nanolett.0c02077
- https://pubs.acs.org/doi/abs/10.1021/acscentsci.0c00988
- https://pubs.acs.org/doi/full/10.1021/acs.chemmater.0c02698
- https://pubs.acs.org/doi/full/10.1021/acs.jpcc.0c04596
- https://www.sciencedirect.com/science/article/pii/S0167273819309038
- https://www.sciencedirect.com/science/article/pii/S092702562030656X
- https://www.sciencedirect.com/science/article/pii/S2589152920300922
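As a partial answer to the overview question, the failures can be filtered out of the console log; a minimal sketch, assuming the default linkcheck console format where failed checks are marked with the word "broken" (Sphinx's linkcheck builder also writes its findings to `output.txt` in the build directory, which may already give a shorter overview):

```python
# Sketch: reduce a Sphinx linkcheck console log to just the failing lines.
# Assumes the default output format, where failures contain the word "broken".

def failed_links(log_text: str) -> list:
    """Return only the log lines that report a broken link."""
    return [line for line in log_text.splitlines() if "broken" in line]

# A made-up log in the shape linkcheck prints to the console:
sample_log = """\
(news/index: line 12) ok        https://example.org/good
(news/index: line 34) broken    http://www.max-centre.eu/max-hackathon/ - 404 Client Error
(news/index: line 56) ok        https://example.org/also-good
"""

for line in failed_links(sample_log):
    print(line)
```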
Thanks @sphuber! I'll let @chrisjsewell check this PR and comment on the links etc.
> 503
Service unavailable is probably a temporary error. Maybe we should ignore those (just issue a warning?). Most of those work for me.

For the 404s, I would try to fix them; or, for those where the link does not exist anymore, edit the page to remove the link (or convert the link to text only and say "The page used to be at: XXX").
Thanks @sphuber, this is definitely very helpful. However:

> Any transformations that are missing in the automated script?
there is indeed a critical omission; all the images only link to the existing aiida.net, which will obviously break once it is moved, e.g. [![Results of the feedback form for the AiiDA Coding week (Dec 2016)](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png)](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png).
> Missing category: this seems to have been added manually in the entries that were migrated manually. This information doesn't seem to be present in the database.

Yes indeed, all the categories are manually added, but they are critical for the working of the new blog sections, so I can't accept this PR without them being added.
There were also other manual changes to the existing documents that have been added, which are not covered by these auto-conversions. So I would ask you to also remove all auto-conversions for all existing documents.
The automated script itself also does not need to be added to the repository; since this is a one-time operation, we can just post the code in this PR for posterity.
> there is indeed a critical omission; all the images only link to the existing aiida.net, which will obviously break once moved, e.g.
> [![Results of the feedback form for the AiiDA Coding week (Dec 2016)](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png)](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png).
Will have a look at adding these, either semi-automated or manually.
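For the semi-automated route, a hypothetical helper along these lines could rewrite the absolute aiida.net image URLs in the converted markdown to local relative paths (the `pics/` target folder and the flattened file naming are assumptions; the real migration would also need to download each image file):

```python
import re

# Matches absolute image URLs under aiida.net/wp-content/uploads and captures
# the path after "uploads/" (e.g. "2016/12/name.png").
AIIDA_IMAGE = re.compile(
    r"https?://(?:www\.)?aiida\.net/wp-content/uploads/(\S+?\.(?:png|jpe?g|gif))"
)

def localize_images(markdown: str) -> str:
    """Rewrite absolute aiida.net image URLs to local paths under ``pics/``."""
    # Flatten "2016/12/name.png" to "2016-12-name.png" so all images can live
    # in a single folder (this naming scheme is an assumption, not the PR's).
    return AIIDA_IMAGE.sub(lambda m: "pics/" + m.group(1).replace("/", "-"), markdown)

example = "![results](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png)"
print(localize_images(example))  # ![results](pics/2016-12-aiida_coding_week_2016_results.png)
```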
> > Missing category: this seems to have been added manually in the entries that were migrated manually. This information doesn't seem to be present in the database.
>
> Yes indeed, all the categories are manually added, but they are critical for the working of the new blog sections, so I can't accept this PR without them being added.
This we can easily distribute; it would be good if others could pitch in and add them through commits to this PR.
> There were also other manual changes to the existing documents that have been added, which are not covered by these auto-conversions. So I would ask you to also remove all auto-conversions for all existing documents.
What conversions were these? If they can be added to the automated script, I will add them. I think it is worthwhile to have everything migrated in a single, consistent way.
> The automated script itself also does not need to be added to the repository, since this is a one-time operation; we can just post the code in this PR for posterity.
Fair. I will keep it in the PR and, once accepted, remove it and post the final version here.
superseded by #25
For posterity, this was the script used to generate the markdown from the SQL database contents that had been converted to JSON:
```python
#!/usr/bin/env python
"""Parse the contents of the MySQL database dump of the original AiiDA website into markdown files.

The contents of the MySQL database were dumped to a ``.sql`` file which was then converted to JSON. From this JSON, the
contents of the ``wp_posts`` key were written to the ``data.json`` file. The content of each news entry is HTML, which
is converted to markdown using the ``markdownify`` library.

The converted news entries are written as markdown files to the ``../docs/news/posts`` folder. In addition, a file
``urls.json`` is written to the working directory, which contains a mapping from the old URLs to the new ones. This
will allow automatic redirects to be configured in the web server.
"""
import datetime
import json
import pathlib
import re
import textwrap

import markdownify


def main():
    with open("data.json") as handle:
        data = json.load(handle)

    entries = {}
    revisions = {}
    url_mapping = {}

    for entry in data:
        if entry["post_type"] in ["news", "post"]:
            entries[entry["ID"]] = entry
        elif entry["post_type"] == "revision":
            revisions[entry["ID"]] = entry

    for revision in revisions.values():
        parent_id = revision["post_parent"]

        if parent_id not in entries:
            continue

        title = revision["post_title"]
        content = revision["post_content"]

        try:
            ctime = datetime.datetime.strptime(revision["post_date_gmt"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            ctime = datetime.datetime.strptime(revision["post_modified_gmt"], "%Y-%m-%d %H:%M:%S")

        entries[parent_id].setdefault("revisions", []).append((ctime, title, content))

    basepath = pathlib.Path(__file__).parent.parent / "docs" / "news" / "posts"
    basepath.mkdir(exist_ok=True, parents=True)

    for pk, entry in entries.items():
        if "revisions" in entry:
            latest_revision = sorted(entry["revisions"], key=lambda x: x[0])[-1]
            content = latest_revision[-1]
        else:
            content = entry["post_content"]

        if not content.strip():
            continue

        try:
            ctime = datetime.datetime.strptime(entry["post_date_gmt"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            ctime = datetime.datetime.strptime(entry["post_modified_gmt"], "%Y-%m-%d %H:%M:%S")

        name = (
            entry["post_name"]
            if entry["post_name"]
            else "-".join([e.lower() for e in entry["post_title"].split()])
        )
        short_name = "-".join(name.split("-")[:4])
        title = entry["post_title"]
        date = f"{ctime:%Y-%m-%d}"

        header = textwrap.dedent(
            f"""
            ---
            blogpost: true
            category:
            tags:
            date: {date}
            ---
            """
        ).lstrip()

        substitutions = (
            ("\\r\\n", "\n"),
            ("\\n", "\n"),
            ("\\", ""),
            ("“", '"'),
            ("”", '"'),
        )

        for pattern, replacement in substitutions:
            title = title.replace(pattern, replacement)
            content = content.replace(pattern, replacement)

        markdown = markdownify.markdownify(content)  # Convert HTML to markdown
        markdown = re.sub(r"\n\s+\n", "\n\n", markdown)  # Remove whitespace-only lines
        markdown = re.sub(r"\s+\n", "\n", markdown)  # Remove trailing whitespace on lines
        markdown = re.sub(r"\n+", "\n\n", markdown)  # Normalize consecutive linebreaks
        markdown = markdown.replace("\xa0", " ")  # Replace literal non-breaking spaces

        filepath = basepath / f"{ctime:%Y-%m-%d}-{short_name}.md"

        with open(filepath, "w") as handle:
            handle.write(f"{header}\n")
            handle.write(f"# {title}\n\n")
            handle.write(f"{markdown.strip()}\n")

        url_old = f"news/{name}"
        url_new = f'news/posts/{filepath.name.replace(".md", ".html")}'
        url_mapping[url_old] = url_new

    with open("urls.json", "w") as handle:
        json.dump(url_mapping, handle, indent=4)
        handle.write("\n")


if __name__ == "__main__":
    main()
```
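To close the loop on the redirects, the `urls.json` mapping the script writes could be turned into web-server rules; a sketch using nginx `rewrite` syntax (the choice of nginx and the exact rule format are assumptions about the server setup):

```python
def nginx_redirects(url_mapping: dict) -> str:
    """Render an old-URL to new-URL mapping as nginx rewrite rules."""
    return "\n".join(
        f"rewrite ^/{old}/?$ /{new} permanent;"
        for old, new in sorted(url_mapping.items())
    )

# Example with a made-up entry in the shape the script produces:
mapping = {"news/aiida-coding-week": "news/posts/2016-12-19-aiida-coding-week.html"}
print(nginx_redirects(mapping))
# rewrite ^/news/aiida-coding-week/?$ /news/posts/2016-12-19-aiida-coding-week.html permanent;
```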
Add a script to automatically convert the news entries from the MySQL dump of the original database to markdown files. The only thing missing so far, as far as I am aware, is `data/urls.json`. I saw that the current mapping is kept at `docs/legacy_redirect.json`. I chose not to write directly to this file, as in the future it could contain other redirects that were added manually. I think we can add the categories and tags manually, but then the state of the entries should be pretty close to the manually migrated ones. Any transformations that are missing in the automated script?