aiidateam / aiida-website

The primary website for AiiDA
https://aiida.net
MIT License

Automatically converted news entries to markdown format #21

Closed · sphuber closed this 2 years ago

sphuber commented 2 years ago

Add a script to automatically convert the news entries from the MySQL dump of the original database to markdown files. The only thing missing so far, as far as I am aware:

- Missing category: this seems to have been added manually in the entries that were migrated manually. This information doesn't seem to be present in the database.

I think we can add the categories and tags manually, but then the state of the entries should be pretty close to the manually migrated ones. Any transformations that are missing in the automated script?

sphuber commented 2 years ago

This is now ready for review. There are just the failing links to check, but I am not sure what to do with them. A lot of the errors are from external sites (see below for an overview). What to do in these cases? Also, is there a nicer overview of the linkcheck output? Or do you really have to scan through each line (which includes the successful checks) for the failed ones? That seems like a poor user experience.

404 (Not Found):

- https://www.aiida.net/aiida-tutorial-at-vilnius-university-vilnius-lithuania
- http://www.max-centre.eu/2017/07/18/prize/
- http://www.max-centre.eu/max-hackathon/
- https://psi-k.net/www.materialscloud.org
- https://www.cecam.org/wp-content/uploads/2019/04/2019_03_EPFL_materials_science_researcher_software_engineer.pdf
- https://www.swissuniversities.ch/en/themen/digitalisierung/p-5-wissenschaftliche-information

503 (Service Unavailable):

- https://onlinelibrary.wiley.com/doi/10.1002/adma.201906054
- https://onlinelibrary.wiley.com/doi/10.1002/advs.201901606
- https://onlinelibrary.wiley.com/doi/10.1002/anie.201913024
- https://onlinelibrary.wiley.com/doi/10.1002/cphc.201900283
- https://pubs.acs.org/doi/10.1021/acs.chemmater.9b02047
- https://pubs.acs.org/doi/10.1021/acs.jctc.9b00586
- https://pubs.acs.org/doi/10.1021/acs.jpcc.9b05590
- https://pubs.acs.org/doi/10.1021/acs.nanolett.9b00865
- https://pubs.acs.org/doi/10.1021/acsami.0c01659
- https://pubs.acs.org/doi/10.1021/acscentsci.9b00619
- https://pubs.acs.org/doi/10.1021/acsami.9b13220
- https://pubs.acs.org/doi/10.1021/acsnano.8b07225
- https://pubs.acs.org/doi/10.1021/jacs.8b00587
- https://pubs.acs.org/doi/10.1021/jacs.8b06210
- https://pubs.acs.org/doi/10.1021/jacs.8b10407
- https://pubs.acs.org/doi/10.1021/jacs.9b04718
- https://pubs.acs.org/doi/10.1021/jacs.9b05319
- https://pubs.acs.org/doi/10.1021/jacs.9b05335
- https://pubs.acs.org/doi/10.1021/jacs.9b05501
- https://aip.scitation.org/doi/full/10.1063/5.0005077
- https://pubs.acs.org/doi/10.1021/acscatal.9b04952
- https://pubs.acs.org/doi/10.1021/acscentsci.9b00619
- https://onlinelibrary.wiley.com/doi/full/10.1002/adfm.202001984
- https://onlinelibrary.wiley.com/doi/full/10.1002/advs.201901606
- https://pubs.acs.org/doi/10.1021/acs.nanolett.0c02077
- https://pubs.acs.org/doi/abs/10.1021/acscentsci.0c00988
- https://pubs.acs.org/doi/full/10.1021/acs.chemmater.0c02698
- https://pubs.acs.org/doi/full/10.1021/acs.jpcc.0c04596
- https://www.sciencedirect.com/science/article/pii/S0167273819309038
- https://www.sciencedirect.com/science/article/pii/S092702562030656X
- https://www.sciencedirect.com/science/article/pii/S2589152920300922
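As for a nicer overview: a minimal sketch for filtering the output down to failures only, assuming a recent Sphinx that writes a machine-readable ``output.json`` (one JSON object per line) next to ``output.txt`` in the default ``_build/linkcheck`` build directory:

```python
# Print only the broken links from the Sphinx linkcheck output.
# Assumes the default ``_build/linkcheck`` output path and the JSON-lines
# format with ``filename``, ``lineno``, ``uri``, ``status`` and ``info`` keys.
import json
import pathlib

for line in pathlib.Path("_build/linkcheck/output.json").read_text().splitlines():
    record = json.loads(line)
    if record["status"] == "broken":
        print(f"{record['filename']}:{record['lineno']} {record['uri']} ({record['info']})")
```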

giovannipizzi commented 2 years ago

Thanks @sphuber! I'll let @chrisjsewell check this PR and comment on the links etc. A 503 Service Unavailable is probably a temporary error, so maybe we should ignore those (or just issue a warning?). Most of those links work for me.

For the 404s, I would try to fix them, or, for those pages that do not exist anymore, edit the page to remove the link (or convert the link to text only and say "The page used to be at: XXX").
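If we do decide to ignore the publisher 503s, one way would be Sphinx's ``linkcheck_ignore`` option in ``conf.py``, which takes regular expressions that are matched against the URI. The patterns below are only illustrative, covering the hosts listed above:

```python
# conf.py: skip link checking for publisher hosts that consistently return 503
# to automated requests. Patterns are regular expressions matched against URIs.
linkcheck_ignore = [
    r"https://onlinelibrary\.wiley\.com/doi/.*",
    r"https://pubs\.acs\.org/doi/.*",
    r"https://aip\.scitation\.org/doi/.*",
    r"https://www\.sciencedirect\.com/science/article/.*",
]
```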

chrisjsewell commented 2 years ago

Thanks @sphuber, this is definitely very helpful. However:

> Any transformations that are missing in the automated script?

There is indeed a critical omission: all the images still link to the existing aiida.net, which will obviously break once the site is moved, e.g. [![Results of the feedback form for the AiiDA Coding week (Dec 2016)](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png)](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png).

> Missing category: this seems to have been added manually in the entries that were migrated manually. This information doesn't seem to be present in the database.

Yes indeed, all the categories were added manually, but they are critical for the working of the new blog sections, so I can't accept this PR without them being added.

There were also other manual changes to the existing documents, which are not covered by these auto-conversions. So I would ask you to also remove the auto-conversions for all documents that already exist.

The automated script itself also does not need to be added to the repository: since this is a one-time operation, we can just post the code in this PR for posterity.

sphuber commented 2 years ago

> There is indeed a critical omission: all the images still link to the existing aiida.net, which will obviously break once the site is moved.

Will have a look at adding these, either semi-automated or manually.
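A possible semi-automated sketch (the ``docs/news/images`` target folder and the URL pattern here are my assumptions, not something agreed in this thread):

```python
# Download the wp-content images referenced in the converted posts and
# rewrite the markdown links to point at local copies. Paths are assumptions.
import pathlib
import re
import urllib.request

PATTERN = re.compile(r"https?://(?:www\.)?aiida\.net/(wp-content/uploads/\S+?\.(?:png|jpe?g|gif))")

images = pathlib.Path("docs/news/images")
images.mkdir(exist_ok=True, parents=True)

for filepath in pathlib.Path("docs/news/posts").glob("*.md"):
    content = filepath.read_text()

    for relative_url in set(PATTERN.findall(content)):
        target = images / pathlib.Path(relative_url).name
        if not target.exists():
            urllib.request.urlretrieve(f"https://www.aiida.net/{relative_url}", target)

    # Point each image link at the downloaded copy instead of the live site.
    content = PATTERN.sub(lambda m: f"../images/{pathlib.Path(m.group(1)).name}", content)
    filepath.write_text(content)
```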

> > Missing category: this seems to have been added manually in the entries that were migrated manually. This information doesn't seem to be present in the database.
>
> Yes indeed, all the categories were added manually, but they are critical for the working of the new blog sections, so I can't accept this PR without them being added.

This we can easily distribute. It would be good if others could pitch in and add them through commits to this PR.

> There were also other manual changes to the existing documents, which are not covered by these auto-conversions. So I would ask you to also remove the auto-conversions for all documents that already exist.

What changes were these? If they can be added to the automated script, I will add them. I think there is value in having everything migrated in a single, consistent way.

> The automated script itself also does not need to be added to the repository: since this is a one-time operation, we can just post the code in this PR for posterity.

Fair. I will keep it in the PR for now, and once it is accepted I will remove it and post the final version here.

chrisjsewell commented 2 years ago

Superseded by #25.

sphuber commented 2 years ago

For posterity, this was the script used to generate the markdown files from the SQL database contents that had been converted to JSON:

```python
#!/usr/bin/env python
"""Parse the contents of the MySQL database dump of the original AiiDA website into markdown files.

The contents of the MySQL database were dumped to a ``.sql`` file, which was then converted to JSON. From this
JSON, the content of the ``wp_posts`` key was written to the ``data.json`` file. The content of each news entry is
HTML, which is converted to markdown using the ``markdownify`` library.

The converted news entries are written as markdown files to the ``../docs/news/posts`` folder. In addition, a file
``urls.json`` is written to the working directory, containing a mapping from each old URL to its new one. This
makes it possible to configure automatic redirects in the web server.
"""
import datetime
import json
import pathlib
import re
import textwrap

import markdownify

def main():
    with open("data.json") as handle:
        data = json.load(handle)

    entries = {}
    revisions = {}
    url_mapping = {}

    for entry in data:
        if entry["post_type"] in ["news", "post"]:
            entries[entry["ID"]] = entry
        elif entry["post_type"] == "revision":
            revisions[entry["ID"]] = entry

    for revision in revisions.values():
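        # WordPress stores revisions as separate ``wp_posts`` rows whose
        # ``post_parent`` points at the original entry, so attach each revision
        # to its parent and select the most recent one later.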
        parent_id = revision["post_parent"]

        if parent_id not in entries:
            continue

        title = revision["post_title"]
        content = revision["post_content"]

        try:
            ctime = datetime.datetime.strptime(
                revision["post_date_gmt"], "%Y-%m-%d %H:%M:%S"
            )
        except ValueError:
            ctime = datetime.datetime.strptime(
                revision["post_modified_gmt"], "%Y-%m-%d %H:%M:%S"
            )

        entries[parent_id].setdefault("revisions", []).append((ctime, title, content))

    basepath = pathlib.Path(__file__).parent.parent / "docs" / "news" / "posts"
    basepath.mkdir(exist_ok=True, parents=True)

    for pk, entry in entries.items():

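        # Prefer the content of the most recent revision over the original post body.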
        if "revisions" in entry:
            latest_revision = sorted(entry["revisions"], key=lambda x: x[0])[-1]
            content = latest_revision[-1]
        else:
            content = entry["post_content"]

        if not content.strip():
            continue

        try:
            ctime = datetime.datetime.strptime(
                entry["post_date_gmt"], "%Y-%m-%d %H:%M:%S"
            )
        except ValueError:
            ctime = datetime.datetime.strptime(
                entry["post_modified_gmt"], "%Y-%m-%d %H:%M:%S"
            )

        name = (
            entry["post_name"]
            if entry["post_name"]
            else "-".join([e.lower() for e in entry["post_title"].split()])
        )
        short_name = "-".join(name.split("-")[:4])
        title = entry["post_title"]
        date = f"{ctime:%Y-%m-%d}"
        header = textwrap.dedent(
            f"""
            ---
            blogpost: true
            category:
            tags:
            date: {date}
            ---
            """
        ).lstrip()

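        # Undo escape sequences left over from the SQL dump and normalize smart quotes.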
        substitutions = (
            ("\\r\\n", "\n"),
            ("\\n", "\n"),
            ("\\", ""),
            ("“", '"'),
            ("”", '"'),
        )

        for pattern, replacement in substitutions:
            title = title.replace(pattern, replacement)
            content = content.replace(pattern, replacement)

        markdown = markdownify.markdownify(content)  # Convert HTML to markdown
        markdown = re.sub(r"\n\s+\n", "\n\n", markdown)  # Remove whitespace lines
        markdown = re.sub(r"\s+\n", "\n", markdown)  # Remove line trailing whitespace
        markdown = re.sub(r"\n+", "\n\n", markdown)  # Normalize consecutive linebreaks
        markdown = markdown.replace("\xa0", " ")  # Replace literal non-breaking spaces

        filepath = basepath / f"{ctime:%Y-%m-%d}-{short_name}.md"

        with open(filepath, "w") as handle:
            handle.write(f"{header}\n")
            handle.write(f"# {title}\n\n")
            handle.write(f"{markdown.strip()}\n")

        url_old = f"news/{name}"
        url_new = f'news/posts/{filepath.name.replace(".md", ".html")}'
        url_mapping[url_old] = url_new

    with open("urls.json", "w") as handle:
        json.dump(url_mapping, handle, indent=4)
        handle.write("\n")

if __name__ == "__main__":
    main()
```
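Since the script's docstring mentions that ``urls.json`` is meant for configuring redirects, here is a rough sketch of how that mapping could be turned into server rules; nginx is purely an assumption on my part, as the thread does not say what serves aiida.net:

```python
# Turn the old-to-new URL mapping into nginx rewrite rules (nginx is assumed).
import json

with open("urls.json") as handle:
    mapping = json.load(handle)

for url_old, url_new in mapping.items():
    print(f"rewrite ^/{url_old}/?$ /{url_new} permanent;")
```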