aiidateam / aiida-website

The primary website for AiiDA
https://aiida.net
MIT License

Automatically converted news entries to markdown format #21

Closed · sphuber closed this 2 years ago

sphuber commented 2 years ago

Add a script to automatically convert the news entries from the MySQL dump of the original database to markdown files. The only thing missing so far, as far as I am aware:

- Missing category: this seems to have been added manually in the entries that were migrated manually. This information doesn't seem to be present in the database.

I think we can add the categories and tags manually, but then the state of the entries should be pretty close to the manually migrated ones. Any transformations that are missing in the automated script?

sphuber commented 2 years ago

This is now ready for review. There are just the failing links to check, but I am not sure what to do with them. A lot of the errors are from external sites (see below for an overview). What to do in these cases? Also, is there a nicer overview of the linkcheck output? Or do you really have to scan through each line (which includes the successful checks) for the failed ones? That seems like a poor user experience.

404 (Not Found):

- https://www.aiida.net/aiida-tutorial-at-vilnius-university-vilnius-lithuania
- http://www.max-centre.eu/2017/07/18/prize/
- http://www.max-centre.eu/max-hackathon/
- https://psi-k.net/www.materialscloud.org
- https://www.cecam.org/wp-content/uploads/2019/04/2019_03_EPFL_materials_science_researcher_software_engineer.pdf
- https://www.swissuniversities.ch/en/themen/digitalisierung/p-5-wissenschaftliche-information

503 (Service Unavailable):

- https://onlinelibrary.wiley.com/doi/10.1002/adma.201906054
- https://onlinelibrary.wiley.com/doi/10.1002/advs.201901606
- https://onlinelibrary.wiley.com/doi/10.1002/anie.201913024
- https://onlinelibrary.wiley.com/doi/10.1002/cphc.201900283
- https://pubs.acs.org/doi/10.1021/acs.chemmater.9b02047
- https://pubs.acs.org/doi/10.1021/acs.jctc.9b00586
- https://pubs.acs.org/doi/10.1021/acs.jpcc.9b05590
- https://pubs.acs.org/doi/10.1021/acs.nanolett.9b00865
- https://pubs.acs.org/doi/10.1021/acsami.0c01659
- https://pubs.acs.org/doi/10.1021/acscentsci.9b00619
- https://pubs.acs.org/doi/10.1021/acsami.9b13220
- https://pubs.acs.org/doi/10.1021/acsnano.8b07225
- https://pubs.acs.org/doi/10.1021/jacs.8b00587
- https://pubs.acs.org/doi/10.1021/jacs.8b06210
- https://pubs.acs.org/doi/10.1021/jacs.8b10407
- https://pubs.acs.org/doi/10.1021/jacs.9b04718
- https://pubs.acs.org/doi/10.1021/jacs.9b05319
- https://pubs.acs.org/doi/10.1021/jacs.9b05335
- https://pubs.acs.org/doi/10.1021/jacs.9b05501
- https://aip.scitation.org/doi/full/10.1063/5.0005077
- https://pubs.acs.org/doi/10.1021/acscatal.9b04952
- https://pubs.acs.org/doi/10.1021/acscentsci.9b00619
- https://onlinelibrary.wiley.com/doi/full/10.1002/adfm.202001984
- https://onlinelibrary.wiley.com/doi/full/10.1002/advs.201901606
- https://pubs.acs.org/doi/10.1021/acs.nanolett.0c02077
- https://pubs.acs.org/doi/abs/10.1021/acscentsci.0c00988
- https://pubs.acs.org/doi/full/10.1021/acs.chemmater.0c02698
- https://pubs.acs.org/doi/full/10.1021/acs.jpcc.0c04596
- https://www.sciencedirect.com/science/article/pii/S0167273819309038
- https://www.sciencedirect.com/science/article/pii/S092702562030656X
- https://www.sciencedirect.com/science/article/pii/S2589152920300922
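As for a nicer overview: a minimal sketch for filtering the output down to failures only, assuming a recent Sphinx that writes a machine-readable ``output.json`` (one JSON object per line) next to ``output.txt`` in the default ``_build/linkcheck`` build directory:

```python
# Print only the broken links from the Sphinx linkcheck output.
# Assumes the default ``_build/linkcheck`` output path and the JSON-lines
# format with ``filename``, ``lineno``, ``uri``, ``status`` and ``info`` keys.
import json
import pathlib

for line in pathlib.Path("_build/linkcheck/output.json").read_text().splitlines():
    record = json.loads(line)
    if record["status"] == "broken":
        print(f"{record['filename']}:{record['lineno']} {record['uri']} ({record['info']})")
```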

giovannipizzi commented 2 years ago

Thanks @sphuber! I'll let @chrisjsewell check this PR and comment on the links etc. A 503 Service Unavailable is probably a temporary error, so maybe we should ignore those (or just issue a warning?). Most of those links work for me.

For the 404s, I would try to fix them, or, for those pages that do not exist anymore, edit the page to remove the link (or convert the link to text only and say "The page used to be at: XXX").
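If we do decide to ignore the publisher 503s, one way would be Sphinx's ``linkcheck_ignore`` option in ``conf.py``, which takes regular expressions that are matched against the URI. The patterns below are only illustrative, covering the hosts listed above:

```python
# conf.py: skip link checking for publisher hosts that consistently return 503
# to automated requests. Patterns are regular expressions matched against URIs.
linkcheck_ignore = [
    r"https://onlinelibrary\.wiley\.com/doi/.*",
    r"https://pubs\.acs\.org/doi/.*",
    r"https://aip\.scitation\.org/doi/.*",
    r"https://www\.sciencedirect\.com/science/article/.*",
]
```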

chrisjsewell commented 2 years ago

Thanks @sphuber, this is definitely very helpful. However:

> Any transformations that are missing in the automated script?

There is indeed a critical omission: all the images still link to the existing aiida.net, which will obviously break once the site is moved, e.g. [![Results of the feedback form for the AiiDA Coding week (Dec 2016)](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png)](http://www.aiida.net/wp-content/uploads/2016/12/aiida_coding_week_2016_results.png).

> Missing category: this seems to have been added manually in the entries that were migrated manually. This information doesn't seem to be present in the database.

Yes indeed, all the categories were added manually, but they are critical for the working of the new blog sections, so I can't accept this PR without them being added.

There were also other manual changes to the existing documents, which are not covered by these auto-conversions. So I would ask you to also remove the auto-conversions for all documents that already exist.

The automated script itself also does not need to be added to the repository: since this is a one-time operation, we can just post the code in this PR for posterity.

sphuber commented 2 years ago

> There is indeed a critical omission: all the images still link to the existing aiida.net, which will obviously break once the site is moved.

Will have a look at adding these, either semi-automated or manually.
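A possible semi-automated sketch (the ``docs/news/images`` target folder and the URL pattern here are my assumptions, not something agreed in this thread):

```python
# Download the wp-content images referenced in the converted posts and
# rewrite the markdown links to point at local copies. Paths are assumptions.
import pathlib
import re
import urllib.request

PATTERN = re.compile(r"https?://(?:www\.)?aiida\.net/(wp-content/uploads/\S+?\.(?:png|jpe?g|gif))")

images = pathlib.Path("docs/news/images")
images.mkdir(exist_ok=True, parents=True)

for filepath in pathlib.Path("docs/news/posts").glob("*.md"):
    content = filepath.read_text()

    for relative_url in set(PATTERN.findall(content)):
        target = images / pathlib.Path(relative_url).name
        if not target.exists():
            urllib.request.urlretrieve(f"https://www.aiida.net/{relative_url}", target)

    # Point each image link at the downloaded copy instead of the live site.
    content = PATTERN.sub(lambda m: f"../images/{pathlib.Path(m.group(1)).name}", content)
    filepath.write_text(content)
```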

> > Missing category: this seems to have been added manually in the entries that were migrated manually. This information doesn't seem to be present in the database.
>
> Yes indeed, all the categories were added manually, but they are critical for the working of the new blog sections, so I can't accept this PR without them being added.

This we can easily distribute. It would be good if others could pitch in and add them through commits to this PR.

> There were also other manual changes to the existing documents, which are not covered by these auto-conversions. So I would ask you to also remove the auto-conversions for all documents that already exist.

What changes were these? If they can be added to the automated script, I will add them. I think there is value in having everything migrated in a single, consistent way.

> The automated script itself also does not need to be added to the repository: since this is a one-time operation, we can just post the code in this PR for posterity.

Fair. I will keep it in the PR for now, and once it is accepted I will remove it and post the final version here.

chrisjsewell commented 2 years ago

Superseded by #25.

sphuber commented 2 years ago

For posterity, this was the script used to generate the markdown files from the SQL database contents that had been converted to JSON:

```python
#!/usr/bin/env python
"""Parse the contents of the MySQL database dump of the original AiiDA website into markdown files.

The contents of the MySQL database were dumped to a ``.sql`` file, which was then converted to JSON. From this
JSON, the content of the ``wp_posts`` key was written to the ``data.json`` file. The content of each news entry is
HTML, which is converted to markdown using the ``markdownify`` library.

The converted news entries are written as markdown files to the ``../docs/news/posts`` folder. In addition, a file
``urls.json`` is written to the working directory, containing a mapping from each old URL to its new one. This
makes it possible to configure automatic redirects in the web server.
"""
import datetime
import json
import pathlib
import re
import textwrap

import markdownify

def main():
    with open("data.json") as handle:
        data = json.load(handle)

    entries = {}
    revisions = {}
    url_mapping = {}

    for entry in data:
        if entry["post_type"] in ["news", "post"]:
            entries[entry["ID"]] = entry
        elif entry["post_type"] == "revision":
            revisions[entry["ID"]] = entry

    for revision in revisions.values():
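        # WordPress stores revisions as separate ``wp_posts`` rows whose
        # ``post_parent`` points at the original entry, so attach each revision
        # to its parent and select the most recent one later.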
        parent_id = revision["post_parent"]

        if parent_id not in entries:
            continue

        title = revision["post_title"]
        content = revision["post_content"]

        try:
            ctime = datetime.datetime.strptime(
                revision["post_date_gmt"], "%Y-%m-%d %H:%M:%S"
            )
        except ValueError:
            ctime = datetime.datetime.strptime(
                revision["post_modified_gmt"], "%Y-%m-%d %H:%M:%S"
            )

        entries[parent_id].setdefault("revisions", []).append((ctime, title, content))

    basepath = pathlib.Path(__file__).parent.parent / "docs" / "news" / "posts"
    basepath.mkdir(exist_ok=True, parents=True)

    for pk, entry in entries.items():

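        # Prefer the content of the most recent revision over the original post body.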
        if "revisions" in entry:
            latest_revision = sorted(entry["revisions"], key=lambda x: x[0])[-1]
            content = latest_revision[-1]
        else:
            content = entry["post_content"]

        if not content.strip():
            continue

        try:
            ctime = datetime.datetime.strptime(
                entry["post_date_gmt"], "%Y-%m-%d %H:%M:%S"
            )
        except ValueError:
            ctime = datetime.datetime.strptime(
                entry["post_modified_gmt"], "%Y-%m-%d %H:%M:%S"
            )

        name = (
            entry["post_name"]
            if entry["post_name"]
            else "-".join([e.lower() for e in entry["post_title"].split()])
        )
        short_name = "-".join(name.split("-")[:4])
        title = entry["post_title"]
        date = f"{ctime:%Y-%m-%d}"
        header = textwrap.dedent(
            f"""
            ---
            blogpost: true
            category:
            tags:
            date: {date}
            ---
            """
        ).lstrip()

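        # Undo escape sequences left over from the SQL dump and normalize smart quotes.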
        substitutions = (
            ("\\r\\n", "\n"),
            ("\\n", "\n"),
            ("\\", ""),
            ("“", '"'),
            ("”", '"'),
        )

        for pattern, replacement in substitutions:
            title = title.replace(pattern, replacement)
            content = content.replace(pattern, replacement)

        markdown = markdownify.markdownify(content)  # Convert HTML to markdown
        markdown = re.sub(r"\n\s+\n", "\n\n", markdown)  # Remove whitespace lines
        markdown = re.sub(r"\s+\n", "\n", markdown)  # Remove line trailing whitespace
        markdown = re.sub(r"\n+", "\n\n", markdown)  # Normalize consecutive linebreaks
        markdown = markdown.replace("\xa0", " ")  # Replace literal non-breaking spaces

        filepath = basepath / f"{ctime:%Y-%m-%d}-{short_name}.md"

        with open(filepath, "w") as handle:
            handle.write(f"{header}\n")
            handle.write(f"# {title}\n\n")
            handle.write(f"{markdown.strip()}\n")

        url_old = f"news/{name}"
        url_new = f'news/posts/{filepath.name.replace(".md", ".html")}'
        url_mapping[url_old] = url_new

    with open("urls.json", "w") as handle:
        json.dump(url_mapping, handle, indent=4)
        handle.write("\n")

if __name__ == "__main__":
    main()
```
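Since the script's docstring mentions that ``urls.json`` is meant for configuring redirects, here is a rough sketch of how that mapping could be turned into server rules; nginx is purely an assumption on my part, as the thread does not say what serves aiida.net:

```python
# Turn the old-to-new URL mapping into nginx rewrite rules (nginx is assumed).
import json

with open("urls.json") as handle:
    mapping = json.load(handle)

for url_old, url_new in mapping.items():
    print(f"rewrite ^/{url_old}/?$ /{url_new} permanent;")
```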