WordPress / wordpress-playground

Run WordPress in the browser via WebAssembly PHP
https://w.org/playground/
GNU General Public License v2.0

[Data Liberation] WP_Stream_Importer with support for WXR and Markdown files #1982

Closed adamziel closed 4 days ago

adamziel commented 2 weeks ago

Motivation for the change, related issues

Adds WP_Stream_Importer – a generalized importer for arbitrary data. It comes with two data sources: WXR files, read by WP_WXR_Reader, and Markdown files, read by WP_Markdown_Directory_Tree_Reader.

WP_Stream_Importer

This is a draft of a re-entrant stream importer designed for importing very large datasets with minimal overhead. The core ideas are:

Entities

This is a generalized data importer, not a WXR importer. WXR is just one of the possible data sources. This design enables importing Markdown files, Blogger exports, Tumblr blogs, etc. without having to rewrite that data as WXR.

The basic unit of data is an "entity" – a simple PHP array with post, tag, comment etc. data. Entities can be sourced from WXR and Markdown files – the relevant classes are described below.
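To make the "entity as a plain PHP array" idea concrete, here is a minimal sketch. The `type` and `data` keys, the field names, and the `toy_*` helpers are assumptions made for illustration; the PR text does not spell out the exact array shape.

```php
<?php
// Hypothetical entity shape: a type label plus a flat array of fields.
// The 'type' and 'data' keys are assumptions, not the PR's documented format.
function toy_entity(string $type, array $data): array {
    return ['type' => $type, 'data' => $data];
}

// Entities from any source (WXR, Markdown, ...) would look alike to a consumer.
$entities = [
    toy_entity('post', ['post_title' => 'Hello world', 'post_content' => 'First post.']),
    toy_entity('tag', ['name' => 'news', 'slug' => 'news']),
    toy_entity('comment', ['comment_author' => 'Ada', 'comment_content' => 'Nice post!']),
];

// A consumer can dispatch on the type without knowing where the entity came from.
function toy_count_by_type(iterable $entities): array {
    $counts = [];
    foreach ($entities as $entity) {
        $counts[$entity['type']] = ($counts[$entity['type']] ?? 0) + 1;
    }
    return $counts;
}

$counts = toy_count_by_type($entities);
```

Because the consumer only sees arrays, adding another data source would not require changing the code that processes entities.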

Multiple passes

Every import will require multiple passes over the stream of entities to:

User input

The proposed importer is not a single "start and forget" device. It could be configured as such, but by default it will require the user to review the process – sometimes multiple times. Here are a few examples of such touchpoints:

If a webhost would rather avoid asking the user all these questions, the future importer API may enable forcing each of these decisions.

WP_WXR_Reader

Streaming

The WXR reader supports the usual streaming interface with append_bytes(), is_paused_on_incomplete_input() et al.

It also comes with a new connect_upstream( $byte_source ) method that allows it to automatically pull new data chunks from a data source:

$wxr = new WP_WXR_Reader();
$wxr->connect_upstream(
    new WP_File_Reader(__DIR__ . '/tests/fixtures/wxr-simple.xml')
);
while($wxr->next_entity()) {
    $entity = $wxr->get_entity();
    // process
}

This way the consumer code never needs to worry about appending bytes, checking for EOF and such.
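The difference between pushing bytes manually and connecting an upstream can be sketched with a toy reader. `Toy_Line_Reader` below is a hypothetical stand-in, not the real WP_WXR_Reader: it treats every newline-terminated line as one "entity" and accepts any `Iterator` of string chunks as its byte source, which is an assumption about what a byte source might look like.

```php
<?php
// Toy stand-in for a streaming reader (NOT the real WP_WXR_Reader).
// Each newline-terminated line counts as one "entity".
class Toy_Line_Reader {
    private string $buffer = '';
    private ?string $entity = null;
    private bool $paused = false;
    private bool $finished = false;
    private ?Iterator $upstream = null;

    public function append_bytes(string $bytes): void {
        $this->buffer .= $bytes;
        $this->paused = false;
    }

    public function connect_upstream(Iterator $byte_source): void {
        $this->upstream = $byte_source;
    }

    public function is_paused_on_incomplete_input(): bool {
        return $this->paused;
    }

    public function get_entity(): ?string {
        return $this->entity;
    }

    public function next_entity(): bool {
        while (true) {
            $newline_at = strpos($this->buffer, "\n");
            if (false !== $newline_at) {
                $this->entity = substr($this->buffer, 0, $newline_at);
                $this->buffer = substr($this->buffer, $newline_at + 1);
                return true;
            }
            // No complete entity is buffered: pull another chunk if we can.
            if ($this->upstream && $this->upstream->valid()) {
                $this->append_bytes($this->upstream->current());
                $this->upstream->next();
                continue;
            }
            // An exhausted upstream means end of input; otherwise we wait for bytes.
            $this->finished = null !== $this->upstream;
            $this->paused = !$this->finished;
            return false;
        }
    }
}

$chunks = ["first\nsec", "ond\n"];

// Push mode: the consumer feeds bytes and drains entities after every chunk.
$push = new Toy_Line_Reader();
$push_seen = [];
foreach ($chunks as $chunk) {
    $push->append_bytes($chunk);
    while ($push->next_entity()) {
        $push_seen[] = $push->get_entity();
    }
}

// Pull mode: the reader fetches chunks itself, so one loop is enough.
$pull = new Toy_Line_Reader();
$pull->connect_upstream(new ArrayIterator($chunks));
$pull_seen = [];
while ($pull->next_entity()) {
    $pull_seen[] = $pull->get_entity();
}
```

Both modes yield the same entities even though "second" is split across two chunks; the pull variant simply moves the chunk bookkeeping out of the consumer.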

This PR also ships a few byte sources. Shaping more than one helped me notice patterns and propose v1 of the interface:

WP_Markdown_Directory_Tree_Reader

This class traverses a directory tree and transforms all the .md files into page entity objects that can be processed by WP_Entity_Importer:

$docs_root = __DIR__ . '/../../docs/site';
$docs_content_root = $docs_root . '/docs';
$entity_iterator_factory = function() use ($docs_content_root) {
    return new WP_Markdown_Directory_Tree_Reader(
        $docs_content_root,
        1000
    );
};
$markdown_importer = WP_Markdown_Importer::create(
    $entity_iterator_factory, [
        'source_site_url' => 'file://' . $docs_content_root,
        'local_markdown_assets_root' => $docs_root,
        'local_markdown_assets_url_prefix' => '@site/',
    ]
);
$markdown_importer->frontload_assets();
$markdown_importer->import_posts();
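For intuition about the traversal step alone, here is a minimal sketch that walks a directory with PHP's SPL iterators and yields one page entity per .md file. The `toy_markdown_entities()` function and the entity keys are hypothetical; the real WP_Markdown_Directory_Tree_Reader may structure its output differently.

```php
<?php
// Hypothetical helper: yields a page entity for every .md file under $root.
// The entity shape ('type' / 'data' keys) is an assumption for illustration.
function toy_markdown_entities(string $root): Generator {
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        if ($file->isFile() && 'md' === strtolower($file->getExtension())) {
            yield [
                'type' => 'post',
                'data' => [
                    'post_type'    => 'page',
                    'post_title'   => $file->getBasename('.md'),
                    'post_content' => file_get_contents($file->getPathname()),
                ],
            ];
        }
    }
}

// Build a throwaway directory tree to demonstrate the traversal.
$root = sys_get_temp_dir() . '/toy-md-' . uniqid();
mkdir($root . '/guide', 0777, true);
file_put_contents($root . '/index.md', "# Welcome\n");
file_put_contents($root . '/guide/setup.md', "# Setup\n");
file_put_contents($root . '/guide/notes.txt', "not markdown\n");

$titles = [];
foreach (toy_markdown_entities($root) as $entity) {
    $titles[] = $entity['data']['post_title'];
}
sort($titles); // Directory iteration order is not guaranteed.
```

Using a generator keeps only one file's content in memory at a time, which fits the streaming approach described in this PR.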

WP_Markdown_To_Blocks

We don't just save raw Markdown data as post_content. Not at all!
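As a rough illustration of what turning Markdown into block markup involves, the sketch below handles only ATX headings and plain paragraphs. `toy_markdown_to_blocks()` is a hypothetical helper written for this example and is not the conversion logic shipped in this PR.

```php
<?php
// Toy converter: headings and paragraphs only. NOT the PR's WP_Markdown_To_Blocks.
function toy_markdown_to_blocks(string $markdown): string {
    $blocks = [];
    // Treat runs of two or more line breaks as block separators.
    foreach (preg_split('/\R{2,}/', trim($markdown)) as $chunk) {
        $chunk = trim($chunk);
        if ('' === $chunk) {
            continue;
        }
        if (preg_match('/^(#{1,6})\s+(.+)$/', $chunk, $m)) {
            // A single-line "# Title" chunk becomes a heading block.
            $blocks[] = sprintf(
                '<!-- wp:heading {"level":%1$d} --><h%1$d class="wp-block-heading">%2$s</h%1$d><!-- /wp:heading -->',
                strlen($m[1]),
                htmlspecialchars($m[2])
            );
        } else {
            // Everything else is emitted as a paragraph block.
            $blocks[] = '<!-- wp:paragraph --><p>' . htmlspecialchars($chunk) . '</p><!-- /wp:paragraph -->';
        }
    }
    return implode("\n\n", $blocks);
}

$block_markup = toy_markdown_to_blocks("# Title\n\nHello & welcome");
```

Storing block markup rather than raw Markdown means the imported content opens as regular blocks in the editor.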

This PR ships a WP_Markdown_To_Blocks class that:

Other stuff

This PR also:

Follow-up work

Testing instructions

Confirm the CI tests pass. This code isn't actually used anywhere yet, so there isn't a better way to test it.