Adds WP_Stream_Importer – a generalized importer for arbitrary data. It comes with two data sources:
WP_WXR_Reader that streams entities from a WXR file
WP_Markdown_Directory_Tree_Reader that turns a markdown directory into page entities
WP_Stream_Importer
This is a draft of a re-entrant stream importer designed for importing very large datasets with minimal overhead. The few core ideas are:
Never insert a database record until all its dependencies are available.
(Almost) never post-process database data. For example, replace all the URLs up front, before the records are inserted.
Never crash. Instead, tell the user what failures happened and ask them how to proceed (e.g. upload a custom image).
Whenever the work is stopped, start the next run at that exact point.
Avoid per-record database lookups, e.g. don't run SELECT * FROM wp_posts WHERE guid = :guid
Clearly communicate progress (x out of y posts imported, x out of y images downloaded, 380MB of huge_file.zip downloaded).
Assume the import will take multiple requests and make everything re-entrant.
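To make the re-entrancy idea concrete, here is a minimal sketch of a budgeted run that resumes from a saved cursor. Everything in it is hypothetical: `run_import_step()`, the `next_index` cursor key, and the per-run budget are illustration names, not the importer's API.

```php
<?php
/**
 * Hypothetical sketch: process records in budgeted runs and resume from a
 * saved cursor. None of these names come from the PR.
 */
function run_import_step( array $records, array $cursor, int $budget ): array {
    $index     = $cursor['next_index'] ?? 0;
    $processed = 0;
    $total     = count( $records );

    while ( $index < $total && $processed < $budget ) {
        // A real importer would insert $records[ $index ] here.
        $index++;
        $processed++;
    }

    return array(
        'next_index' => $index,
        'total'      => $total,
        'finished'   => $index >= $total,
    );
}

// Simulate separate requests: the cursor survives only as a JSON string.
$records = range( 1, 5 );
$saved   = json_encode( array( 'next_index' => 0 ) );
$runs    = 0;
do {
    $cursor = run_import_step( $records, json_decode( $saved, true ), 2 );
    $saved  = json_encode( $cursor );
    $runs++;
    echo "{$cursor['next_index']} out of {$cursor['total']} records imported\n";
} while ( ! $cursor['finished'] );
```

The cursor is the only state that crosses the request boundary, which is also what makes the "x out of y" progress message cheap to produce.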
Entities
This is a generalized data importer, not a WXR importer. WXR is just one of the possible data sources. This design enables importing markdown files, Blogger exports, Tumblr blogs etc. without having to rewrite that data as WXR.
The basic unit of data is an "entity" – a simple PHP array with post, tag, comment etc. data. Entities can be sourced from WXR and Markdown files – the relevant classes are described below.
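As an illustration only, entities could look roughly like the arrays below. The `type` and `data` keys and all field names are assumptions made for this example; the PR text only says an entity is a simple PHP array.

```php
<?php
// Hypothetical entity shapes. The 'type' and 'data' keys and the field names
// are assumptions for this example, not the PR's actual schema.
$entities = array(
    array(
        'type' => 'post',
        'data' => array( 'post_id' => 10, 'post_title' => 'Hello world', 'post_parent' => 0 ),
    ),
    array(
        'type' => 'post',
        'data' => array( 'post_id' => 11, 'post_title' => 'Child page', 'post_parent' => 10 ),
    ),
    array(
        'type' => 'comment',
        'data' => array( 'comment_id' => 3, 'post_id' => 10, 'content' => 'Nice!' ),
    ),
    array(
        'type' => 'tag',
        'data' => array( 'slug' => 'news', 'name' => 'News' ),
    ),
);

/**
 * Counts entities per type. The consumer never asks whether an entity came
 * from WXR, Markdown, or any other source.
 */
function count_entities_by_type( iterable $entities ): array {
    $counts = array();
    foreach ( $entities as $entity ) {
        $type            = $entity['type'];
        $counts[ $type ] = ( $counts[ $type ] ?? 0 ) + 1;
    }
    return $counts;
}

$counts = count_entities_by_type( $entities );
```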
Multiple passes
Every import will require multiple passes over the stream of entities to:
Perform a topological sort so that dependencies are processed first
Frontload all static assets
Potentially retry failed downloads
Verify all the files have been downloaded before moving on to inserting posts
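The topological sort pass is not implemented yet. For posts with at most one parent each, it could be sketched as a breadth-first walk from the root posts, which is Kahn's algorithm specialized to a single dependency per node. The function name and the input shape (post ID mapped to parent ID, with 0 meaning "no parent") are assumptions.

```php
<?php
/**
 * Hypothetical sketch: orders post IDs so every parent precedes its children.
 * $parents maps a post ID to its parent ID; 0 means "no parent".
 */
function sort_parents_first( array $parents ): array {
    $children = array();
    $ready    = array();
    foreach ( $parents as $id => $parent_id ) {
        if ( 0 === $parent_id || ! array_key_exists( $parent_id, $parents ) ) {
            // No dependency inside this dataset, so it can go first.
            $ready[] = $id;
        } else {
            $children[ $parent_id ][] = $id;
        }
    }

    $sorted = array();
    while ( $ready ) {
        $id       = array_shift( $ready );
        $sorted[] = $id;
        // The parent is now placed, so its children become available.
        foreach ( $children[ $id ] ?? array() as $child_id ) {
            $ready[] = $child_id;
        }
    }

    if ( count( $sorted ) !== count( $parents ) ) {
        // Some posts were never reached, so they depend on each other in a loop.
        throw new RuntimeException( 'Cycle detected among post parents' );
    }
    return $sorted;
}

$order = sort_parents_first( array( 7 => 5, 5 => 2, 2 => 0, 9 => 0 ) );
```

Posts whose parent is absent from the dataset are treated as roots here; a real pass would more likely flag them for the user review described in the next section.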
User input
The proposed importer is not a single "start and forget" device. It could be configured as such, but by default it will require the user to review the process – sometimes multiple times. Here are a few examples of such touchpoints:
20 images failed to download. Do you want to provide alternative images? Or do you want to remove them from the site and remove any related <img> tags from the content? They are referenced in these posts: (list of posts)
Post number 984 already exists in the database. Do you want to overwrite it? Ignore it? Insert as a new one? Manually reconcile the conflict?
Post 985 has parent_id 23, but there is no such parent. Do you want to set another parent? Or make it a top-level post? Or ignore it?
If a webhost would rather avoid asking the user all these questions, the future importer API may enable forcing each of these decisions.
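One possible shape for such forced decisions is a per-conflict policy. The sketch below is purely hypothetical: the function name, the policy names (`ask`, `make_top_level`, `skip`) and the returned array are made up for illustration and are not part of this PR.

```php
<?php
/**
 * Hypothetical sketch: decide what to do with a post whose parent is missing.
 * The policy names and the return shape are assumptions, not the PR's API.
 */
function resolve_missing_parent( array $post, array $known_post_ids, string $policy = 'ask' ): array {
    $parent_id = $post['post_parent'];
    if ( 0 === $parent_id || in_array( $parent_id, $known_post_ids, true ) ) {
        // Nothing to resolve: the parent is absent by design or already known.
        return array( 'action' => 'insert', 'post' => $post );
    }

    switch ( $policy ) {
        case 'make_top_level':
            $post['post_parent'] = 0;
            return array( 'action' => 'insert', 'post' => $post );
        case 'skip':
            return array( 'action' => 'skip', 'post' => $post );
        default:
            // Pause and surface the question to the user instead of crashing.
            return array(
                'action'   => 'ask_user',
                'post'     => $post,
                'question' => "Post {$post['post_id']} has parent_id {$parent_id}, but there is no such parent.",
            );
    }
}

$post   = array( 'post_id' => 985, 'post_parent' => 23 );
$asked  = resolve_missing_parent( $post, array( 1, 2, 3 ) );
$forced = resolve_missing_parent( $post, array( 1, 2, 3 ), 'make_top_level' );
```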
WP_WXR_Reader
Streaming
The WXR reader supports the usual streaming interface with append_bytes(), is_paused_on_incomplete_input() et al.
It also comes with a new connect_upstream( $byte_source ) method that allows it to automatically pull new data chunks from a data source:
$wxr = new WP_WXR_Reader();
$wxr->connect_upstream(
    new WP_File_Reader( __DIR__ . '/tests/fixtures/wxr-simple.xml' )
);
while ( $wxr->next_entity() ) {
    $entity = $wxr->get_entity();
    // Process the entity here.
}
This way the consumer code never needs to worry about appending bytes, checking for EOF and such.
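To show why pulling from an upstream source removes that bookkeeping, here is a self-contained toy, not the real classes. `Toy_Line_Reader` treats every newline-terminated line as an "entity", and `Toy_Byte_Source` with its `next_chunk()` method is a hypothetical stand-in for a byte source; the actual interface in this PR may look different.

```php
<?php
// Toy byte source (hypothetical): hands out a string in fixed-size chunks.
class Toy_Byte_Source {
    private $bytes;
    private $chunk_size;
    private $offset = 0;

    public function __construct( string $bytes, int $chunk_size ) {
        $this->bytes      = $bytes;
        $this->chunk_size = $chunk_size;
    }

    public function next_chunk(): ?string {
        if ( $this->offset >= strlen( $this->bytes ) ) {
            return null; // End of data.
        }
        $chunk         = substr( $this->bytes, $this->offset, $this->chunk_size );
        $this->offset += strlen( $chunk );
        return $chunk;
    }
}

// Toy reader: each newline-terminated line is one "entity".
class Toy_Line_Reader {
    private $buffer   = '';
    private $upstream = null;
    private $entity   = null;
    private $paused   = false;

    public function connect_upstream( Toy_Byte_Source $source ): void {
        $this->upstream = $source;
    }

    public function append_bytes( string $bytes ): void {
        $this->buffer .= $bytes;
        $this->paused  = false;
    }

    public function is_paused_on_incomplete_input(): bool {
        return $this->paused;
    }

    public function get_entity(): ?string {
        return $this->entity;
    }

    public function next_entity(): bool {
        while ( true ) {
            $newline_at = strpos( $this->buffer, "\n" );
            if ( false !== $newline_at ) {
                $this->entity = substr( $this->buffer, 0, $newline_at );
                $this->buffer = substr( $this->buffer, $newline_at + 1 );
                return true;
            }
            // Not enough bytes for a full entity: pull more if an upstream exists.
            $chunk = $this->upstream ? $this->upstream->next_chunk() : null;
            if ( null === $chunk ) {
                $this->paused = true;
                return false;
            }
            $this->append_bytes( $chunk );
        }
    }
}

$reader = new Toy_Line_Reader();
$reader->connect_upstream( new Toy_Byte_Source( "post 1\npost 2\npost 3\n", 4 ) );
$seen = array();
while ( $reader->next_entity() ) {
    $seen[] = $reader->get_entity();
}
```

The 4-byte chunks split every line across chunk boundaries, yet the consumer loop stays identical to the one above.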
This PR also ships a few byte sources. Shaping more than one helped me notice patterns and propose v1 of the interface:
WP_File_Reader – streams bytes from a local file
WP_GZ_File_Reader – streams bytes from a gzipped local file
WP_Remote_File_Reader – streams bytes over HTTPS
WP_Remote_File_Ranged_Reader – streams specific byte ranges over HTTPS
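The common pattern behind these sources is "give me the next chunk, starting at a known offset". A minimal local-file sketch of that idea, using only PHP's standard file functions, might look like this. `read_chunk_at()` is a hypothetical helper, not one of the classes above.

```php
<?php
/**
 * Hypothetical helper: reads up to $length bytes from $path starting at
 * $offset and reports the offset to resume from on the next call.
 */
function read_chunk_at( string $path, int $offset, int $length ): array {
    $handle = fopen( $path, 'rb' );
    if ( false === $handle ) {
        throw new RuntimeException( "Cannot open $path" );
    }
    fseek( $handle, $offset );
    $bytes = fread( $handle, $length );
    $next  = ftell( $handle );
    fclose( $handle );

    return array(
        'bytes'       => (string) $bytes,
        'next_offset' => $next,
    );
}

// Write a small fixture, then read it in two independent calls.
$path = tempnam( sys_get_temp_dir(), 'wxr' );
file_put_contents( $path, '<rss><item>1</item><item>2</item></rss>' );

$first  = read_chunk_at( $path, 0, 10 );
$second = read_chunk_at( $path, $first['next_offset'], 10 );

unlink( $path );
```

Because each call reopens the file and seeks, the only state that has to survive between requests is `next_offset`.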
WP_Markdown_Directory_Tree_Reader
This class traverses a directory tree and transforms all the .md files into page entity objects that can be processed by WP_Entity_Importer.
WP_Markdown_To_Blocks
We don't just save raw Markdown data as post_content. Not at all!
This PR ships a WP_Markdown_To_Blocks class that:
Parses markdown data using the League\CommonMark library. It supports frontmatter and GitHub-flavored syntax such as tables, but it's also bulky and likely not PHP 7.2-compatible. For inclusion in WordPress core, we may need to roll our own Markdown parser, or fork League\CommonMark and downgrade it to PHP 7.2.
Converts the document tree to block markup.
Sources the post title, order, slug etc. from frontmatter.
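The three steps above can be illustrated with a deliberately tiny converter. This is a toy, not WP_Markdown_To_Blocks: it does not use League\CommonMark, it only understands "key: value" frontmatter, "#" headings and plain paragraphs, and the function name and return shape are assumptions.

```php
<?php
/**
 * Toy converter (hypothetical): handles frontmatter, "#" headings and plain
 * paragraphs only. It is not the PR's WP_Markdown_To_Blocks.
 */
function toy_markdown_to_blocks( string $markdown ): array {
    $metadata = array();

    // Split off a leading "---" frontmatter section made of "key: value" lines.
    if ( preg_match( '/\A---\n(.*?)\n---\n/s', $markdown, $match ) ) {
        foreach ( explode( "\n", $match[1] ) as $line ) {
            if ( false === strpos( $line, ':' ) ) {
                continue;
            }
            list( $key, $value )      = explode( ':', $line, 2 );
            $metadata[ trim( $key ) ] = trim( $value );
        }
        $markdown = substr( $markdown, strlen( $match[0] ) );
    }

    $blocks = array();
    // Blank lines separate top-level chunks.
    foreach ( preg_split( '/\n{2,}/', trim( $markdown ) ) as $chunk ) {
        if ( '' === $chunk ) {
            continue;
        }
        if ( preg_match( '/^(#{1,6})\s+(.+)$/', $chunk, $heading ) ) {
            $blocks[] = sprintf(
                '<!-- wp:heading {"level":%1$d} -->' . "\n" .
                '<h%1$d class="wp-block-heading">%2$s</h%1$d>' . "\n" .
                '<!-- /wp:heading -->',
                strlen( $heading[1] ),
                htmlspecialchars( $heading[2] )
            );
        } else {
            $blocks[] = "<!-- wp:paragraph -->\n<p>" . htmlspecialchars( $chunk ) . "</p>\n<!-- /wp:paragraph -->";
        }
    }

    return array(
        'metadata'     => $metadata,
        'post_content' => implode( "\n\n", $blocks ),
    );
}

$doc    = "---\ntitle: Getting started\nslug: getting-started\n---\n# Install\n\nRun the installer.";
$result = toy_markdown_to_blocks( $doc );
```

A real converter walks the parsed document tree instead of splitting on blank lines, which is what makes nested lists, tables and inline formatting possible.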
Other stuff
This PR also:
Enhances the XML parser.
@php-wasm/compile – Adds more Asyncify functions to the PHP WASM Dockerfile
@wp-playground/cli – Buffers downloads to a .partial file so that a failed download is never mistaken for an already cached file.
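The .partial idea is independent of the CLI's own language. Sketched in PHP to match the other examples here, with a hypothetical `save_via_partial_file()` helper, it amounts to writing to a temporary name and renaming only on success:

```php
<?php
/**
 * Hypothetical helper: writes to "$destination.partial" and renames the file
 * only after the writer finished, so an interrupted download never looks cached.
 */
function save_via_partial_file( string $destination, callable $write_chunks ): bool {
    $partial = $destination . '.partial';
    $handle  = fopen( $partial, 'wb' );
    if ( false === $handle ) {
        return false;
    }

    try {
        $write_chunks( $handle );
    } catch ( Exception $e ) {
        // The download failed midway: discard the incomplete file.
        fclose( $handle );
        unlink( $partial );
        return false;
    }

    fclose( $handle );
    return rename( $partial, $destination );
}

$dir = sys_get_temp_dir() . '/partial-demo-' . uniqid();
mkdir( $dir );

$ok = save_via_partial_file(
    "$dir/ok.zip",
    function ( $handle ) {
        fwrite( $handle, 'data' );
    }
);

$failed = save_via_partial_file(
    "$dir/broken.zip",
    function ( $handle ) {
        fwrite( $handle, 'da' );
        throw new RuntimeException( 'connection lost' );
    }
);
```

A cache check can then simply test for the final file name, because that name only ever appears after a complete write.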
Follow-up work
[ ] Implement topological sort of entities before importing them
[ ] Go over @TODOs and implement them
[ ] Scrutinize the pause/resume workflow. Can we avoid exposing string indices? Can we easily feed the downstream byte offset into the upstream byte reader to later resume reading the file where the last WXR entity started?
Testing instructions
Confirm the CI tests pass. This code isn't actually used anywhere yet, so there isn't a better way to test it.