OrchardCMS / OrchardCore

Orchard Core is an open-source modular and multi-tenant application framework built with ASP.NET Core, and a content management system (CMS) built on top of that framework.
https://orchardcore.net
BSD 3-Clause "New" or "Revised" License

Import from WordPress #4418

Open DannyT opened 5 years ago

DannyT commented 5 years ago

This will take a long time for me to finish, but I'm up for taking on the challenge of creating a WordPress import module. I am keen on migrating my personal site from WP to OC but have posts dating back to 2006 and don't really want to manually do anything and might as well learn some OC in the process.

Priorities for me are Posts and Media, I'm personally not too fussed about pages, menus, categories, tags, comments etc but I wouldn't want to eliminate the possibility of them for others in the future.

As far as I've thought about this so far, there are a few approaches available:

Export from WP (xml) > Manipulate xml into OC recipe JSON > Import using OC Import

The work here is done entirely outside of Orchard and could just be an executable, xslt transform or some other file read/write process. Pragmatic but doesn't sound very fun or rely on learning much OC...

Module to consume WP Rest API and create content via OC API

This feels like it would enable pretty granular control, but maybe requires a bit more work and would be more fragile to maintain?

Combination of the above

I like the idea of using the WP Rest API for user simplicity (just paste in a WP URL) but supporting file import would be handy for any sites not publicly accessible. Using the existing recipe format feels like a sensible approach to make use of all the existing functionality but would be nice to be able to invoke this from code assuming that's possible? For large sites though perhaps doing something more iteratively that could report progress would be useful.

UI wise, being able to simply enter in a WP site url and hit a button would be ideal, possibly with a verification step "You're about to import x blog posts, are you sure?" and a preview of some of the data incoming.

I'm really just guessing at the moment, so anyone with more insight/experience weighing on the best approach would be appreciated :)

hishamco commented 5 years ago

Hi Dan

This would be cool as a community project, unless there's a module that abstracts this out, because many people may want to import from WordPress, DNN, Umbraco, etc. I know it's crazy, but it's doable.

I created a tool to migrate e107 News to WordPress posts a few months ago, and it was a great experience.

I can think of this in two ways:

1- Export your WP posts into the Orchard Core recipe format, then you can easily use the Import/Export module

2- Create a custom module that connects to the WP database, reads the wp_posts table, and extracts its data into Orchard Core

sebastienros commented 5 years ago

That's a very nice initiative. I have done that in the past for other systems, the main one being https://weblogs.asp.net, which was migrated from another aspnet webforms solution. I also wrote a tool that exports an Orchard site into a WordPress one.

I think the most important things are reliability, repeatability, and idempotency. The idea is that a migration must not fail midway and leave the target site (the OC one) corrupted. Running the migration multiple times on the same site should also just work without duplicating content (idempotency), in case something goes wrong. Because of that, I assume the easiest approach is to use an intermediate recipe. This pivot format would allow generating the recipe from either the REST API or the file export, and then running it from an existing site or pushing it through the deployment endpoints.

The generation of the recipe should be deterministic, such that the content ids remain identical if you re-run the conversion. With that you would then be able to run the migration many times, and it would just update the new content elements if some have already been imported. And if you want to run the migration many times, it's better if the extraction of the content from WP and the conversion are two different steps, so you can export once, and iterate on the conversion many times, hence using the XML export format.

For the images, this could be done while the conversion is done, by downloading any asset that is not yet existing locally. But if the site is not accessible it will require some manual job to copy the files through ftp. I am sure some WP plugins must exist though to do that without FTP.

I think you can't ignore Pages and Comments. Pages should be very simple to handle, or at least not much work once all is done for blog posts. Comments might be trickier because we don't have a module yet, but it should provide a solution to export to Disqus at least, and maybe just by using the WP Disqus export plugin. And Tags/Categories can't be lost when migrating a Blog.

Another very important thing is to be able to do it for many sites. So being able to script it is very important. Even if it means scripting the export, then scripting the conversion and scripting the import.

Other things to consider:

Finally, if you decide to just go with the custom wizard directly within OC to point to an existing WP site, then it would still be a great solution, at least for a single blog.

agriffard commented 5 years ago

A new feature for importing from other formats would indeed be a great addition for convincing people to migrate to OC. It could also be interesting if we could directly specify an RSS feed, or import one, and generate the corresponding blog and posts.

DannyT commented 5 years ago

I've made some progress and have a simple console app that will consume a WP export and create a list of arbitrary WPPost POCOs. I've created a simple OC recipe that represents my intended format and am working on automatically creating this from the wp posts.

Originally I was creating more POCOs that replicate the recipe format, with a view to serializing them to matching JSON. However, this is getting pretty tedious because there is so much nesting and so many object types in a recipe, and all I'm really doing is adding the posts with everything else staying more or less the same. Would it be particularly bad of me to just use a StringBuilder and insert/replace the relevant content from WP?

sebastienros commented 4 years ago

Use the JObject classes directly to build the JSON. With LINQ it should be very easy.

Here are some examples: https://www.newtonsoft.com/json/help/html/CreateJsonDeclaratively.htm
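A minimal sketch of that suggestion, building a recipe with Json.NET's LINQ to JSON classes (requires the Newtonsoft.Json package). The `WPPost` class, the recipe name, and the exact property names in the content step are assumptions for illustration, modelled loosely on a blog post content item; compare them against a recipe exported from a real site before relying on them.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using Newtonsoft.Json.Linq;

// Hypothetical POCO for a post parsed from the WordPress export.
public class WPPost
{
    public int Id { get; set; }
    public string Title { get; set; }
    public bool IsDraft { get; set; }
}

public static class RecipeBuilder
{
    public static JObject Build(IEnumerable<WPPost> posts)
    {
        // One JObject per post; the property names here are assumptions.
        var items = posts.Select(p => new JObject(
            new JProperty("ContentItemId", p.Id.ToString(CultureInfo.InvariantCulture)),
            new JProperty("ContentType", "BlogPost"),
            new JProperty("DisplayText", p.Title),
            new JProperty("Published", !p.IsDraft),
            new JProperty("Latest", true),
            new JProperty("TitlePart", new JObject(new JProperty("Title", p.Title)))));

        // A single "content" step that carries all the converted posts.
        var contentStep = new JObject(
            new JProperty("name", "content"),
            new JProperty("data", new JArray(items)));

        return new JObject(
            new JProperty("name", "WordPressImport"),
            new JProperty("steps", new JArray(contentStep)));
    }
}

public static class Program
{
    public static void Main()
    {
        var posts = new[]
        {
            new WPPost { Id = 42, Title = "Hello world", IsDraft = false },
            new WPPost { Id = 43, Title = "Unfinished", IsDraft = true },
        };

        // JObject.ToString() produces indented JSON.
        Console.WriteLine(RecipeBuilder.Build(posts).ToString());
    }
}
```

Only the per-post objects vary, so the rest of the recipe can stay as a fixed skeleton around the `items` sequence.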

DannyT commented 4 years ago

I'm nearly there with a rough first pass at this. The main thing I need is a strategy for idempotency for Content Ids. In WordPress, the Post Ids are ints so I need a method for converting an int to the string format OC expects with a repeatable value.

I'm assuming I can't just use the string representation of the int (will double check this but thought I'd get the question out there for now).

Any pointers would be appreciated :)

hishamco commented 4 years ago

"In WordPress, the Post Ids are ints so I need a method for converting an int to the string format OC expects with a repeatable value."

Orchard Core uses a sort of GUID; have a look at IdGenerator.

deanmarcussen commented 4 years ago

@DannyT Sounds very promising.

IdGenerator generates a GUID and then makes it URL-friendly.

However, for your rough pass, I suspect you can actually just create a string representation of the int, and it will work fine. OC shouldn't be dependent on the default GUID generation; the id is just a string, so you should be able to make it any way you want, e.g. int.ToString()

I haven't tried this though, so give it a crack and see if it works.

Then later maybe you might just pad it out with a string const, so it looks a bit like what OC would produce
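A minimal sketch of the padding idea, assuming a hypothetical "wp" prefix and a target length of 26 characters (believed to match the ids Orchard Core generates, but treat that number as an assumption). The point is only that the same WordPress post id always maps to the same string.

```csharp
using System;
using System.Globalization;

public static class WordPressIds
{
    // Hypothetical prefix marking items that came from the WordPress import.
    private const string Prefix = "wp";

    // Assumption: pad to the same length as Orchard Core's generated ids.
    private const int TotalLength = 26;

    // Deterministic: the same post id always yields the same content item id,
    // so re-running the conversion produces identical ids.
    public static string ToContentItemId(int postId)
    {
        if (postId < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(postId));
        }

        var digits = postId.ToString(CultureInfo.InvariantCulture);
        return Prefix + digits.PadLeft(TotalLength - Prefix.Length, '0');
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(WordPressIds.ToContentItemId(42));
    }
}
```

An int has at most ten digits, so the padded value always fits in the remaining 24 characters.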

DannyT commented 4 years ago

Correct @deanmarcussen, that did indeed just work 👍

Have now got posts importing (incl drafts) via a recipe into the default TheBlog Theme, just tested with some 250 real posts without issue and repeatable too.

Still have to:

DannyT commented 4 years ago

Still plugging away at this but wanted to canvass opinion. WordPress has very explicitly defined "Categories" and "Tags". Currently in OC we have a Taxonomies module which has quite a bit of discussion around it but seemingly no consensus as yet: https://github.com/OrchardCMS/OrchardCore/issues/2468

What are people's opinions on what we should do with the incoming categories and tags from WP? Would it make sense to create two taxonomies, one called Categories and another Tags?

@sebastienros you're probably sick of thinking about taxonomies/tags and I can see you changing your mind on it 3 times in that thread... what's your current position? 😛

sebastienros commented 4 years ago

I believe I only changed my mind once ;) From wanting a Tags module to agreeing that it should be a taxonomy. I only encouraged Arra to work on the tags because he would be autonomous this way, but I still want it to be based on Taxonomies to be part of OC.

So my suggestion is to create two taxonomies, Categories and Tags. It's just that right now you won't have a good way to "render" items based on a term because it's not supported oob. But I'd still go with that as this is the best long term plan.

sebastienros commented 4 years ago

Maybe I already mentioned that, but please take into account a URL mapping rule so the source URLs can be converted to new ones. For example, //foo.com/bar could be mapped to //bar.com/foo. And if possible, generate a file (JSON or text) that contains all these conversions, in case one has to build some kind of redirection script for SEO.
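A minimal sketch of such a conversion file, assuming the simplest possible rule: only the host changes and the path is kept. The `RedirectMap` class, the `redirects.txt` file name, and the tab-separated layout are all hypothetical; a rule that also rewrites paths would slot into `MapToHost`.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class RedirectMap
{
    // Moves a source URL onto the new host, keeping scheme, path and query.
    public static Uri MapToHost(Uri source, string newHost)
    {
        var builder = new UriBuilder(source) { Host = newHost };
        return builder.Uri;
    }

    // One "old<TAB>new" line per source URL.
    public static IEnumerable<string> ToLines(IEnumerable<Uri> sources, string newHost)
    {
        return sources.Select(s => s.AbsoluteUri + "\t" + MapToHost(s, newHost).AbsoluteUri);
    }
}

public static class Program
{
    public static void Main()
    {
        var sources = new[]
        {
            new Uri("https://foo.com/2006/05/first-post/"),
            new Uri("https://foo.com/about/"),
        };

        // The resulting file can feed whatever redirect mechanism is used later.
        File.WriteAllLines("redirects.txt", RedirectMap.ToLines(sources, "bar.com"));
    }
}
```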

DannyT commented 4 years ago

Noted re taxonomies 👍

For redirects, currently I'm maintaining the original site permalink/autoroute structure. I could add in the ability to change them and create a redirect but would that make more sense as a separate module (e.g. "Bulk alias update + redirect")? That would then be usable say, if you were doing an overhaul of an existing site without migrating content between platforms...

sebastienros commented 4 years ago

It should/could be a different module (you have one for this, I assume), but having some file with the data would be a step toward configuring this module. And if this module accepts some kind of recipe step, then it could output the recipe step with all these redirects.

DannyT commented 4 years ago

HTTP Error 404.13 - Not Found
The request filtering module is configured to deny a request that exceeds the request content length.

Most likely causes:
Request filtering is configured on the Web server to deny the request because the content length exceeds the configured value.

Can anyone offer a solution for this? I can see there's been some discussion and evolution of how this is handled for media specifically but not for imports. I can edit the server config by adding a web.config to my project but the fact there isn't one already makes me think that's not the desired practice?

deanmarcussen commented 4 years ago

Yes, for media we have an attribute that lets aspnetcore raise the form length limit above 128mb for uploading large media files.

But for this to work with IIS / IIS Express you also need to update the web.config settings, as that will apply a limit before aspnetcore gets a hold of it.

Equally, I'm not sure from the message you have above what you're actually running this with: dotnet run, or VS with IIS Express?

Normally if it's related to the formoptions.formlength there will be an exception as well, possibly you'll only see that in your log files though, so might be worth a check?

How big are the files you're actually trying to upload? (There are a few other request limits that might be applied by aspnetcore, which have different sizes by default.)

I have a workaround I can give you for the form limits issue (if it's that), but we also may need a setting on the import controller for that, so it's configurable

DannyT commented 4 years ago

Thanks @deanmarcussen the web.config approach I linked worked (VS with IIS Express) so I'll stick with that, just wanted to check if there was a more OC-expected way.

Nothing in the logs, though 🤷‍♂️

deanmarcussen commented 4 years ago

Cool 😄 you're probably hitting iisexpress limits rather than aspnetcore limits (so nothing will log). 128mb is the default aspnetcore formlength limit

DannyT commented 4 years ago

Still working on this, when I get the time 😬 I've upgraded to RC1 but as a result I am back to having filesize issues again when I upload a large recipe.

@deanmarcussen I don't suppose you can point me in the right direction as I see you've contributed to seemingly related issues from my searching...

BadHttpRequestException: Request body too large.
Microsoft.AspNetCore.Server.IIS.BadHttpRequestException.Throw(RequestRejectionReason reason)
Microsoft.AspNetCore.Server.IIS.Core.IISHttpContext.InitializeRequestIO()
Microsoft.AspNetCore.Server.IIS.Core.IISHttpContext.ReadAsync(Memory<byte> memory, CancellationToken cancellationToken)
System.Runtime.CompilerServices.ValueTaskAwaiter<TResult>.GetResult()
Microsoft.AspNetCore.Server.IIS.Core.HttpRequestStream.ReadAsyncInternal(Memory<byte> buffer, CancellationToken cancellationToken)
Microsoft.AspNetCore.WebUtilities.BufferedReadStream.EnsureBufferedAsync(int minCount, CancellationToken cancellationToken)
Microsoft.AspNetCore.WebUtilities.MultipartReaderStream.ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
Microsoft.AspNetCore.WebUtilities.StreamHelperExtensions.DrainAsync(Stream stream, ArrayPool<byte> bytePool, Nullable<long> limit, CancellationToken cancellationToken)
Microsoft.AspNetCore.WebUtilities.MultipartReader.ReadNextSectionAsync(CancellationToken cancellationToken)
Microsoft.AspNetCore.Http.Features.FormFeature.InnerReadFormAsync(CancellationToken cancellationToken)
Microsoft.AspNetCore.Antiforgery.DefaultAntiforgeryTokenStore.GetRequestTokensAsync(HttpContext httpContext)
Microsoft.AspNetCore.Antiforgery.DefaultAntiforgery.ValidateRequestAsync(HttpContext httpContext)
Microsoft.AspNetCore.Mvc.ViewFeatures.Filters.ValidateAntiforgeryTokenAuthorizationFilter.OnAuthorizationAsync(AuthorizationFilterContext context)
Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|19_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, object state, bool isCompleted)
Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Logged|17_1(ResourceInvoker invoker)
Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)
SixLabors.ImageSharp.Web.Middleware.ImageSharpMiddleware.Invoke(HttpContext context)
Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
OrchardCore.Diagnostics.DiagnosticsStartupFilter+<>c__DisplayClass3_0+<<Configure>b__1>d.MoveNext() in DiagnosticsStartupFilter.cs
Microsoft.AspNetCore.Diagnostics.StatusCodePagesMiddleware.Invoke(HttpContext context)
OrchardCore.Modules.ModularTenantRouterMiddleware.Invoke(HttpContext httpContext) in ModularTenantRouterMiddleware.cs
OrchardCore.Environment.Shell.Scope.ShellScope.UsingAsync(Func<ShellScope, Task> execute) in ShellScope.cs
OrchardCore.Modules.ModularTenantContainerMiddleware.Invoke(HttpContext httpContext) in ModularTenantContainerMiddleware.cs
Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware.Invoke(HttpContext context)

Skrypt commented 4 years ago

Not sure but you can try that one :

https://www.talkingdotnet.com/how-to-increase-file-upload-size-asp-net-core/

deanmarcussen commented 4 years ago

@DannyT yes, what @Skrypt suggested, although that link is missing a couple of other ways to fix it

For more info refer https://github.com/OrchardCMS/OrchardCore/issues/4633

or put this in your main startup projects startup

    // FormOptions lives in Microsoft.AspNetCore.Http.Features
    services.Configure<FormOptions>(x =>
    {
        x.ValueLengthLimit = Int32.MaxValue;
        x.MultipartBodyLengthLimit = Int64.MaxValue; // In case of multipart
    });

They must have changed the exception handling slightly in aspnetcore, as the error looks like it's being rethrown as a BadHttpRequestException, but underneath I think you're just hitting the same FormFeature issue

DannyT commented 4 years ago

I already had the web.config solution from the link @Skrypt shared which did work pre-RC. I fixed it with the following Startup Configure method:

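    app.Use(async (context, next) =>
    {
        // Lift the per-request body size limit; the feature can be missing or read-only.
        var feature = context.Features.Get<IHttpMaxRequestBodySizeFeature>();
        if (feature != null && !feature.IsReadOnly)
        {
            feature.MaxRequestBodySize = null;
        }

        await next.Invoke();
    });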

Which feels like it's doing the same thing, so not sure why the web.config value is no longer working...

deanmarcussen commented 4 years ago

Right, thanks for the update @DannyT, that's really useful.

Things have changed again with core3, and size limits (it seems ever changing) and I just checked and it has broken the media filter size attribute. I'll open another issue for that and sort it out.

So I think (and still investigating) that they've applied the IHttpMaxRequestBodySizeFeature to HttpSys as well as Kestrel. Last time I worked on it, it only seemed to be happening on kestrel. (or I got it wrong ;) ).

For info, the other way to configure this for every request, without middleware, is to use:

    services.Configure<HttpSysOptions>(x =>
    {
        x.MaxRequestBodySize = Int64.MaxValue; // defaults to 30,000,000
    });

And lastly noting that if you go over 128mb you'll also hit the default FormOptions limit.

deanmarcussen commented 4 years ago

Ok, further update. MediaSize attribute is fine, I just failed to configure it correctly.

But yes, what you need is both settings, the web.config which will configure IISExpress or IIS, and then the aspnetcore configuration, which will configure the internal aspnetcore server.
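For the IIS / IIS Express half of that, a minimal web.config sketch follows. The 200 MB value is purely illustrative, and this fragment only raises the IIS request filtering limit; the ASP.NET Core limits discussed above are configured separately.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <system.webServer>
    <security>
      <requestFiltering>
        <!-- maxAllowedContentLength is in bytes; 209715200 = 200 MB (illustrative value) -->
        <requestLimits maxAllowedContentLength="209715200" />
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>
```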

LeonarddeR commented 3 years ago

Is this code available to test somewhere? I'm considering importing old WordPress data from an XML export into Orchard.

DannyT commented 3 years ago

@leonardder You can have a look at this: https://github.com/DannyT/DannyT.OrchardCoreMigrator

But to be honest I've not touched it since and it wasn't the finest code in the first place. I do intend to pick it back up and get my neglected blog over to OC eventually though....

Meligy commented 3 years ago

Hi there, What's the state of the art in this area please? Is there any good option?

Thanks heaps.

lampersky commented 2 years ago

Maybe somebody will find this useful: I've created a simple app using the Workflows module that can perform a basic import. https://github.com/OrchardCMS/OrchardCore/discussions/11516#discussioncomment-2703296