WikiTransformationProject / wikitraccs-releases

Releases, issues and discussions for WikiTraccs from the Wiki Transformation Project
https://www.wikitransformationproject.com

Confluence transfer stuck near end #46

Closed craigjm closed 11 months ago

craigjm commented 1 year ago

I have a space I am trying to migrate, but it seems like WikiTraccs gets stuck with 200 pages left and doesn't finish. I can cancel the job at the console, but is there a way to find out what the problem is and finish the migration?

heinrich-ulbricht commented 1 year ago

@craigjm Good question why it skips those. The log (WikiTraccs.GUI/logs/WikiTraccs.debug*.log) might contain error messages that could explain things. But currently there is no convenient way of diagnosing this. The data to diagnose this is technically there but hard to interpret.

Two approaches.

The first approach to finding the missing pages: diffing page IDs manually

Check the number of migrated pages for this space. Maybe the progress bar is off but the actual page count is correct.

This can be done by filtering the Site Pages library by space key, then you can export and further analyze e.g. via Excel.


Is the number of page IDs equal to 3588? Or higher?

Now the missing pages can be found via the REST endpoint in Confluence, just like WikiTraccs gets them:

https://yourconfluence.com/rest/api/space/ASOCR/content?start=0&limit=1000&type=page - depending on how many pages your Confluence instance is willing to return per API call, this has to be called multiple times with an increasing start parameter.
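The repeated calls with an increasing start parameter could be sketched roughly as follows. This is a minimal sketch, not WikiTraccs code: `collect_page_ids` and `fetch_batch` are hypothetical names, and `fetch_batch(start, limit)` stands in for whatever performs the HTTP call to the endpoint above and extracts the page IDs from the response (the response layout is not shown in this thread). Because the instance may return fewer pages than requested, the loop advances `start` by the number of IDs actually received and stops at the first empty batch.

```python
from typing import Callable, List


def collect_page_ids(
    fetch_batch: Callable[[int, int], List[str]],
    limit: int = 1000,
) -> List[str]:
    """Collect all page IDs of a space by paging through the REST endpoint.

    fetch_batch(start, limit) is a hypothetical helper (an assumption, not a
    WikiTraccs or Confluence function). It should request one batch and
    return the page IDs found in it, or an empty list when nothing is left.
    """
    page_ids: List[str] = []
    start = 0
    while True:
        batch = fetch_batch(start, limit)
        if not batch:
            # An empty batch means there are no more pages to fetch.
            break
        page_ids.extend(batch)
        # The server may cap the batch size below `limit`, so advance by
        # the number of IDs actually received.
        start += len(batch)
    return page_ids
```

The resulting list is kept as returned, duplicates included, so that it can be inspected before comparing it with the SharePoint export.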

The page IDs returned by Confluence that are absent from the Excel export are the missing pages. With those page IDs I would search the log for messages, especially any errors.
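The manual diff can also be scripted. The following is a minimal sketch under assumptions: `missing_page_ids` is a hypothetical helper, the Confluence IDs come from the REST calls above, and the SharePoint IDs are read from the exported list by whatever means is convenient. IDs are compared as trimmed strings because spreadsheet exports sometimes turn them into numbers or add whitespace.

```python
from typing import Iterable, List


def missing_page_ids(
    confluence_ids: Iterable[object],
    sharepoint_ids: Iterable[object],
) -> List[str]:
    """Return Confluence page IDs that do not appear in the SharePoint export.

    Hypothetical helper for illustration. The order of the Confluence list
    is preserved and each missing ID is reported only once.
    """
    migrated = {str(page_id).strip() for page_id in sharepoint_ids}
    reported = set()
    missing: List[str] = []
    for page_id in confluence_ids:
        normalized = str(page_id).strip()
        if normalized not in migrated and normalized not in reported:
            reported.add(normalized)
            missing.append(normalized)
    return missing
```

The returned IDs are the ones to search for in the debug log.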

The second approach to finding the missing pages: wait for the next release

WikiTraccs should make this easier. A release is scheduled for the coming week. I'll integrate the diffing of what has been migrated and what needs to be migrated into this next release. So the log file will tell which pages are yet to be migrated. This can then be used for further diagnosis.

Ultimately this information should probably go to the space inventory as migration result information.

Sorry that it's not easier at the moment. It should and it will be.

heinrich-ulbricht commented 1 year ago

@craigjm Would you please try the latest release v1.1.1 and have a look at the progress log files for the space in question.

I'd like to know if the progress bar is just off or if there is really something missing. If so, the new progress log files will tell exactly which pages are missing. This can then be the basis for further investigation.

Just start the migration again and WikiTraccs logs information about all spaces that are marked for migration.

heinrich-ulbricht commented 1 year ago

@craigjm I observed the same behavior in another context.

Those 200 pages have been migrated to SharePoint before and have been updated in Confluence since then.

Then, when running another migration for the same space, WikiTraccs will happily migrate any new pages that were created in Confluence. But it won't overwrite existing pages from a previous migration run, so as not to overwrite any changes that have since been made on the SharePoint side.

The progress log files added in release v1.1.1 should tell exactly which pages are affected - all 200 should be in the update-state-of-migrated-pages.txt file, marked as needsupdate.

You can delete the SharePoint pages that should be updated and restart the migration. It will create the pages again.

Looking into the future: would a "force-overwrite" mode help? Are you migrating all Confluence pages at once, or are you doing it in waves? Is there a risk to interfere with user-made page edits, or can't they access the site during the migration anyway?

craigjm commented 1 year ago

We are still in the testing phase of these migrations, so I'm migrating one Confluence space at a time so the owner can check it out, knowing that we might need to remigrate to fix issues. I do not expect to make changes in SharePoint and then migrate again, but people will make changes in Confluence before the final migration.

For this particular space, it is another migration for the Confluence space, but it was to a new site in SharePoint. Does that count as another migration? I also ended up trying to run it again to see if it would finish.

When I look at Site Pages in SharePoint, I see a lot of pages with multiple Failed Transformations. Some have 100% Text Transferred, but others have less.

Some of the pages are in the -not-yet-migrated-pages log, others are in the -update-state-of-migrated-pages log.

A "force-overwrite" mode would definitely be useful for me, because I will always want to replace what is in SharePoint with the final migration from Confluence.

heinrich-ulbricht commented 1 year ago

@craigjm This sounds like an interesting source Confluence.

WikiTraccs skips Confluence pages if it finds those pages already present in the target SharePoint site, identified by page ID. If there is no page yet, it should migrate.
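The skip behavior described in this thread can be illustrated with a small sketch. It is not WikiTraccs code, and how WikiTraccs actually detects that a page changed in Confluence is not stated here. The sketch assumes a version number per page ID on both sides. `classify_pages` and the bucket names `notyetmigrated` and `uptodate` are hypothetical. Only `needsupdate` is a marker mentioned in the thread.

```python
from typing import Dict, List


def classify_pages(
    confluence_versions: Dict[str, int],
    migrated_versions: Dict[str, int],
) -> Dict[str, List[str]]:
    """Sort Confluence page IDs into three buckets.

    Assumption for illustration: each dict maps a page ID to a version
    number, and a higher number on the Confluence side means the page was
    edited after it had been migrated.
    """
    buckets: Dict[str, List[str]] = {
        "notyetmigrated": [],  # no page with this ID in the target site
        "needsupdate": [],     # migrated before, changed in Confluence since
        "uptodate": [],        # migrated and unchanged
    }
    for page_id, version in confluence_versions.items():
        if page_id not in migrated_versions:
            buckets["notyetmigrated"].append(page_id)
        elif migrated_versions[page_id] < version:
            buckets["needsupdate"].append(page_id)
        else:
            buckets["uptodate"].append(page_id)
    return buckets
```

In this model only the first bucket is created on a rerun. Pages in the second bucket are skipped until the corresponding SharePoint page is deleted.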

The not-yet-migrated pages should be created one after another when starting the migration. They are waiting to be migrated. If they aren't created then something happened, preventing them from being created. Something like #3 comes to mind (too long page titles). I could read this from the log file.

So I see different topics here:

  1. pages that are present in SharePoint, but have been updated in Confluence; those need to be deleted in SharePoint to get an update there, as long as #47 is waiting to be added
  2. pages that are not being migrated, for unknown reason; this should be visible in the log files
  3. Failed Transformations and != 100% Text Transferred Percent - those are probably macros and/or layouts that WikiTraccs does not know how to handle yet; I'd need the storage format of those pages plus the info from the Migration Log column in the site pages library

I'd like to look into any of those topics depending on your time and priorities.

The page Troubleshooting Strategies shows which diagnosis information can be found where and how to get the storage format of pages.

I'd be very interested to look at the log files and storage format of pages with off Text Transferred Percent. Often those share a similarity in structure, or the same macro that WikiTraccs cannot (yet) handle.

craigjm commented 1 year ago

Ok, first I'll upload the log files from the new version. I'll take a look at some pages with < 100% transferred and get the storage format with the info from the site pages library next week. Thanks for the assistance!

heinrich-ulbricht commented 1 year ago

@craigjm Thanks, please send the other log files from the logs directory as well. I'd suggest sending them via email to contact@wikitransformationproject.com, as their content is usually not for the public eye.

heinrich-ulbricht commented 1 year ago

@craigjm There is already something interesting in the logs you provided. About 200 pages are listed twice. To me it looks like WikiTraccs migrated all pages, but somehow some pages were marked for migration twice and then skipped once (because already migrated), thus skewing the count.

What I don't know yet is whether Confluence already provides the duplicates to WikiTraccs, or whether it's happening further down the road.

heinrich-ulbricht commented 1 year ago

@craigjm Would you please run the migration for the large space again using the latest release v1.3.7?

From reading the log I get the impression that all pages were migrated from Confluence to SharePoint, but some pages are coming back twice from Confluence. WikiTraccs now logs all page IDs it gets from Confluence for each space, and also actively checks for duplicates, so duplicates should show up in the logs. Please send me the logs for this run.

craigjm commented 1 year ago

It got further this time! 3338/3779. I'm mailing the logs over now.

heinrich-ulbricht commented 1 year ago

@craigjm Quick update on the issue: when WikiTraccs asks Confluence for page IDs, Confluence returns 400 duplicate page IDs for the "cit" space. Those duplicates mess with the overall bookkeeping of how many pages have been migrated, how many are still due, etc.

The obvious thing to do is to add a duplicate removal step to WikiTraccs, which will require an update. This should fit into the maintenance release planned for next week.
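A duplicate removal step of this kind could look roughly like the following. This is only a sketch of the general idea, not the step that went into WikiTraccs. `deduplicate_page_ids` is a hypothetical name. It keeps the first occurrence of each ID in the original order and also reports which IDs occurred more than once, so they can be logged.

```python
from collections import Counter
from typing import Dict, List, Tuple


def deduplicate_page_ids(page_ids: List[str]) -> Tuple[List[str], Dict[str, int]]:
    """Remove repeated page IDs while keeping the original order.

    Returns the unique IDs plus a mapping of every duplicated ID to the
    number of times it occurred. Hypothetical helper for illustration.
    """
    counts = Counter(page_ids)
    # dict.fromkeys keeps only the first occurrence of each key, in order.
    unique = list(dict.fromkeys(page_ids))
    duplicates = {page_id: n for page_id, n in counts.items() if n > 1}
    return unique, duplicates
```

Counting against the unique list instead of the raw list keeps a page that was returned twice from being counted as two pages due for migration.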

heinrich-ulbricht commented 1 year ago

@craigjm Page de-duplication has been added to the latest release v1.3.13. Could you please check if this makes the progress bar reach its end? At least the behavior should change compared to last time.

craigjm commented 1 year ago

I started a run on the newest release (1.3.13). It looks like progress is still stuck at 3334/3381, but hopefully the log files I sent have some useful information.


heinrich-ulbricht commented 1 year ago

@craigjm The latest release v1.4.6 contains progress bar improvements. Outdated pages are now skipped. This has the chance - together with the previously added duplicate removal - to push the progress bar to 100%.

heinrich-ulbricht commented 1 year ago

Note: I found an issue in the Atlassian community that describes duplicate pages being returned, as well as pages going missing. One possible solution from Atlassian support is to rebuild the content index.

github-actions[bot] commented 12 months ago

This issue is stale because it has been open 20 days with no activity. Remove stale label or comment, or this will be closed in 10 days.

github-actions[bot] commented 11 months ago

This issue was closed because it has been stalled for 10 days with no activity.