optimize xinclude - Githubissues

Conal-Tuohy commented 3 years ago

Running XInclude over the entire corpus takes more than 15 minutes. Is it worth optimizing this? A custom XProc step that performs XInclude and also caches intermediate results could run fairly cheaply. It would have a significant effect to the extent that documents in the corpus XInclude many copies of resources which themselves XInclude other resources, since each recursive XInclude would then be executed only once. It would also effectively memoize any XPath selectors used by the XInclude statements.
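To make the caching idea concrete, here is a minimal sketch in Python rather than XProc. It is not the custom step itself: the in-memory `CORPUS`, the file names, and the `expand` helper are all hypothetical, it only handles whole-document includes (no XPointer selection), and it does no cycle detection.

```python
import copy
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
XI_NS = 'xmlns:xi="http://www.w3.org/2001/XInclude"'

# Hypothetical in-memory corpus standing in for files on disk.
CORPUS = {
    "root1.xml": f'<doc {XI_NS}><xi:include href="shared.xml"/></doc>',
    "root2.xml": f'<doc {XI_NS}><xi:include href="shared.xml"/></doc>',
    "shared.xml": f'<part {XI_NS}><xi:include href="leaf.xml"/></part>',
    "leaf.xml": "<leaf>text</leaf>",
}

_cache = {}       # href -> fully expanded element tree
parse_count = {}  # href -> number of times the resource was actually expanded


def expand(href):
    """Return the fully expanded tree for href, computing it at most once."""
    if href in _cache:
        return _cache[href]
    parse_count[href] = parse_count.get(href, 0) + 1
    root = ET.fromstring(CORPUS[href])
    _expand_children(root)
    _cache[href] = root
    return root


def _expand_children(parent):
    """Replace xi:include children with a copy of the cached expansion."""
    for index, child in enumerate(list(parent)):
        if child.tag == XI_INCLUDE:
            replacement = copy.deepcopy(expand(child.get("href")))
            replacement.tail = child.tail
            parent[index] = replacement
        else:
            _expand_children(child)


expand("root1.xml")
expand("root2.xml")
print(parse_count["shared.xml"], parse_count["leaf.xml"])  # 1 1
```

Both root documents pull in `shared.xml`, but it and `leaf.xml` are each expanded only once; the second root gets a copy of the cached result.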

Conal-Tuohy commented 2 years ago

I think probably the best optimisation would be to run the XInclude processor over the p5/includes, p5/combo and p5/metadata directories before running it over the p5 directory. At the moment, each file in p5 makes about 10 transclusions from various files in those subfolders, and typically each of those transclusions requires several more transclusions (primarily from p5/includes), and some of those require a few more again. Allowing the root documents to drive the transclusions recursively means a huge number of transclusions are performed many times over. Performing those XIncludes starting at the leaves and working up towards the root documents will reduce the number of transclusions drastically. It should bring the total time down to a few minutes, I would think.
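A rough way to see the difference is to count transclusions on a toy dependency graph. The corpus shape below (100 root documents, 10 combo files, 3 leaf files) and all file names are made up for illustration; they are not measurements of the real corpus.

```python
def root_driven_cost(doc, includes):
    """Count transclusions when expanding doc recursively with no reuse of results."""
    return sum(1 + root_driven_cost(target, includes) for target in includes.get(doc, []))


def leaf_first_cost(includes):
    """Count transclusions when every xi:include is resolved once against an expanded target."""
    return sum(len(targets) for targets in includes.values())


# Hypothetical corpus: 3 leaf files, 10 combo files, 100 root documents.
leaves = [f"includes/inc{i}.xml" for i in range(3)]
combos = [f"combo/combo{i}.xml" for i in range(10)]
roots = [f"root{i}.xml" for i in range(100)]

# Every combo file includes every leaf; every root includes every combo file.
includes = {combo: leaves for combo in combos}
includes.update({root: combos for root in roots})

root_driven_total = sum(root_driven_cost(root, includes) for root in roots)
leaf_first_total = leaf_first_cost(includes)

print(root_driven_total)  # 4000
print(leaf_first_total)   # 1030
```

In this toy model the saving comes entirely from the shared subfolder files being expanded once instead of once per root document; each extra level of nesting multiplies the root-driven count but only adds to the leaf-first one.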

We don't actually want to modify the files in p5/includes, p5/combo and p5/metadata in place, since the idea is eventually to make the p5 folder (with its subfolders) the source of truth. So some renaming of the existing folders is probably worth doing at this point. Maybe the import from acsproj should go into a new source folder with includes, combo and metadata subfolders, which would later serve as the source files for direct editing once acsproj is retired? The XInclude pipeline would copy that entire source tree to p5, and then use XInclude to modify the files in place: first in p5/includes, then in p5/combo and p5/metadata (in either order), and finally in the p5 directory itself.
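The copy-then-resolve flow could be sketched like this, again in Python rather than as the actual XProc pipeline. The `build` and `resolve_in_place` helpers are hypothetical, only whole-document includes are handled (no XPointer), and the sketch assumes files within a single stage do not include one another; if they do, that stage would need its own ordering or a recursive pass.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
# Stage order as proposed above; combo and metadata could be swapped.
STAGES = ["includes", "combo", "metadata", "."]


def resolve_in_place(path):
    """Replace each xi:include in one file with the root of its already expanded target."""
    tree = ET.parse(path)
    # Snapshot the elements first so that replacing children does not disturb iteration.
    for parent in list(tree.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == XI_INCLUDE:
                # href is resolved relative to the including file's directory.
                target = ET.parse(path.parent / child.get("href")).getroot()
                target.tail = child.tail
                parent[index] = target
    tree.write(path, encoding="utf-8", xml_declaration=True)


def build(source, output):
    """Copy the source tree to the output folder, then resolve XIncludes stage by stage."""
    source, output = Path(source), Path(output)
    shutil.copytree(source, output, dirs_exist_ok=True)  # requires Python 3.8+
    for stage in STAGES:
        for path in sorted((output / stage).glob("*.xml")):
            resolve_in_place(path)
```

Calling `build("source", "p5")` would leave the source folder untouched and rewrite only the copies under p5, so by the time the top-level documents are processed every file they reference is already fully expanded.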

What do you think, @jawalsh?

jawalsh commented 2 years ago

Sounds like a good plan. Please proceed!

John

On Jul 19, 2022, at 1:01 AM, Conal Tuohy wrote:

> [quoted copy of the preceding comment]

Conal-Tuohy commented 2 years ago

Pre-transcluding the files in the includes, combo and metadata subfolders did cut the total runtime of the XInclude step by a fair bit (now down to 4 minutes 33 seconds on my development VM). That's a big improvement, but still not exactly speedy. Shall I go ahead with moving the transclusion into a background thread as well?

Conal-Tuohy / swinburne

optimize xinclude #12