apache / lucenenet

Apache Lucene.NET
https://lucenenet.apache.org/
Apache License 2.0

Docs - Build/Deploy Automation #282

Closed Shazwazza closed 4 years ago

Shazwazza commented 4 years ago

Building the API documentation files for a given release is currently a manual process and also involves manually updating the websites.

As part of our build/deploy pipeline we need to automate as much of this as possible.

NightOwl888 commented 4 years ago

I just noticed that the docs for lucene-cli in the repository haven't been updated to match the NuGet version. I realize you haven't actually built the Lucene.NET 4.8.0-beta00008 docs yet, but has updating the docs in the repository been factored into the deployment process for the docs? Or would it be better if we did it at some other stage of the release?

Shazwazza commented 4 years ago

Good catch, I actually built and deployed the docs a few days ago and was just about to update the website with the new links.

That said, part of the docs build and automation process I've already created is the ability to have custom token replacements for environment variables within the docs, which is perfect for this scenario. I'll update the docs to use this variable for this page.

I'll keep going with my tasks for now but I can re-generate this specific doc page and re-deploy once everything is done.
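As a sketch of how such a token replacement could work at build time — the `[EnvVar:LuceneNetVersion]` token syntax is taken from this thread, but the file names and the `sed` step are illustrative, not the actual pipeline:

```shell
#!/bin/sh
# Sketch of the token-replacement step (illustrative, not the real pipeline).

# Sample source doc containing the custom token:
printf 'dotnet tool install lucene-cli -g --version [EnvVar:LuceneNetVersion]\n' > lucene-cli.md

# The build pipeline would export the real version number:
LuceneNetVersion="4.8.0-beta00008"

# Replace every [EnvVar:LuceneNetVersion] token with the actual value:
sed "s/\[EnvVar:LuceneNetVersion\]/$LuceneNetVersion/g" lucene-cli.md > lucene-cli.out.md

cat lucene-cli.out.md
# -> dotnet tool install lucene-cli -g --version 4.8.0-beta00008
```

The source markdown stays readable on GitHub (it just shows the token), while the published docs get the concrete version.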

NightOwl888 commented 4 years ago

That said, part of the docs building and automation process that I've already made is to be able to have custom token replacements for environment variables within the docs

Can you think of a way we might tackle this while keeping the Markdown doc readable within GitHub on the current version, or is that too big of an ask?

Shazwazza commented 4 years ago

The MD doc will be readable; it will just look like this:

dotnet tool install lucene-cli -g --version [EnvVar:LuceneNetVersion]

It isn't going to show the current version, though. I can't really make that happen, and I'm not really sure it adds much benefit. It would mean the docs build has to update both the source and the resulting files and commit all of that, which isn't too nice IMO.

NightOwl888 commented 4 years ago

My concern is if a Google search ends up on this page. GitHub has a nice markdown processor and prints it out in human-readable form, but the page has no link to the "finished" documentation so it might not be obvious how to get there.

But looking at that page again, if the token were made to be human readable, it would be a reasonable tradeoff. Something like

dotnet tool install lucene-cli -g --version <currentVersion>

Do note the problem will go away when we release a production version. At that point, specifying a version will become optional and omitting it will install the latest version of the tool.

Shazwazza commented 4 years ago

Currently it will show up like:

dotnet tool install lucene-cli -g --version [EnvVar:LuceneNetVersion]

since that is the custom markdown token. I can modify this if it's that important? I'm just not sure whether <> characters can be used.

FYI: I've deployed the updated site with links to the download files and docs, plus documentation on building the docs/APIs. The download links are also fixed and link through to the correct mirror site, etc.

(you might need to ctrl + f5)

NightOwl888 commented 4 years ago

Currently it will show up like:

dotnet tool install lucene-cli -g --version [EnvVar:LuceneNetVersion]

That format should be fine.

View Source Links

I noticed that the "View Source" links are broken in the latest docs (example). It is going to a personal GitHub account of yours:

https://github.com/Shazwazza/lucenenet/blob/docs-may/src/Lucene.Net.Demo/IndexFiles.cs/#L40

but it resolves to a 404 page.

Shazwazza commented 4 years ago

Oops, I'll fix that up; I must have missed that part in my process/docs/automation.

NightOwl888 commented 4 years ago

Codecs Documentation Links

I am working on documenting the configuration of codecs #266, and I noticed that the documentation on the beta-00008 file is now linking to the wrong document (looks like it was the right document in the beta-00007 document).

I found this by first navigating to the API docs home page, then clicking Lucene.Net > Lucene.Net.Codecs in the menu (which looks correct).

Oddly, both of them have incorrect URLs and breadcrumbs (which point to the test framework).

[screenshot: breadcrumbs incorrectly pointing to the test framework]

It would seem that the correct breadcrumb should be just

[screenshot: the expected breadcrumb]

New "Namespace" Document

It feels like the right place for adding a tutorial-style document would be in the "namespace" documentation (correct me if I am wrong).

However, what I am grappling with is how to update the document so it has the updated information for the .NET way of configuring codecs while fitting into the current document pipeline. Looking at the new API doc generation procedure,

The documentation generation is a complex process because it needs to convert the Java Lucene project's documentation into a usable format to produce the output Lucene.Net's documentation.

The process overview is:

  • Use the JavaDocToMarkdownConverter project within the DocumentationTools.sln solution to convert the Java Lucene project's docs into a usable format for DocFx. This tool uses a release tag output of the Java Lucene project as its source to convert against Lucene.NET's source.
  • Run the documentation build script to produce the documentation site
  • Publish the output to the lucenenet-site repository, into a correspondingly named version directory

We don't want to manually change the converted markdown files (.md) because they would get overwritten when the conversion process is re-executed. Therefore, to fix formatting issues or customize the output of the project docs, these customizations/fixes/tweaks are built directly into the conversion process itself, in the JavaDocToMarkdownConverter.csproj project.

Just what is the plan to update the documents to eliminate all of the Java-centric instructions and replace them with .NET equivalents? Can we expand the API doc generation document to include an example of just that?

Original Plan for JavaDocToMarkdownConverter

Do note that the original idea behind the JavaDocToMarkdownConverter project was to get it to a point where we are happy that it will automate around 80-90% of the work, then do a final automated conversion for 4.8.0 followed by a manual cleanup operation to replace the Java-centric text and code examples with their .NET equivalents. The plan was to freeze the automated generation once we were happy with the automated tokens and URL locations, so we could then update the .md documents with the correct info.

This automated generation would take place once each time we port a new Lucene version from Java to .NET, and we would need to use some kind of text cleanup procedure to make sure we don't accidentally roll back to Java instructions, but still detect changes in the Java documentation so we can integrate them, if needed.

We can leave this in the pipeline if it is possible, but we are in need of a solution to update the documents to include .NET information that is totally missing and remove the irrelevant Java configuration information and code examples.

Short Term Plan

So, in the near term, where is the best place to put the new namespace documentation for Codecs?

Long Term Plan

This is my proposal, but feel free to improve upon this idea or propose a different one. Perhaps we should aim to break this into 2 phases:

  1. JavaDoc to Markdown phase
    • Converts the URLs into specialized tokens
    • Converts namespaces
    • Saves any additional state needed for the second phase (TOC, etc)
    • Run via command line manually when we target a new Lucene version (happens rarely)
    • Updates "stepping stone" documents that will be checked into the repo so it is easy to detect changes
    • The command line tool could also automatically make a backup of just the "stepping stone" documents before the conversion to a directory outside of the repo so they can be compared with BeyondCompare
    • The "stepping stone" documents will each have a header to inform developers
      • The information doesn't apply to .NET
      • The documents are automated and should not be updated manually
    • Changes are moved from the "stepping stone" documents into the actual markdown documents manually
  2. Markdown to API website phase
    • Actual markdown documents are updated manually (this will happen frequently, as we have many changes to document)
      • Stripped of any irrelevant Java information
      • Add relevant .NET information
      • Code examples converted
      • May include tokens for generating URLs that are the same format as the "stepping stone" documents
      • May need to add additional pages that don't correspond to anything in Java
    • API doc generation from these documents is fully automated
    • Converts tokens into actual URLs

Effectively the "stepping stone" documents (for lack of a better term) would simply be there to detect changes from the Java world (only happens when we port a new version of Lucene), which would then be manually copied (or simply ignored) on the .NET documentation side of things. I suppose the "stepping stone" documents don't even have to be in the same repo, as they are only for tracking changes.

Keep in mind there are only ~250 namespace documents and only a few dozen of them have more than a single line description. There isn't much manual work involved in converting them initially, and there will be even less manual work to convert the changes when we upgrade to the next version of Lucene. The laborious part of the cleanup is to convert all of the URLs into tokens, convert namespaces, and JavaDoc format into Markdown, which we have covered (mostly?) with automation.
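The backup-and-compare step in the plan above might look something like this. Paths are hypothetical stand-ins for wherever the "stepping stone" docs live, and any diff tool (BeyondCompare included) works on the resulting directories:

```shell
#!/bin/sh
# Sketch of the backup-and-compare step for "stepping stone" docs.
SRC="stepping-stones"
BACKUP="stepping-stones-backup-$(date +%Y-%m-%d)"

# (illustrative setup: in reality $SRC is the converter's committed output)
mkdir -p "$SRC"
printf 'converted javadoc content\n' > "$SRC/package.md"

# 1. Back up the current converter output before re-running the conversion:
mkdir -p "$BACKUP"
cp -r "$SRC"/. "$BACKUP"/

# 2. Re-run JavaDocToMarkdownConverter here, overwriting $SRC ...

# 3. Compare old vs. new (BeyondCompare or plain diff) to see what changed
#    in the upstream Java documentation:
diff -ru "$BACKUP" "$SRC"
```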

Shazwazza commented 4 years ago

I noticed that the documentation on the beta-00008 file is now linking to the wrong document (looks like it was the right document in the beta-00007 document).

This will be some annoying issue with namespace/project changes between then and now that isn't being tracked; I'll have to track it down and fix it. The fix will be part of the conversion tool, to put files into the right places.

and breadcrumbs (that are pointing to the test framework).

Due to so many overlapping namespaces, breadcrumbs can be annoying; it's a known issue with docfx and I haven't heard a reply from them. If/when we think it's important enough, we'll probably have to write JavaScript to fix it, see https://github.com/dotnet/docfx/issues/2041#issuecomment-328394103

The rest of your info is stuff that we need to do and figure out. Before we consider doing that, we need to ensure the files are in the right places (i.e. not all of them are, as you have seen from the codec thing, but I worked on this for a long time, so there are probably very few that are incorrect. The problem is that so much of this was entirely changed again two versions ago, so I had to re-map everything).

But yes, at the end of the day it's going to be much easier for us to just change the converted files and commit them; we'll just need to figure out a way to re-execute the doc conversion and merge in new changes. Keep in mind that we may end up needing to re-run the conversion tool even after we've started manually editing files, and even before another major version, because we may discover that a ton of markdown has been converted incorrectly due to the source Java files being very odd. This has happened frequently in this process because the Java docs are so inconsistent. But maybe we deal with that on a case-by-case basis; I don't think there's going to be a magic way of automating the re-merge. The only feasible thing I can think of is to save the automated docs to a 'baseline' (possibly a git branch). Then we edit all the files we want to, and if we need to re-execute the conversion tool, we do so on the baseline git branch and merge that forward to ours. Might work.
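That baseline-branch idea could be sketched as follows. Branch names and file contents are hypothetical, and the comment where the converter runs stands in for JavaDocToMarkdownConverter:

```shell
#!/bin/sh
# Sketch of the "baseline branch" merge workflow (hypothetical branch names).
set -e
git init -q docs-repo && cd docs-repo
git config user.email "docs@example.invalid" && git config user.name "docs-bot"

# Baseline branch: untouched JavaDocToMarkdownConverter output.
printf 'converted javadoc text\n' > package.md
git add package.md && git commit -qm "Converter output (baseline)"
git branch docs-baseline

# Working branch: our manual .NET edits on top of the conversion.
git checkout -qb docs-edited
printf 'manually added .NET notes\n' >> package.md
git commit -qam "Manual .NET edits"

# Later: the converter is re-run on the baseline branch...
git checkout -q docs-baseline
printf 'newly converted javadoc text\n' > new-namespace.md
git add new-namespace.md && git commit -qm "Re-run converter against new Lucene tag"

# ...and the updated baseline is merged forward into our edited branch,
# resolving any conflicts against the manual edits.
git checkout -q docs-edited
git merge -q docs-baseline -m "Merge updated conversion baseline"
```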

Docfx also has something called 'overwrite' files, which I was going to use for namespace docs that weren't converted from javadocs; this also means we wouldn't need to worry about the conversion re-overwriting our files. I just haven't gotten to this stage of the process yet.

So, in the near term, where is the best place to put the new namespace documentation for Codecs?

First I need to fix the codec mappings. Then the javadoc file that you can edit will exist here: /src/Lucene.Net/Codecs/package.md

NightOwl888 commented 4 years ago

So, in the near term, where is the best place to put the new namespace documentation for Codecs?

First I need to fix the codec mappings. Then the javadoc file that you can edit will exist here: /src/Lucene.Net/Codecs/package.md

Got it. I will update the document in a branch and submit a PR so you can do updates to master. Then I can fix any conflicts before merging.

Any thoughts on how to add additional documents? For example, it might be best to separate ASP.NET Core configuration examples from command line examples or other application frameworks.

Shazwazza commented 4 years ago

For the short term of fixing this codec issue (and there are others like it):

The issue is that the way Lucene classes/namespaces are structured, they overlap with each other. This is also why we have the breadcrumb issue. Because of this, we get overlapping uid values, where a uid is the unique name of a document. For example, the UID for the codec namespace within Lucene's 'core' package is uid: Lucene.Net.Codecs, the UID for the codec namespace within Lucene's 'codecs' package is also uid: Lucene.Net.Codecs, and again this overlaps with the codec package in the test framework: uid: Lucene.Net.Codecs
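For illustration, the generated metadata for each of the three assemblies would contain a namespace entry along these lines — a simplified sketch of docfx's managed-reference YAML, not the exact output:

```yaml
# From Lucene.Net.dll (the 'core' package):
- uid: Lucene.Net.Codecs
  name: Lucene.Net.Codecs
  type: Namespace

# From Lucene.Net.Codecs.dll (the 'codecs' package) -- same uid:
- uid: Lucene.Net.Codecs
  name: Lucene.Net.Codecs
  type: Namespace

# From Lucene.Net.TestFramework.dll -- same uid again, so links and
# breadcrumbs can resolve to whichever document was indexed last.
- uid: Lucene.Net.Codecs
  name: Lucene.Net.Codecs
  type: Namespace
```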

The reason this didn't behave this way in the previous docs version is that the converter wasn't picking up all of the namespace files and including them in the conversion, which meant a lot of docs were missing; however, because we were missing docs, we probably didn't have uid collisions.

There's a couple ways to fix this:

I'll try the first one now since it's easier, but both have pros/cons. I actually think the second option might be safest, but I'm unsure how it will work just yet. I'll just have to give these a shot.

Any thoughts on how to add additional documents? For example, it might be best to separate ASP.NET Core configuration examples from command line examples or other application frameworks.

There are a few ways to do examples. DocFx supports 'overwrite files', like I mentioned, and this is already configured, but I haven't played around with it much yet. Basically, it allows you to add metadata to any conceptual (i.e. code) document. So, for example, in C# you might have a class but didn't add /// <example> put code here </example> inline in your code; well, you can add this metadata to that class via overwrite files. There can be a number of examples per class or method, too. So that's one way, but it depends on where you want the docs or examples to live. Each 'namespace' or 'class' page can have any amount of information on it before it starts listing members, etc. There's a tabbed markdown feature https://dotnet.github.io/docfx/spec/docfx_flavored_markdown.html#tabbed-content which might work for varying frameworks, but I'm unsure whether that would work with overwrite files for ///-style code. I guess there are a lot of ways to do things, so I would need a concrete example of what you want to do.

https://dotnet.github.io/docfx/tutorial/intro_overwrite_files.html
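For reference, a docfx overwrite file is a markdown file with a YAML header naming the uid to merge into. Something like the following could attach an example to a class without touching the source — the uid and the body text here are illustrative:

```md
---
uid: Lucene.Net.Codecs.Codec
example: [*content]
---
Markdown placed here is merged into the `example` section of the generated
page for the uid above, as if it had been written inline in /// comments.
```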

Shazwazza commented 4 years ago

Actually this isn't going to work:

Change the uid values to include the orig package name like core/Lucene.Net.Codecs (or similar).

This is because, internally, an automatic UID is assigned to each namespace by docfx, and we cannot change it. This is really why there's odd behavior, especially for the Lucene.Net.Codecs one, because it exists in 3 places. This is also the problem with the breadcrumb. Changing the structure to be more like the Java docs is, I think, the only good way forward. For example, you start at the landing page https://lucene.apache.org/core/8_5_2/index.html and click on a package; it takes you to a docs 'mini site' only for that package, and if you want to see another package, you have to come back to the landing page. We can do the same, but at least ours will have a link back to the landing page.

I'll see what I can do, but this will solve a lot of issues.

NightOwl888 commented 4 years ago

I guess there are a lot of ways to do things, so I would need a concrete example of what you want to do.

The example that was followed to extend both codecs and "system properties" is outlined in DI-Friendly Framework. The concept is rather abstract, and I was hoping to provide some specific examples that pertain only to codecs (in particular for ASP.NET Core and in console applications).

Microsoft has a Dependency Injection in ASP.NET Core document that gives a high-level overview. However, when searching for a similar document for console applications, there are only examples such as this one by a third party.

Effectively, what I am hoping for is

I think the ones with the asterisk are the most important.

Ideally, they would be separate documents that are referenced in other relevant places. For example, there should eventually be a document that explains how to configure the test framework with DI in general, which would link over to the how-to on testing a codec.

I am also kicking around the idea of making extension methods for Microsoft.Extensions.DependencyInjection to make registration of common Lucene.NET services seamless (which of course would simplify the configuration and the docs). For example:

public void ConfigureServices(IServiceCollection services)
{
    services.AddRazorPages();
    services.AddDefaultLuceneCodecs(); // Register Lucene.NET Codecs with DI in one line

    services.AddScoped<IMyDependency, MyDependency>();
    services.AddTransient<IOperationTransient, Operation>();
}

One issue with doing so is whether we should take a dependency on Microsoft.Extensions.DependencyInjection.Abstractions to build the functionality into our existing assemblies, or create additional integration assemblies (and there would need to be several of them). The cost of taking on a non-invasive dependency to make it "just work" with Microsoft apps seems low compared to creating several packages for the sake of integration, so I am leaning toward that approach, especially since we have already taken a dependency on Microsoft.Extensions.Configuration.Abstractions for a similar reason.

NightOwl888 commented 4 years ago

Actually this isn't going to work:

Change the uid values to include the orig package name like core/Lucene.Net.Codecs (or similar).

This is because internally for each namespace an automatic UID will be assigned to it by docfx and we cannot change it. This is really why there's odd behavior especially for the Lucene.Net.Codec one because it exists in 3 places.

Seems a bit odd that it is not supported, as it is quite common for Microsoft .NET assemblies to extend namespaces. In fact, in the [Dependency injection in ASP.NET Core]() document, they specifically recommend putting extension methods in the Microsoft.Extensions.DependencyInjection namespace even though the assembly name will have to be different to avoid a collision.

Any chance we could contribute it back to docFx?

Shazwazza commented 4 years ago

Right, well, if you just want to be able to have pure documentation pages like how-tos, then we just make normal pages and a table of contents, just like the normal Lucene website. The MS docs you mentioned, https://docs.microsoft.com/en-us/aspnet/core/fundamentals/dependency-injection?view=aspnetcore-3.1, are not based on API-generated docs; they are just pages, which of course we can do too. If you want to put docs/examples directly into the generated pages for namespaces/classes, that is fine as well, and that is how the current Lucene 'overview' files work. Apart from that, though, we can just make our own pages to do whatever we want, like the website.

Seems a bit odd that it is not supported, as it is quite common for Microsoft .NET assemblies to extend namespaces

Yep, I know, and it's been mentioned a few times in that docfx thread, but they haven't responded to it, and I'm unsure what the story is for DocFx 3 (or when that will ever be released). The end result, though, is that trying to change the UIDs for namespaces is going to result in huge changes to everything, and I don't really want to go down that rabbit hole, since I think getting it right would take weeks of my time. I'm pretty sure I can get the site built with sub-sites and have it looking/working relatively similar. The challenge from there is not being able to cross-reference a doc by ID between different site builds, but I don't think the Lucene docs can do that either. At least for that scenario, it would be fairly easy to build a plugin to enhance cross-referencing. There are still a bunch of docfx features I'm investigating for this; it might turn out that there are already workarounds.

NightOwl888 commented 4 years ago

I have created the new PR #291, which contains the draft of codec documentation updates. I ended up putting all of the documentation in the package.md file since it turned out to be much less text than I had envisioned.

However, since it is all in one document, there are several places where it references types in Lucene.Net.Codecs.dll and Lucene.Net.TestFramework.dll. For now, I just use the same xref links for everything. Is there a convention you have in mind to use for referencing the sub-sites?

I would appreciate a review of the PR. Thanks.

Shazwazza commented 4 years ago

I've spent quite a lot of time over the weekend investigating options for this issue. Going with the option of creating 'mini sites' has shortcomings as well, because xrefs won't resolve between sites since they would be compiled separately from each other. I haven't had enough time to see what extending the xref behavior requires, and I hope that might be possible. I've looked into potentially keeping our structure so that xref behavior 'just works', hoping to change the behavior of docfx to be in control over how the API docs are produced. I've researched every type of docfx plugin I can find and trawled through the source code to try to figure out whether the API builder can be extended, and unfortunately I'm coming up short of a hopeful solution. I've asked questions on their gitter channel as well as re-asking them on GH, but have received no response. From what I can tell in the docfx code, the API docs and metadata for them are generated directly using Roslyn, which is done in this class, and the process starts here, but none of it is extendable that I can see, and the API docs extraction doesn't use their regular pluggable techniques. The only possible extension point I can see is some obscure Roslyn extension methods, but all of this is entirely undocumented, so I don't even know if I want to attempt to go down that avenue. Ideally, if we could 'just' change the root namespace UID for each 'package' to be specific to that 'package', then all classes/namespaces in that tree would inherit that base UID, which in turn would make unique xrefs. But I do not see this as a possibility in docfx. So I am left with 2 options:

That's the status so far; I wish there were an actual documented way to do this. I'll spend a bit more time this week to see what I can find and hopefully come up with a proof of concept.

Shazwazza commented 4 years ago

I've had success with the mini site option! I don't think option 2 is feasible, as there would be huge amounts of changes needed in docfx. I'll keep plugging away at this, but I think it's a winner and will actually end up being nicer to use.

NightOwl888 commented 4 years ago

Great. If that works, we can use it.

However, I took a look at the source code and it seems that the UID functionality can be overridden by injecting a custom IAssemblySymbol (which can be a decorator around the original). While it will take more than just this one decorator to ensure that this can be injected, the difficult part is getting to the point where the top level dependency can be overridden to inject our custom decorators.

I walked up the class dependencies:

internal class RoslynMetadataExtractor (Inject IAssemblySymbol into constructor)
public class RoslynIntermediateMetadataExtractor : IExtractor (Extract())
public class RoslynSourceFileBuildController : IRoslynBuildController (ExtractMetadata())
public sealed class ExtractMetadataWorker : IDisposable (SaveAllMembersFromCacheAsync())
internal sealed class MetadataCommand : ISubCommand (Exec())

[CommandOption("metadata", "Generate YAML files from source code")]
internal sealed class MetadataCommandCreator : CommandCreator<MetadataCommandOptions, MetadataCommand>
{
    public override MetadataCommand CreateCommand(MetadataCommandOptions options, ISubCommandController controller)
    {
        return new MetadataCommand(options);
    }
}

So, at the top level, we end up with an abstract factory, which is perfect for injecting a custom class with dependencies. The command is being exported using the MEF ExportAttribute which is subclassed by CommandOptionAttribute.

I did a bit of research, and it seems that adding an ExportMetadataAttribute can be used to create a custom component with a higher priority than the original component in order to replace the built-in factory with a custom one.

Ultimately, it is the CompositionContainer that is responsible for resolving the instance of the MetadataCommandCreator class, which is what needs to be tested to verify that we can override the default MetadataCommandCreator with a custom one. From that point, replacing and/or decorating the components with custom ones is fairly straightforward.

Shazwazza commented 4 years ago

Hi @NightOwl888, thanks for having a peek through the source too. We are already using extensions for DocFx; that is what our LuceneDocsPlugins csproj is all about, and it uses MEF to extend the functionality. I also discovered that IAssemblySymbol is what sets the UID based on the namespace, but that is a Roslyn object and not part of docfx, and as far as I can tell, none of the ManagedReference functionality that puts together the API docs is using MEF to compose itself, though MEF is used in other places, like generating Conceptual (i.e. non-API) docs. I didn't see that IExtractor was public; perhaps there's a way to override that, but because MEF isn't used in composing this part of the app, I'm just not sure. There was a single extension point I did find for the ManagedReference part, ExtractMetadataOptions.RoslynExtensionMethods, and these extensions are discovered by RoslynIntermediateMetadataExtractor.GetAllExtensionMethodsFromCompilation, though I didn't spend much more time to see if this can work.

The mini site approach seems to be the preferred approach for multi-project systems. That is what Microsoft is doing (MS docs are built with docfx, probably with a lot of customization and knowledge of workarounds); for example, https://docs.microsoft.com/en-us/dotnet/api/?view=aspnetcore-3.1 is their landing page, and each item links off to a 'sub site', which is essentially what I'm doing now. I have also chatted with another project about their usage of docfx, and that is what they do as well.

I've also discovered more extensibility, which I'm now using. For each subsite, an xrefdoc.yml is generated, which is a list of cross-reference links. Then, for each build of each site (and the main site), you can feed in external xrefdoc.yml files so that cross references work between projects. I'm trialing this now, but there will still be some interesting 'gotchas', because there will still be overlapping namespace UIDs, which I still need to figure out. There's also something called a pre-processor, which is done at the template-building level via nodejs; it lets you completely override all metadata for each page, including namespace information, etc., but for some reason I think the UID is protected. I haven't fully pursued this yet, but it's on my list of tools that we can use.
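Feeding one site's cross-reference map into another's build can be configured in docfx.json; a sketch of what the build section might look like, with hypothetical paths (docfx typically emits this file as xrefmap.yml, and the exact wiring here is illustrative):

```json
{
  "build": {
    "xref": [
      "../core/_site/xrefmap.yml",
      "../codecs/_site/xrefmap.yml"
    ],
    "dest": "_site"
  }
}
```

With something like this, `@Lucene.Net.Codecs`-style xrefs in one subsite could resolve to pages built by another, subject to the overlapping-UID caveat above.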