gemrb / gemrb.github.io

Main website of the project
https://gemrb.github.io

API docs: format conversion #9

Closed lynxlynxlynx closed 4 years ago

lynxlynxlynx commented 4 years ago

API docs are inline in the code and we have a script to extract them and then put them on the web. We do this periodically, usually before a release, and that's fine. However, the docs are in Dokuwiki format and will need to be converted to Markdown.

#10 will find the general solution, and here we need to integrate it into the regexy generator script:

https://github.com/gemrb/gemrb/blob/master/admin/guidoc_wikifier.sh

kindaro commented 4 years ago

This looks approachable. I think I can solve it within a week. There are a few details though.

  1. I need a clear criterion for success. Is it sufficient if I extract every doc-string into a separate, appropriately named markdown file?
  2. Depending on the availability of the required feature, we may need to use Pandoc as either:

    • An online service.
    • An off-the-shelf executable.
    • A custom-built executable.

    It may happen that the user will need to follow some instructions to obtain the custom-built executable with the features we need.

  3. Pandoc can put the docs together and build the whole web page. So we may have an eye on eventually going all in and using it to build the whole web site.

Let me know what you think.

lynxlynxlynx commented 4 years ago
  1. No, there are a few other files in gemrb/docs/en/GUIScript/, including an index. Check dumpDocs(). The pages will also need jekyll frontmatter at the start:
    ---
    title: blablabla
    ---

    We could also add a "module" line, but that would require you first read the function list for both (it's at the end), so let's leave this for later.

I can code up the index once the files are there ("jekyll collection"). Mimic the structure we have now:

GUIScript/functions/ <-- all the extracted files
GUIScript/ <-- all the static files + index
  2. This is overkill, since not much markup is used. From a quick sampling: bullet lists (same syntax), bold (same), implicit (same) and explicit (same?) code blocks, links and headers. Just convert them in python.

  3. I don't see any benefit, since another site just means more fragmentation and/or work.
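For illustration only, the "just convert them in python" idea for links and headers, plus the Jekyll front matter mentioned above, could look roughly like the sketch below. It is not the script that was eventually committed. The function names `convert_line` and `to_page` are hypothetical, the header rule assumes the standard DokuWiki convention (six `=` for the top level), and link targets are copied through unchanged rather than mapped to site paths.

```python
import re

# DokuWiki headers: "====== H1 ======" down to "== H5 ==".
HEADER = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
# DokuWiki links: [[target]] or [[target|label]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def convert_line(line: str) -> str:
    """Convert one line of DokuWiki markup (headers and links only)."""
    header = HEADER.match(line)
    if header:
        # Six '=' is the top level in DokuWiki, one '#' in Markdown.
        level = 7 - len(header.group(1))
        return "#" * level + " " + header.group(2)

    def link(match: re.Match) -> str:
        target = match.group(1).strip()
        label = (match.group(2) or target).strip()
        return f"[{label}]({target})"

    return LINK.sub(link, line)


def to_page(title: str, dokuwiki: str) -> str:
    """Prefix the converted body with minimal Jekyll front matter."""
    body = "\n".join(convert_line(line) for line in dokuwiki.splitlines())
    return f"---\ntitle: {title}\n---\n\n{body}\n"
```

Bullet lists and bold are left alone here because, as noted in the comment, their syntax is reported to be the same in both formats.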

kindaro commented 4 years ago

So we are not using Pandoc in the end? I thought that was your proposition based on #10. Now your proposition is that we roll out a custom parser and pretty printer?

lynxlynxlynx commented 4 years ago

It's what I'm using for the website content transmogrification, but here there's no need, since there's just 2-3 search&replace calls to be done. The content is already structured, so there's no extra pretty printing needed either.

kindaro commented 4 years ago

I see where this is going. You are suggesting that we can convert DokuWiki into Markdown via a handful of regular expressions.

Approaching structured data as if it were known to be a regular language (that is, with a belief that regular expressions are sufficient for manipulating it) is asking for trouble. I can see how it could be tempting to believe that Markdown and DokuWiki have the same structure and therefore only the decorations (that we assume are regular) need to be changed. Surely individual things, like //italic// markers, can be replaced. But DokuWiki format is large enough to make doubt reasonable. For example, how would a regular expression know not to replace italic markers inside a code block? Your argument would go along the lines of «we do not use italics inside code blocks», but I would hate to bet that we never use anything identical to DokuWiki markup inside code blocks. It is unfounded and dangerous to believe that the subset of DokuWiki that we shall ever need to convert is simple enough for regular expressions to handle without corner cases.
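The italics-inside-code concern can be made concrete with a small sketch. The sample text is invented for the demonstration, and `naive` and `code_aware` are hypothetical names. The second function only protects explicit `<code>` blocks, not DokuWiki's indentation-based ones, so it is a partial mitigation rather than a real parser.

```python
import re

ITALIC = re.compile(r"//(.+?)//")
# Capturing group so that re.split keeps the <code> spans in the result.
CODE_SPAN = re.compile(r"(<code>.*?</code>)", re.DOTALL)


def naive(text: str) -> str:
    """Replace //italic// with *italic* everywhere, code included."""
    return ITALIC.sub(r"*\1*", text)


def code_aware(text: str) -> str:
    """Apply the same replacement, but skip explicit <code> spans."""
    parts = CODE_SPAN.split(text)
    # With one capturing group, odd-indexed parts are the <code> spans.
    return "".join(p if i % 2 else naive(p) for i, p in enumerate(parts))


# Invented sample: floor division inside a code block looks like italics.
sample = "An //important// note.\n<code>x = a // b // c</code>"

print(naive(sample))       # the code line is rewritten as well
print(code_aware(sample))  # the code line is left untouched
```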

The right approach is to use the existing, well supported solution, which is Pandoc. The source of pain is the menacing ghost of building a large program written in an unfamiliar language, with an obscure build system. If you think this pain outweighs the pain of dealing with regular expressions, then order me and I shall do your bidding, but I shall have to disclaim responsibility for italics inside code blocks.

In short, you should stop micromanaging and let me do the right thing.

lynxlynxlynx commented 4 years ago

I don't know if you're aware, but you come across as pretty hostile. There's no need for that. And if you don't want my opinion, then please don't ask for it.

If you want to over-engineer the solution, that's up to you, it's your time. This is not so critical that dependency issues would matter much. The problem space is much smaller than you think, since the docs were autogenerated, but since you want something more generic, just go for it. Perhaps pandoc will be enough; in my use for porting the content, it hasn't always produced good results, however for this subset of syntax, I think it'll do fine.

kindaro commented 4 years ago

No hostility is intended. Please assume good intentions. I am sorry for any misunderstanding.

kindaro commented 4 years ago

I came up with 2 ways of extracting doc-strings. See:

  1. By exploiting the C++ source.
    • I am thinking that this may be improved by making GemRBMethods public and linking against GUIScript in good faith instead of this #include hack.
  2. By querying the Python interpreter.
    • I prefer this method since it seems generally reasonable to have a way to fire up the Python interpreter (without booting any specific game).
    • P. S. Since this way we have access to all possible meta information, wildest dreams can be implemented.
  3. P. S. I suppose I can research some more and spin up a more comprehensive instantiation of the engine, including an Interface and whatever else needed to run a Python interpreter the way it currently runs in actual gaming experience.
  4. P. S. If the worst comes to the worst, I think we can parse the C++ source through either GCC or Clang. There are some tools.
    • But it goes against the purpose of the change. We want to make the extraction of the doc-strings resilient against the refactoring of the source code, so that we reliably obtain a doc-string exactly when the corresponding function is made available to the Python interpreter. Clearly the syntactic shape of the code is not enough to determine that. Using a syntax-aware «grep» is only marginally better than the current approach with regular expressions.
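A minimal sketch of the second option, querying the interpreter, is given below. Since the GemRB modules cannot be imported outside the engine, a stand-in module named `demo` is built by hand; it, the two functions attached to it, and the helper `collect_docs` are all hypothetical.

```python
import inspect
import types


def collect_docs(module) -> dict:
    """Map every public callable in `module` to its doc-string ('' if none)."""
    docs = {}
    for name, obj in inspect.getmembers(module, callable):
        if name.startswith("_"):
            continue
        docs[name] = inspect.getdoc(obj) or ""
    return docs


# Hypothetical stand-in for a module exposed by the engine.
demo = types.ModuleType("demo")


def _apply_effect(slot, name):
    """ApplyEffect(slot, name) - placeholder doc-string for the demo."""


def _undocumented():
    pass


demo.ApplyEffect = _apply_effect
demo.Undocumented = _undocumented

docs = collect_docs(demo)
# Functions that are exposed but carry no documentation are easy to list.
undocumented = sorted(name for name, text in docs.items() if not text)
```

Because the names come from what the interpreter actually exposes, a function appears in `docs` exactly when a script could call it, which is the property argued for above.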

Seeing how our previous conversations uncovered certain differences of opinion, I am expecting that you will be grumpy about my aspirations to replace sed/awk hacks with a principled solution, and none of the above will get merged. But you are welcome to surprise me.

P. S. I gave Pandoc a spin, it works out of the box.

What I mean to express is that the problem of converting DokuWiki markup into Markdown is made trivial by Pandoc. At worst, some small patches to the latter may be required, which is within my power to produce. The real problem is that the architecture of GemRB does not make it easy to extract the doc-strings in the first place.

At this point I need to know which, if any, of the ways outlined above I can expect to be merged, and then I hope to move this conversation into a draft pull request for polish and review.

kindaro commented 4 years ago

I went about considering how better to refactor GUIScript to expose both the value and the type (length) of a method array. A problem is that the implicit size parameter makes it impossible to link to an array across translation units. One way to solve it is to define a method array as a static constant in a header, so that every translation unit including it receives an identical hard-wired unchangeable copy. Otherwise the number of the entries (that is to say, the complete type of an array) may be put in a header, and then the value of the array can be linked against, but it is a bit inconvenient to have to adjust both the header and the definition whenever a method is added or dropped. Finally, the correct solution could be to embrace modern C++ features and use a container with template iterator interface. Why are we stuck with C-style byte management in the first place?

On the other hand, we may be better off considering GUIScript exclusively Python API focused and instead make it easy and cheap to run a Python interpreter. That direction is of larger scope, so I have not been looking there yet.

lynxlynxlynx commented 4 years ago

So what was to be a diff of about 10 lines is now one of 31 thousand and it's not even done yet. Impressive! :open_mouth:

I see three ways to keep extraction much simpler (besides the current 3 lines) and they all do everything in python. You already got the introspection bits, just do the rest and stick it into our demo gametype to be run through gemrb (if an env var is set). We already did that when we needed some text related test torture and a graphical loading test elsewhere. Or just leave it as something for the user to run manually through our console.
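The "if an env var is set" gate could be as small as the sketch below. The variable name `GEMRB_DUMP_DOCS` and the function `maybe_dump_docs` are made up for illustration; the thread does not say what the demo gametype actually checks.

```python
import os


def maybe_dump_docs(dump, environ=os.environ) -> bool:
    """Call `dump()` only when the (hypothetical) switch variable is set."""
    if environ.get("GEMRB_DUMP_DOCS"):  # hypothetical variable name
        dump()
        return True
    return False
```

Passing the environment as a parameter keeps the gate testable without touching the real process environment.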

There's also our twisted connector contrib/manhole.py, which looks like it needs just two small tweaks to be ready for this job.

Making the modules generally importable just to be able to see its method list is not justified. The modules are useless without the engine, so the extra complexity just makes things worse. For something that is run around once per year.

All the six or so proposed solutions are still dodgy though. And not even in the "perfect is the enemy of good" sense. A simple script does the job and does it well, also being almost independent of the code it works on, not something that will work just from the merge point on.

It appears I made another mistake — it's not just your time being wasted. Too late now.

kindaro commented 4 years ago

> I see three ways to keep extraction much simpler …

This is the kind of opinion I can use. It is impossible to know at the outset which direction will turn out to be the most fruitful, so advice from a person familiar with the architecture is a boon.

> All the six or so proposed solutions are still dodgy though. And not even in the "perfect is the enemy of good" sense. A simple script does the job and does it well, also being almost independent of the code it works on, not something that will work just from the merge point on.

It is extremely dependent on the code it works on. As I argued above, regular expressions are not suitable for dealing with any but the simplest kind of language. The smallest syntactic alteration will send a sed/awk solution down in flames. It seems impossible to argue in favour of the solution that is currently in place.


But I can guess from your tentative remarks that you are indeed grumpy. I can see now that your approach to development is extremely conservative, and, had I known it sooner, I may not have offered my services in the first place. You should be careful not to spill your discontent onto other open source participants though. If you imply that I am wasting someone else's time, then it is a grave accusation that I have done nothing to deserve, and it is not acceptable. As a maintainer, you can accept or decline any proposed change at your discretion. But you do not get to belittle others.

lynxlynxlynx commented 4 years ago

Like I said initially, we don't need a perfect dokuwiki2markdown converter. We're not parsing a language, but a few tags. Even pandoc doesn't provide a full translator. We use a very small subset of the language and in case any conversion was missed also in testing, causing misrendering (which is not a given, since they share syntax), it would be easy to fix and redeploy as it was found. I don't understand the argument about fragility either. Sure, in general, but this works already right now and we have control over the input.

Also, it's a temporary solution anyway, since we should just convert the source docstrings eventually. Now is just not a good time due to the refactoring going on. And when the time comes, it will be done very easily in the editor of choice of the person doing it.

kindaro commented 4 years ago

> Like I said initially, we don't need a perfect dokuwiki2markdown converter. We're not parsing a language, but a few tags. Even pandoc doesn't provide a full translator.

As I understood from our previous conversation, you have no objections to Pandoc in particular as a solution to the conversion problem. As I said previously, I tried it out and it seems to work without a flaw, so our real problem is not to convert, but to extract.

> We use a very small subset of the language and in case any conversion was missed also in testing, causing misrendering (which is not a given, since they share syntax), it would be easy to fix and redeploy as it was found.

If we botch the extraction, however, it would not be easy to notice. Currently, a tiny change to the source code can lead to a method's documentation going missing.

As you can see, I am willing to go to some length to make sure the solution is maintainable and future-proof. With your permission, and as per your advice, I shall go and research the ways to get that interpreter running with the smallest possible expense, with an eye to a draft pull request. The solution that involves firing up a Python interpreter and querying it can guarantee in a straightforward fashion that all built-in methods available to the user have a corresponding page on the web site, so I too have a liking for it. I suppose in a few days a proposal can be ready.
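The guarantee described here, that every exposed method has a page, can be checked mechanically once the list of method names is available. A hedged sketch follows: `missing_pages` is a hypothetical helper, the `.md` suffix is an assumption about the generated files, and the directory is a temporary one created for the demonstration.

```python
import tempfile
from pathlib import Path


def missing_pages(method_names, pages_dir, suffix=".md"):
    """Return the method names that have no page file in `pages_dir`."""
    existing = {path.stem for path in Path(pages_dir).glob(f"*{suffix}")}
    return sorted(set(method_names) - existing)


# Demonstration with a throwaway directory and made-up method names.
with tempfile.TemporaryDirectory() as pages:
    (Path(pages) / "ApplyEffect.md").write_text("---\ntitle: ApplyEffect\n---\n")
    report = missing_pages(["ApplyEffect", "SomeOtherMethod"], pages)
```

A non-empty `report` would flag a silently dropped page, which is the failure mode discussed in the surrounding comments.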

lynxlynxlynx commented 4 years ago

It's overkill, but sure. My other worry was that the version I tried didn't support dokuwiki, however I've upgraded the box since, so hopefully that's changed.

Why would we botch the extraction? You'd have to try deliberately, since the code also still needs to compile. But even if we did, it'd be as easy to notice as bad transformations — can't be done unless looked at. Well, except for the case where really nothing is output, which is trivial to detect if desired.

Let's see what you come up with and if the jumping through hoops is suddenly justified.

kindaro commented 4 years ago

I had some health problems, so I had to delay this work. I am back on track now, so please wait a few more days.

kindaro commented 4 years ago

@lynxlynxlynx  See https://github.com/gemrb/gemrb/pull/698.  Please review and tell me whether this aligns with what you had in mind.

lynxlynxlynx commented 4 years ago

What happened with the twisted approach?

kindaro commented 4 years ago

I decided that it is not well motivated. A strong downside is that it implies launching an actual game and issuing commands via console. So, for example, it would be difficult to automate. Overall, it seems unnecessarily complex.

lynxlynxlynx commented 4 years ago

It would require interactivity only if badly designed, but none of your approaches are trivially automated anyway, unlike the original, so that's a moot point. But I agree, unnecessarily complex.

lynxlynxlynx commented 4 years ago

To illustrate, this is what I did in the end, since the bug became a blocker for launch: https://github.com/gemrb/gemrb/commit/06ddb42e09c4d6ca9447ae8cdbf83bb618d84af2

The results are already up: https://gemrb.github.io/GUIScript/functions/ApplyEffect.html

If you want to make your solution perfect, you're welcome to, but it would remain a (portability) exercise. Up to you, we don't need to close the PR immediately.

kindaro commented 4 years ago

@lynxlynxlynx  I see, you waited since November, but you could not wait a few days more for me to get the build to work on Windows, and I suppose you did not have a Linux installation at hand. Disappointing.