Autogenerating Website from Code Comments: A Proposal Sketch

Hi @SicroAtGit, I have a proposal: I would like to write an app that can generate the required HTML pages to create a website for PureBasic-CodeArchiv-Rebirth (via GitHub Pages project's website). I would take care of all the coding, so it won't burden you any further, and when it's ready — and if you like it — you could then merge it in.

Here is a general sketch of what I have in mind, and I'd like to create a system that doesn't add a burder in the project maintainance, but actually slims it down.

HTML Files and Website Structure

Set GitHub Pages to publish from master branch, so that the actual webpages are HTML files inside the repo, alongside with the code files.
Have single index.html per folder, containing auto-generated resume cards for all the source files inside that folder (and navigation links).
Have a single assets folder in the root, for CSS stylesheets and other shared assets, to prevent pollution of the project with lots of unrelated files.

Contents from Source Comments

The app should generated a resume card for each souce code file or project, by parsing the comments in the file. There would be very little changes (if any) to comments as they are now: the app's parser should be smart to detect what is what by looking for "<key>: <value>" structures. At the most, for special cases like multiline verbatim blocks (as in a license text), a single unobstrusive character might be added after the comment delimiter in order to allow the parser to know if that comment line is significant. For example:

;| The pipe after ";" could be a non-intrusive special char for the parser.
;{| Between the comment delimiter and the special char there might be a space
; | or one of the special PB comments chars "{ } -", so that code folding and
; | indexing might be preserved.
;}
;@ Different chars can be used to tell the parser what is what.

Special needs could be handled by machine generated comments at the end of the file.

Because of the "<key>: <value>" structure, the current comments system is already simple to parse, and intuitive to use for coders who want to add some code :

;   Description: Threadsafe FIFO-BufferQueue
;            OS: Mac, Windows, Linux
; English-Forum: 
;  French-Forum: 
;  German-Forum: http://www.purebasic.fr/german/viewtopic.php?f=8&t=27824
; -----------------------------------------------------------------------------

Oviously, the new system will impose strict rules on the entries structure and naming convention, but this is inline with the project's goals anyhow.

The system should be simple enough that contributors can handle it from within PB IDE, but it would be good to provide a dedicated IDE tool for validating the comments (check that all required keys are present, and that their format is correct).

The HTML Generator App

The app for generating the resume cards and HTML pages will be written in PureBASIC (I can reuse code I've written for the PB Archives CMS I'm working on). If needed, it might rely on some external tools, like pandoc for format conversion — possibly, standalone binary tools only. But I believe that the required HTML will be simple enough to handle it from PB code.

The app could be hosted inside the assets folder, along with sytlesheet and images, to avoid creating more folders. So, we won't be needing separate developers branches: everything needed to generate the HTML pages will be kept in the repo's master branch, without polluting it.

Keep in mind that, beside adding an online website to the PureBasic-CodeArchiv-Rebirth project, this will also make it browsable locally, and simplify the user experience of navigating through its contents.

Furthermore, the same data extracted from the comments and used to build the website pages could also be used (in the future) to build a database of the project's contents. We could then create a UI app that enables to quickly search and browse the project's contents using criterias such as OS compatiblity, categories, dependencies, tags, etc. So there is actually more potential in this idea then just a website.

Now, if you think the ideas is worth a try, I shall be working on it locally, and when a usable prototype is ready for testing I could create a testing ground repository so we can fine tune it.

If needed, it might rely on some external tools, like pandoc for format conversion

I prefer a variant that only requires PB. External tools must be recompiled each time a new version is released by the developer that fixes important bugs. Maybe we'll find annoying bugs later on, which we can't easily fix ourselves and the developer refuses to fix them or is no longer reachable.

We could then create a UI app that enables to quickly search and browse the project's contents using criterias such as OS compatiblity, categories, dependencies, tags, etc.

I had the same idea with the filter functions, but I didn't figure out how I can implement it within Github. Github Pages don't support javascript or PHP, I think -- and I don't want to set up an external web server for it. A tool that works offline seems to be the best solution.

I agree with the HTML generator app.

Thank you very much for your help so far.

I prefer a variant that only requires PB. External tools must be recompiled each time a new version is released by the developer that fixes important bugs. Maybe we'll find annoying bugs later on, which we can't easily fix ourselves and the developer refuses to fix them or is no longer reachable.

Ok, no problem with that. I was just thinking that if the app could process some text blocks via pandoc (via the Process library) it would allow to use markdown to HTML conversion; this could have been beneficial for the homepage, or even for formatting long descriptions. Keep in mind that pandoc is a small cross platform app, it can be installed or just unzipped into the folder in standalone, I've even created script to automatically download it and unpack it. My guess is that any user who works with Git and GitHub is likely or willing to install pandoc because it supports all the markup syntaxes used by GitHub for documentation.

I had the same idea with the filter functions, but I didn't figure out how I can implement it within Github. Github Pages don't support javascript or PHP, I think -- and I don't want to set up an external web server for it. A tool that works offline seems to be the best solution.

I was thinking of a local app, to allow users to browse their clone of the project quickly. An online version would require a special server. Maybe, once the website is in place we could look on how to implement it in JavaScript (or in SpiderBasic); the actual indexing could be handled by the static website creator app, and the index file manipulated via JavaScript. Modern browsers can easily handle this amount of data without suffering performance.

I agree with the HTML generator app.

Ok, than I'll start working on it, and when a usable prototype is ready I'll share it on GitHub (either on a branch or in a test repo).

Thank you very much for your help so far.

Thanks to you! this is the best PureBASIC project alive today, and I believe that it should receive top priority. My impression is that many users from the PB community are not attracted to Git and GitHub, but creating a browsable website is a good move to make it more accessible. Any step we take to make it easier to use will contribute to raise its popularity in the PB community.

I'm convinced that for any project there is a popularity threshold which is liking a tipping point, once the threshold has been reached a project becomes very popular and users start to contribute and use it as a reference. In this specific case, it is very important that this project could become a central reference, and attract PB user to contribute their code here, so that the current state of code being scattered all over the place, hardly findable in the huge forums maze of threads, could be dispelled.

I'm also thinking that in my upcoming revival of the PureBASIC Archives project I'll be moving all the code I can here, and just keep references and tutorials in my project, both to avoid code duplication and to reinforce the concept of this project being the main reference.

Keep in mind that pandoc is a small cross platform app, it can be installed or just unzipped into the folder in standalone, I've even created script to automatically download it and unpack it. My guess is that any user who works with Git and GitHub is likely or willing to install pandoc because it supports all the markup syntaxes used by GitHub for documentation.

I just visited the pandoc website. For Linux only a 64-bit version is offered as package, but nowadays most of the developers surely have a 64-bit system. When pandoc works out of the box. Then I agree to the use of this tool. But every developer should download the tool for himself. If we were to put the tool in the repository for all three operating systems (windows, linux, mac), it would cause the repository to expand enormously. Big binary files are bad in a git repository.

I fully agree with you on your other points.

Thanks for the kind words.

But every developer should download the tool for himself.

Not contrituros, only project maintainers/admins: contributers have to make a pull-request anyhow, and when an admin merges the pull-request he could run the app again to update the webiste pages. I would think that the burden of updating the project pages should be on the project admins' shoulders, and that pull requests should concern code only.

After all, imagine this scenario: 3 code author have updated their source and made a pull request — not all at once, but they have queued up during a month, after all it's normal that project admins reserve only certain days to look at pull requests. If all three contributors had also updates the HTML pages these would overwrite each other, possibly with conflicts, so the admin will have to cherry-pick the commits anyhow, and keep only the source changes. Also, if the HTML pages were updated after the PR was opened, the commiter will have to rebase his commit if he changed the HTML contents (and this would likely create conflicts)

After merging-in the three update code the admin then runs the app to update the HTML pages accordingly — after all, one page per folder means that a single source changed or added will require the whole page to be rebuilt.

If we were to put the tool in the repository for all three operating systems (windows, linux, mac), it would cause the repository to expand enormously. Big binary files are bad in a git repository.

No, it wouldn't be a good idea. Pandoc is about 50 Mb, which is not really big for today standards. If we were to stick to the same pandoc version for the whole project (ie: not updating it to use new features), than maybe the 50 MB wouldn't be a big issue, even if we were to add both Win and Linux binaries.

But the best solution is to provide a script that auto-downloads it and unpacks it to the app's folder. I've used this approach with a documentation project that relies on different tools (pandoc and three others), to ensure that the exact versions of these tools are downloaded:

https://github.com/tajmone/polygen-docs/tree/master/docs-src/tools

This is a preview of the final document:

http://htmlpreview.github.io/?https://github.com/tajmone/polygen-docs/blob/master/polygen-spec_EN.html

... without those tools it wouldn't have been possible to handle highlighting of that specific syntax (BNF-2 derivate).

As a final consideration, the App for generating/updating the HTML pages is going to be something that only the repo's admin will probably use — it's going to be in the assets folders, with the CSS and images, so probably most users won't be even looking there. I believe that most admins of the project have access to Windows OS, or Linux x64.

Alternatively, we could use GitHub web API to convert the (very few) markdown blocks of text: it's slower because of the internet connection, but works great.

The pandoc need was due to the fact that, beside the source files resume cards (which will be handle by the PB app entirely), we'll need an introduction page for the home, and maybe a descriptive paragraph for each folder/section. For the homepage, the ideal would be to have the app just take the README.md file and convert it to HTML, and then inject it into the website/pages template via the PB app.

Using GitHub's API should be fine too, after all we'll be dealing with a limited number of pages. Besides, the app should also cache contents, and check if the file SHA1/2 (or some other fingerprint) has changed since the last time a README was converted or source file was manipulated (the cache will be hidden in the apps folder, and gitignored altogether).

I've been thinking on how to handle the comments parsing, and here is the solution I came up with.

My consideration were that the system shoud have the following characteristics:

Existing comment headers should require very little changes (single chars additions that can be handled via RegEx search-&-replace)
The added char(s) must not be visually intrusive and compromise human readibility of the header by distracting the eye.
The pasring system should be:
- simple
- fast
- tollerant of textual variations (aliasing)
- extensible via simple settings files (JSON or the like)

So here is the proposal...

The app's parsing goal is to focus only on the block of comments found at the beginning of the file:

The comments block parsing ends when the firs non-comment line is encountered

It's reasonable to expect all the data we need to be in that block, as it is customary in such header comments.

The parser extracts <key>:<value> string pairs from comment lines that start with the ;: delimiter:

;:            OS: Windows, Linux, Mac
;: English-Forum: http://www.purebasic.fr/english/viewtopic.php?
;:  French-Forum: 
;:  German-Forum: http://www.purebasic.fr/german/viewtopic.php?
; -----------------------------------------------------------------------------

The ;: combination is pleasant to the eye: the colon is similar to the semicolon, so it goes almost unnoticed, and because it tends to form a vertical line with the other colon below, it doesn't disturb reading.

The parse will extract from the above examples the following <key>:<value> pairs (leading and trailing whitespace is stripped):

OS:Windows, Linux, Mac
English-Forum: http://www.purebasic.fr/english/viewtopic.php?
French-Forum: (empty)
German-Forum: http://www.purebasic.fr/german/viewtopic.php?

The parser itself is going to be "dumb", and just make a list of strings out of them (duplicates allowed). After parsing, the app will convert all keys to identifiers (ascii conversion, lowercase, spaces to underscore) and look up the identifiers in a Map (definable via a settings file) to determine if a key is of interest for building the resume or not (in the latter case, just discards it).

This approach allows to easily integrate new keys into the sytem via settings file. Also, the map allows aliases to map to the same significant key, which might be useful in some cases.

For long <value> entries that span across multiple lines, a carry-on comment delimiter ;. will be used. After parsing a key-value pair, the parser will always check if the next line starts by ;., and if it does it will carry on parsing the following lines until a non-carry-on comment is encountered (or the end of the block). Example:

;:   Description: A very long descriptive text of what this piece of code
;.                does and doesn't do. It keeps going on for serverl lines,
;.                This is the last line.
;:            OS: Windows, Linux, Mac
;: English-Forum: http://www.purebasic.fr/english/viewtopic.php?
;:  French-Forum: 
;:  German-Forum: http://www.purebasic.fr/german/viewtopic.php?
; -----------------------------------------------------------------------------

Again, the dot of the ;. delimiter is non-ivasive and blends well with the colon and semicolons. Also, both : and . are easy to remember (as the : is also used as a separator after the actual key).

In carry-on values, whitespace will be trimmed differently: the indententtion of the first carry-on line becomes the base indentation that will be stripped off from the rest of the following lines, so that any intended indentation will be preserved in the final block of text.

When using carry-on values, the value might actually start on the second line altogether:

;: Description:
;.    A very long descriptive text of what this piece of code
;.    does and doesn't do. It keeps going on for serverl lines,
;.    This is the last line.

... this allows some flexibility, and to have same-width text in the extracted text (most likely, the description text will be rendered as an HTML <pre> block, since it often contains significant spacing, ascii lists, etc.).

Finally, our special comment delimiters should allow the special comment marks used for folding by PB IDE ({ }), as some users might add the folding marks to allow shrinking away the header block:

;{: Description:
;.    A very long descriptive text of what this piece of code
;.    does and doesn't do. It keeps going on for serverl lines,
;}.   This is the last line.

The - mark is not expected to be found in this context (and, unlike the folding mark, it would brake due to the adjacent : anyhow).

So, resuming:

we'll be only adding a : or a . to the comment delimiter of the lines which are meaningful to the parser; this is going to be easy also on the code maintainers.
We'll be extracing <key>:<value> string pairs, which will be then looked up in a Map that allows aliases and slight text variations, so that the app can establish which key is what. (for example, created_by could be an alias of author; French and German translation could map to English, etc.)
The entires of the Map used to correlate found keys to variables which are meaningful for the creation of the cards can be changed and extended via a simple settings file.
The app will then further manipulate the value strings according to what they are used for (dates, OS names, authors lists, etc), using dedicated code for each specific key — some values might need manipulation (dates, version numbers, etc.), others not, depending if we need to the data to make decisions on it or if we just need to pass it on as is.

For example, a key like OS should be smart to understand not only "OS: Windows, Linux, Mac" but also all, macOS, OSX, XP, etc. But these details will be handled once there is some working code.

What do you think of this approach? Is it going to be simple enough on the coders side, and flexible enough for the project maintainers?

I would think that the burden of updating the project pages should be on the project admins' shoulders, and that pull requests should concern code only.

Yeah, that makes more sense. I haven't given it enough thought.

Alternatively, we could use GitHub web API to convert the (very few) markdown blocks of text: it's slower because of the internet connection, but works great.

The variant with the tool "pandoc" is now very good, after I realized that only the repository administrators have to install and execute the tool and not all collaborators. It is better when no internet access is required to generate HTML pages.

The pandoc need was due to the fact that, beside the source files resume cards (which will be handle by the PB app entirely), we'll need an introduction page for the home, and maybe a descriptive paragraph for each folder/section. For the homepage, the ideal would be to have the app just take the README.md file and convert it to HTML, and then inject it into the website/pages template via the PB app.

For each section a README file, which is used for the introductory texts of the HTML pages. Very good idea!

Using GitHub's API should be fine too, after all we'll be dealing with a limited number of pages.

As I have already written above, I prefer an offline variant that generates the HTML files. The advantage of the variant with Github API is - if I'm right - that here no tools (pandoc) have to be installed by the repository administrators and the interaction with Github API can be done via a tool written exclusively with PB (without third-party programs). But I ask myself if the PB network commands are sufficient for the Github API. The commands are not very mature, in the support of modern network technologies (e. g. REST API).

For example, a key like OS should be smart to understand not only "OS: Windows, Linux, Mac" but also all, macOS, OSX, XP, etc. But these details will be handled once there is some working code.

The listing of all supported operating system versions (XP, Win7, Win8, Win10) doesn't make it any easier to understand. In addition, such a detailed specification requires a lot of tests, which we can't perform in such a small team. In addition, the team doesn't have all operating systems available for such tests - and the team should be able to test the codes before they are included in the archive. The codes in the archive should always work with the most common operating system versions. If a code that no longer works under a modern MacOS version, even though it ran in an older MacOS version, the code should be removed from the archive. But as you said before, we can discuss this later on in a new thread/issue, otherwise it gets too confusing here.

What do you think of this approach? Is it going to be simple enough on the coders side, and flexible enough for the project maintainers?

Sounds pretty good to me.

If you want to add multi-line support to the code header entry "description" to allow more detailed descriptions, it would probably be good if there was an entry in the code header that only allows a short description. I have kept the description of the existing codes short, because I have planned them to display the description in a commands overview.

Hi Sicro, I've just finished publishing a first working draft in a dedicated repository:

https://github.com/tajmone/PBCodeArcProto

The <key>:<value> comments parser is already working — but nothing is done with the extracted data for now.

I thought that having a dedicate repo for developing this app would be better, especially since it allows us to create Issues in total freedom, without cluttering this project.

I'll add you to the project admins — I'm not asking you to actively engange in it, as I said I don't want to add a burden to your dedication to the main project (my idea, mine the bills to!). But you're most welcome to, if you feel like.

It's also an easy way to follow the progress of the project — by adding the repo to the Watch list, anyone can get a notification when something new is going on in the project.

I have no idea how long it will take before a working app will be ready, but since the PureBasic-CodeArchive-Rebirth is not dependent on this application time shouldn't be an issue. I'll be working on-and-off the project, because I have a few other things that I need to dedicate time to. Hopefully, it shouldn't take too long; and my guess is that most of the time will end up in creating the CSS files, tweaking the process, etc.

Anyhow, this looks like a good start, and on the new repo we can create an Issue for every single thing, be it an idea, a comment or an actual bug found.

After reading your last comments, and looking again at the actual files headers, I think we could then make it even simpler:

the whole parsing system will be build around the current headers (real case scenarios over theoretic benefits)
the paerser is totally dumb: it extracts <key>:<value> string pairs as they are, in the order they appear in the header, and beside trimming leading and trailing whitespace it doesn't try to interpret them but just insert them in the resume card as they are.
if <value> string matches a link pattern ((f|ht)tps?://) then it will be rendered as a link.

the above means that if we were to add a new file with:

;:   GitHub repo: https://github.com/MrX/SOme-Project
;:       Website: https://MrX.github.com/SOme-Project

... these will show up as links (just like the forum links) without having to change anything in the app's code or settings.

As for the multiline values, I think we should just keep it for the sake of long entries, but I could change its default behavior:

carry-on line are joined together as single paragraphs, except if an empty line is encountered. This means the deafult is to render multiline <value> string as plain HTML.
the syntax <key>: | <value> (ie: when <value> is preceded by a pipe |) will enforce a verbatim approach on the value string: whitespace is preserved and the value is rendered as <pre> block. This is how the parser currently handles multilines (spaces and linebreaks are preserved).

The above means that the following:

;:  description: Some rather long description which is broken into multiple
;.      lines only to keep comments lines within 78 columns.
;.
;.      A second paragraph.

... will be rendered in the final HTML as:

<p>Some rather long description which is broken into multiple lines only to keep comments lines within 78 columns.</p>

<p>A second paragraph.</p>

while the following:

;:  description: |
;.      A multiline description intended by its author as a verbatim block:
;.
;.         1. Whitespace is significant
;.         2. And should be preserved as-is, without attempts
;.            to interpret it.

... will be rendered in the final HTML as:

<pre>A multiline description intended by its author as a verbatim block:

   1. Whitespace is significant
   2. And should be preserved as-is, without attempts
      to interpret it.</pre>

(I suggested the | symbol because it's used in pandoc markdown for this very purpose, so it should familiar to some users. YAML also uses it for multiline blocks.)

We might not be using the verbatim right now, but it's worth keeping it as it's likely that sooner or later it will be useful.

For the task of building HTML cards the simplest approach is the best, and key/values interpretation is not required. The original approach I came up with was slightly more complex because I was thinking of the possibility of resuing the parsed data to build the catalogue GUI app, which would require data interpretation for indexing.

Now the parser creates an HTML Resume Card for every parsed file and saves it to file as <filename>.html. You can preview an HTML card here:

http://htmlpreview.github.io/?https://github.com/tajmone/PBCodeArcProto/blob/master/test_files/02_multi-line-simple.pb.html

When converting the extracted <key>:<value> pairs to HTML:

Key and Value strings are XML escaped
EOLs in Value strings are converted to <br />.

Links in value strings are not yet handled, I'll add a RegEx to catch any URL and turn it into an HTML link.

As you can see, I've simplified the whole parsing approach: value strings are trimmed and joined into paragraphs, blank carry-on comment lines separate paragraphs — the whole verbatim and indent preservation system has been discarded for now.

This allows to split into multiline comments some of the existing comments which are very long (far beyond 80 columns) in the source files, and have them show as a single paragraph in the html page.

Anyhow, I think that from here on any further discussion on the app should be taken on in its repo's issues:

https://github.com/tajmone/PBCodeArcProto/issues

I'll be opening issues for any important topic regarding the app's design, problems, etc.; so you'll get notified by Wacthing the repo.

SicroAtGit / PB-CodeArchiv-Rebirth