Raku / doc

🦋 Raku documentation
https://docs.raku.org/

Standardize search categories #1410

Closed coke closed 2 years ago

coke commented 7 years ago

The current build generates the following categories of items for the (search) index.

""
"&?Routine"
"-->"
"0B (Radix Form)"
"5to6-perlfunc"
"Buildall (Method)"
"Class"
"Coercion Type (Signature)"
"Compunit"
"Control Flow"
"Declarator"
"Enum"
"Environment Variables"
"Function Reference (Constrain)"
"Infix"
"Language"
"Listop"
"Matching Adverb"
"Method"
"Parameter"
"Phasers"
"Postcircumfix"
"Postfix"
"Prefix"
"Proc Object"
"Proc::Async Object"
"Programs"
"Quote"
"Reference"
"Regex Adverb"
"Regex Quantifier"
"Regex"
"Role"
"Routine"
"Running Programs"
"Sub"
"Subscript Adverb"
"Syntax"
"Term"
"Trait"
"Type Constraint"
"Type"
"Variable"

Some of these have potential overlap, like Class & Type; others, like 0B (Radix Form) or Proc::Async Object, shouldn't be top-level categories at all. Items with only one entry are also suspect, like "Parameter".
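One rough way to spot the single-entry categories is to count index entries per category. This is a Python sketch, not the Raku build code, and the `entries` list is made-up sample data rather than real build output.

```python
from collections import Counter

# Hypothetical (category, name) pairs standing in for real search index entries.
entries = [
    ("Method", "push"),
    ("Method", "pop"),
    ("Class", "Array"),
    ("Type", "Array"),
    ("Parameter", "slurpy"),
]

def singleton_categories(entries):
    """Return the sorted categories that hold exactly one entry."""
    counts = Counter(category for category, _ in entries)
    return sorted(cat for cat, n in counts.items() if n == 1)

print(singleton_categories(entries))
```

A single entry does not prove a category is wrong, so the result is only a list of candidates to review by hand.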

AlexDaniel commented 7 years ago

Can we have a TODO list of items that should be fixed or eliminated? Like

Mark things that were fixed (or that should not be fixed) as ✓.

@coke what did you do to get this list? Can you update the list in this ticket?

JJ commented 6 years ago

Ping :-)

JJ commented 6 years ago

Once again, these categories are generated automatically by htmlify. Besides those above, &?BLOCK now has a category too. Related to #1823.

JJ commented 6 years ago

The CompUnit category has disappeared. As for the rest, I think it's better to create a test for what's left there; not clear to me what's the desired target number of categories, if any. Also, not clear how new categories are created. Maybe add something about that in the CONTRIBUTING document.

JJ commented 6 years ago

Still not too clear how categories are created... I'll check this out.

JJ commented 6 years ago

The problem is that this

doc/Language/glossary.pod6:=head1 X<Semilist>

generates a "" category listing, and I don't know why. I also don't know which is better: fixing the index entries (there are a lot of them) or fixing htmlify.p6... #1823 is relevant here.

antoniogamiz commented 5 years ago

Category creation

Currently, categories are assigned using the subkind attribute, except for complete documents, where kind is used as the category. Most of the values subkind can take are fixed (check the subkind token).

The problem is X<> elements in headings. Each one is always considered a valid definition, and its meta part is taken as the subkind. So, if we have an element X<a|b>, a category 'b' will be created.
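To make the rule concrete, here is a minimal Python sketch of it. It is not the Perl6::Documentable code: the function names and the regex are assumptions for illustration, and the regex only handles a simple, non-nested X<text|meta>.

```python
import re

# Matches a simple X<text|meta> formatting code; the "|meta" part is optional.
X_CODE = re.compile(r"X<([^|>]*)(?:\|([^>]*))?>")

def category_for(kind, subkind, is_whole_document):
    """Whole documents use `kind`; everything else uses `subkind`."""
    return kind if is_whole_document else subkind

def heading_category(heading):
    """Return the meta part of the first X<> in a heading, or None without X<>."""
    match = X_CODE.search(heading)
    if match is None:
        return None
    # In this sketch an X<> without a meta part yields an empty string.
    return match.group(2) or ""

print(heading_category("=head1 X<a|b>"))  # prints "b"
```

Nothing in this flow checks the meta part against a fixed set of values, which is why any string after the `|` ends up as a category.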

How to obtain these values

You can download Perl6::Documentable and execute:

use Perl6::Documentable::Registry;

# $cache, $topdir and $v are not defined in this snippet; set them before running.
my $registry = Perl6::Documentable::Registry.new(
        :$cache,
        :$topdir,
        :dirs(["Language", "Type", "Programs", "Native"]),
        :verbose($v)
);
$registry.compose;

# json list containing all search entries
say $registry.generate-search-index();

Or you can go to http://docs.perl6.org/js/search.js and copy paste the content of var item.

Solution

I do not know what I should do with these. We need to discuss how X<> elements should be treated or propose an alternative to how subkind is set in this case. I can think of two options:

Let me know your opinions.

Altai-man commented 5 years ago

We need to discuss how X<> elements should be treated or propose an alternative to how subkind is set in this case

My small idea:

I suspect a lot of warnings will be produced on the first run, but once we get rid of them, it will be easy to maintain an understandable and solid set of search categories.

The second important thing is, of course, to have this list documented somewhere on the docs contributing page and to mention it in the warning message, e.g. "See allowed categories at foo.bar.com", so that people will be able to select a correct one.

It will put a bit of an end to all this X<> anchoring cargo cult we (or I, at least) do now. Secondly, it is just not very ok that someone can write X<my-bad-thing|bad-thing-again> and boom, we have a bad-thing-again category in search.

Altai-man commented 5 years ago

Just to be clear: even with no ill will intended, humans are not at their best when it comes to being perfect, so when someone lazy like me adds an anchor in the operators category and makes a typo, opeators, we suddenly have both an operators category and an opeators one, which is LTA. With a warning it is easy to fix that, but without one, who knows how much time it will take for someone to notice and write a patch.

Judging from the fact that we have categories like "Buildall (Method)", it is not always clear to people what should go in X<|>, whether it should be a category, what syntax it has, etc., and this lack of clarity is LTA.
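A warning of this kind could also point at the nearest allowed name, which would catch the opeators case directly. The following is a Python sketch rather than Documentable code; `ALLOWED` is a hypothetical accept-list, and `check_category` is an illustrative name.

```python
import difflib

# Hypothetical accept-list; the real list was still under discussion.
ALLOWED = ["operators", "phasers", "regex", "traits", "control flow"]

def check_category(category, allowed=ALLOWED):
    """Return None if the category is allowed, else a warning string."""
    if category in allowed:
        return None
    # Look for the single closest allowed name to suggest as a fix.
    hint = difflib.get_close_matches(category, allowed, n=1)
    suggestion = f' Did you mean "{hint[0]}"?' if hint else ""
    return f'Unknown search category "{category}".{suggestion}'

print(check_category("opeators"))
```

The warning text could also carry the link to the documented list of allowed categories mentioned above.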

antoniogamiz commented 5 years ago

Mm, I like your idea @Altai-man, it's doable. Now the problem is to define the list of possible values for $foo. Here is a list with all categories. x-categories contains all categories coming from X<> elements (the ones we need to standardize).

Altai-man commented 5 years ago

What do you mean by "list of possible values for $foo"?

My quick thoughts about x-categories:

Should be:
regex
regex quantifier
parameter
matching adverb
regex adverb
substitution adverb
"hyper" probably should be "operators", why don't we have such a category
"Async Phasers" are probably just "Phasers"
control flow
Zen slice (Basics) should have a better name
block (Basics) should have a better name
everything under `statement prefix` becomes it, e.g. `statement prefix` instead of `eager (statement prefix)`
identifier
macro
pack should be something better, what is it, a sub?
`is default` is just (Traits)
v6 should have a better category
`with orwith without` should be (control flow)
5to6-perlfunc - do we need it?

I included "fitting"(IMO) ones, commented some weird ones and excluded unfitting ones.
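Some of the renames suggested above are mechanical enough to express as a lookup plus one pattern. The Python sketch below covers only a few of them; the `RENAMES` table and the `normalize` name are illustrative assumptions, not an agreed mapping.

```python
import re

# Hypothetical renames drawn from a few of the suggestions above.
RENAMES = {
    "Async Phasers": "Phasers",
    "is default": "Traits",
    "with orwith without": "control flow",
}

# "eager (statement prefix)" -> captures "statement prefix".
QUALIFIED = re.compile(r"^.+\((statement prefix)\)$")

def normalize(category):
    """Map an old category to its proposed replacement, else return it unchanged."""
    if category in RENAMES:
        return RENAMES[category]
    match = QUALIFIED.match(category)
    if match:
        return match.group(1)
    return category

print(normalize("eager (statement prefix)"))  # prints "statement prefix"
```

Categories flagged only as needing "a better name" are left untouched here, since no replacement was proposed for them.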

antoniogamiz commented 5 years ago

$foo refers to your previous comment.

5to6-perlfunc is a little "weird", because that file is handled apart from the rest. Only parts coming from `5to6-perlfunc.pod6` have that category.

If everyone is happy with those categories I can change the X<> elements and throw a warning when a new one is found.

Altai-man commented 5 years ago

Ah, I see. I see a bunch of inconsistencies in "All categories" list too. Also, why do we have "Python" as a search category? Why not "Haskell" or "Ada"?

@antoniogamiz you should probably ask someone experienced to write out categories of things in the language, because :cached, Python, lazy (statement prefix), :sym<> are hardly correct categories.

antoniogamiz commented 5 years ago

@Altai-man those categories come from the current X<> elements in the docs. You can even create your own categories by writing X<s|thecategoryyouwant> in some pod file.

Altai-man commented 5 years ago

I have already explained why this is a bad thing in https://github.com/perl6/doc/issues/1410#issuecomment-516032885. I think we should change those into a nicer list.

For example, search category Package Type has - class, role, etc. Or search category trait can show you is default, is rw, etc. They do make sense. But when one has a category :cached or react (statement prefix), what will be there? Only :cached experimental pragma? Only react statement prefix?

antoniogamiz commented 5 years ago

I know, I only said that now we need to replace the right side of those X<> elements, and if everyone agreed, I can replace them with your suggestions (I do not know enough to propose correct category names).

Altai-man commented 5 years ago

we need to replace the right side of those X<> elements

Right.

I do not know enough to propose correct category names

Unfortunately, the same goes for me.

Altai-man commented 3 years ago

An update: current "categories" are:

So in three years it became more messy.

coke commented 3 years ago

I can add a test to catch that we don't add any new ones, at least, until we decide what the correct listing is. Any interest?
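As a rough illustration of such a test (in Python rather than Raku, and with a made-up baseline instead of the real category list), the check is a set difference against the known categories:

```python
# Hypothetical baseline: a small sample standing in for today's categories.
KNOWN_CATEGORIES = {"Class", "Method", "Routine", "Syntax", "Trait"}

def new_categories(current, known=KNOWN_CATEGORIES):
    """Return the categories in `current` that are absent from the baseline."""
    return sorted(set(current) - set(known))

def assert_no_new_categories(current):
    """Fail loudly when the build produces a category outside the baseline."""
    unexpected = new_categories(current)
    assert not unexpected, f"New search categories found: {unexpected}"

print(new_categories(["Class", "Python", "Method"]))  # prints ['Python']
```

This freezes the current state without judging it: existing odd categories still pass, and only additions fail.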

Altai-man commented 3 years ago

I can add a test to catch that we don't add any new ones, at least, until we decide what the correct listing is. Any interest?

This would be very helpful!

I am working on "deciding" the correct listing this exact moment...

Altai-man commented 3 years ago

Questions / things to note:

What categories I suggest (in parentheses I put items from what we have now that will be absorbed, if something is not present it should be removed):

What bothers me is that we currently have "Language", "Syntax", "Reference". What is there is often mis-categorized (e.g. builtin variables should go to "Variables", not to "Reference") or mixed. What is even the purpose of those three?

I imagine "Syntax" explains syntax bits (quoting, keywords, etc), "Reference" explains semantic bits (what twigils are, things like that) and "Language" is, uhh... I know it contains pages under "Language" category, but still as search items their titles are not always very welcome. But anyway, we probably can live with them not to over-complicate.

Thanks for reading to this point... What should be done next:

coke commented 3 years ago

I was unable to zef install Perl6::Documentable - not found; makes it hard to run the snippet above. Any suggestions on how to programmatically get the list?

Altai-man commented 3 years ago

Do zef install Documentable, that's the way. I think it is a dependency of this repo, by the way, no?

Then you do:

use Documentable::Search;

# $host is not defined in this snippet; it has to come from the surrounding code.
my $search-generator = Documentable::Search.new(prefix => $host.config.url-prefix);
my @items = $search-generator.generate-entries($host.registry);

and grep on the lines produced by it. If you don't like lines, you can look at the implementation of generate-entries and copy some code.

JJ commented 3 years ago

Questions / things to note:

* Does "quoting" deserve a category on its own or being inside "Syntax" is good?

It's got its own braid and all. So yes.

* Should "Proto regexes" be categorized as "Syntax" OR "Regex"? On one hand it is a regex thing, but `my`, `our` should be categorized into "Syntax".

Probably

* We should not have a "Python" category. At all, this is wrong at so many levels.

I remember vaguely we've been there already. Let me search back issues.

* "rakudoc" category with single "INTRODUCTION" must be just "Rakudoc" item in "Language" category

OK

* https://docs.raku.org/syntax/syntax is something we all should regret to have and we want to put a redirect and get rid of this

Why? And please, no redirects... Let's just try and have things that are programmed and tested and well specified.

* [ ]  Programs - How to debug, run, type

All these categories are reasonable; but I'd like to see which ones are removed. That's probably more significant.

What bothers me is that we currently have "Language", "Syntax", "Reference". What is there is often mis-categorized (e.g. builtin variables should go to "Variables", not to "Reference") or mixed. What is even the purpose of those three?

Hard to say. They were already there when I arrived. Would probably need to dig into the blame for those lines.

I imagine "Syntax" explains syntax bits (quoting, keywords, etc), "Reference" explains semantic bits (what twigils are, things like that) and "Language" is, uhh... I know it contains pages under "Language" category, but still as search items their titles are not always very welcome. But anyway, we probably can live with them not to over-complicate.

Syntax is generally those things that are pure syntax and are not a function or anything like that like if or do. But here's the thing, I think we're mixing two different things here. One's search term categorization, which is something, other is page categorization, which is... totally different and created somewhere else.

Thanks for reading to this point... What should be done next:

* If the list above is not ok - comment

I'd rather see a list of what needs to disappear.

* If it's ok - write a test that screams when it sees anything but the categories mentioned.

The thing is, page generation is tested under CircleCI, which is the only one that has Documentable. I'd like to keep it there, if possible. Documentable is not a stable module, and I'd really not like it to be a dependency of this, so if it's needed for some test, please take it to the CircleCI build.

* Re-categorize all the very wrong cases correctly in small chunks checking if a separate page for an item is broken somehow.

While making sure that there's no big change in URLs. There shouldn't be, at least at the path level, but just to make it clear.

JJ commented 3 years ago

Re Python, see #2355

Altai-man commented 3 years ago

Probably

A "Should we choose X or Y" question does not get on with a "Probably" answer.

Why?

Because someone has indexed https://docs.raku.org/language/functions#Blocks_and_lambdas so wrong it created a page with URL literally being /syntax/syntax. The text of the item was extracted, but presented on the page with a completely different (and making no sense) (Functions) -> syntax syntax title. This is wrong because the user can literally type syntax in search bar and they will see a "valid" item from -> category that leads to a page incorrectly formed.

This is why it should be re-indexed and this URL should point to something else, thus a redirect.

"No redirects" is not an option; we have fallen soldiers, we had them before, and we will be forced to have them from now on. "No redirects" means "Abandon them". And it does not interfere with having a stable implementation in any way.

but I'd like to see which ones are removed. That's probably more significant

^ removed completely. Everything else from the list just migrates to new categories suggested (corresponding items are noted in parentheses after category names).

Syntax is generally those things that are pure syntax and are not a function or anything like that like if or do. But here's the thing, I think we're mixing two different things here. One's search term categorization, which is something, other is page categorization, which is... totally different and created somewhere else.

Yes, that's the feeling I have. But we can live with it for now, I'd say.

I'd rather see a list of what needs to disappear.

See above, plus what should be absorbed is already proposed above.

Documentable is not a stable module

Why?

JJ commented 3 years ago

Probably

A "Should we choose X or Y" question does not get on with a "Probably" answer.

Well, leaning towards "yes", but don't have a strong opinion for this.

Why?

Because someone has indexed https://docs.raku.org/language/functions#Blocks_and_lambdas so wrong it created a page with URL literally being /syntax/syntax. The text of the item was extracted, but presented on the page with a completely different (and making no sense) (Functions) -> syntax syntax title. This is wrong because the user can literally type syntax in search bar and they will see a "valid" item from -> category that leads to a page incorrectly formed.

But I would say that's an example of bad indexing, not a systemic problem. And that can be fixed now. Probably, going forward, there should be a way of banning this kind of things (but I don't really see how we can prevent all possible mistakes)

This is why it should be re-indexed and this URL should point to something else, thus a redirect.

Still, I'd rather have no redirects. Right now there are a few tweaks you have to make (mainly to serve files with no extension as HTML), as well as some special treatment for things that have a "." in their names. If we want to ban that kind of thing in the index, so be it: let's add a test, or whatever, to avoid that. But "solving" it with a redirect is simply kicking the ball down the field.

"No redirects" is not an option; we have fallen soldiers, we had them before, and we will be forced to have them from now on. "No redirects" means "Abandon them". And it does not interfere with having a stable implementation in any way.

Well, I should maybe qualify that. First, expanding it to meaning "don't try to solve any problem with the document generation using infrastructure". Second, qualifying it to mean "anything that needs special treatment should be well specified and dealt with within the doc generation framework".

but I'd like to see which ones are removed. That's probably more significant

* [ ]  Python

See #2355. Please reopen it if you really have a strong opinion.

* [ ]  hash (Basics)

* [ ]  scalar (Basics)

* [ ]  statement (Basics)

* [ ]  string literal (Basics)

These all originated in the 101 page, which was incorporated from somewhere else. It probably makes sense to eliminate them, but then again, indexing policy is not something that should be done in an ad-hoc way. And then, making an accept-list the default policy does not really solve the issues related to indexing that are there: #3458 and #3520, for instance. Also #2575, which was closed and probably should not have been. We don't even have a unified criterion for category naming; these above should probably be banned just on the basis of using parentheses...

* [ ]  TOP

And this one on the basis of using all CAPS.

* [ ]  topic variable (Basics)

* [ ]  variable interpolation (Basics)

* [ ]  :cached

* [ ]  classes

* [ ]

We should keep the all whitespace search category. Just kidding.

* [ ]  ->

^ removed completely. Everything else from the list just migrates to new categories suggested (corresponding items are noted in parentheses after category names).

Except for Python (and maybe classes? I really have no idea about that one), I mostly agree. The problem is not whether we agree on these categories; it is that we need to create a spec for categories and have all existing ones follow that spec, raising a warning if someone creates a new category and an error if it is not up to spec. This applies to Python, for instance, and mostly to any of them. We can discuss all the way to Mendocino and back whether Python should be in (or whether we should also add Ruby or Perl; BTW, all perl2x categories are special-treated in the documentation, IIRC), but at the end of the day it's a judgment call. Having a search category spec or rule that can be enforced will put us on different ground.
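To show what an enforceable rule could look like, here is a Python sketch (not Documentable code) that encodes only the naming objections raised in this comment: no empty names, no parentheses, no all-caps. The rule set and both function names are assumptions for illustration, not an agreed spec.

```python
def spec_violations(category):
    """Return the list of naming rules the category breaks (empty if none)."""
    problems = []
    if not category.strip():
        problems.append("empty or whitespace-only")
    if "(" in category or ")" in category:
        problems.append("contains parentheses")
    if category.isupper():
        problems.append("all caps")
    return problems

def review(category, existing):
    """Classify a category as 'error', 'warning', or 'ok'."""
    if spec_violations(category):
        return "error"    # not up to spec
    if category not in existing:
        return "warning"  # up to spec, but new
    return "ok"

print(review("hash (Basics)", {"Method"}))  # prints "error"
```

Rules like these decide form only; whether a well-formed category such as Python belongs in the index remains the judgment call described above.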

Syntax is generally those things that are pure syntax and are not a function or anything like that like if or do. But here's the thing, I think we're mixing two different things here. One's search term categorization, which is something, other is page categorization, which is... totally different and created somewhere else.

Yes, that's the feeling I have. But we can live with it for now, I'd say.

I'd rather see a list of what needs to disappear.

See above, plus what should be absorbed is already proposed above.

Documentable is not a stable module

Why?

The main problem, the way I see it, is that it's not tested against what we want to achieve from it. It's unit tested (and that's a big improvement over the htmlify.p6 we had before), it's tested for build errors (in CircleCI), but there's no test that checks if what's generated will fit what we already have in the deployed docs. That's bitten us (hard) several times in the past. It might be that the coverage of the unit tests is not really complete, and for the time being we don't have a coverage test in Raku to check that. This lack of thorough testing stems, in part, from the problems with ambiguity of documentation specs. Indexing rules are still ad hoc, generated URLs are not specified, we're just discussing that search categories are not specified either. That's bound to generate some instability. I'm really not blaming Documentable, I'm just stating a fact. The consequence of which is that there needs to be a very thorough (and manual) testing of doc generation before we use a new version. That's just the way it is; again, not complaining.

Altai-man commented 3 years ago

Probably, going forward, there should be a way of banning this kind of things (but I don't really see how we can prevent all possible mistakes)

The way is, IMO:

1) Spec a list of allowed categories (what this issue is about).
2) Absorb or fix every index case that does not match.
3) Apply a test @coke nicely suggested to write, checking if we have bad cases in an automated way.

And then, making an accept-list the default policy does not really solve the issues related to indexing that are there: #3458 and #3520, for instance

The first issue you refer to relies upon us having a spec/standard of search categories and points to this issue. So this issue must be resolved first and I proposed a solution above. The second issue is just a consequence of not having search categories spec available, resulting in ad-hoc category creation, again relying on resolving this.

We don't even have a unified criterion for category naming; these above should probably be banned just on the basis of using parentheses...

So let's create it. I made a proposal above stating the categories. If something is obviously wrong with that, let's tweak it. If it sounds sane, let's go with it, document it and close this ticket after the test is done and docs are adapted.

This lack of thorough testing stems, in part, from the problems with ambiguity of documentation specs. Indexing rules are still ad hoc, generated URLs are not specified, we're just discussing that search categories are not specified either.

Let's make it stable then. Having a standard for search categories is a step towards that. It does not suddenly solve every single issue, but we won't get anywhere without settling the specs one by one, because we don't have them or they are not complete right now.

I understand your worries about possibly messing things up in the process; it was shaky and partially still is.

Maybe it is worth working on this in a calm branch then? Potential changes there won't affect anything on master, and the new tooling will work with a branch easily. When/if it is stable enough and indeed nicer than it is now, it is not a big deal to just migrate the changes.

JJ commented 3 years ago

Well, some category errors are in this branch, so they should be fixed. I've already created an issue for that.

We need to be on the same page, however, regarding categories. You say accept list, but as I say, that's simply kicking down the ball. It means that we will have to discuss every single addition to the accept list. Let's try to do this: let's look at what the current list of categories has in common, and what makes them acceptable. Let's iterate until we deduce a set of rules from them. In this process, let's also take into account that documentation pages have metadata that includes categories, so we might want to have some common criteria for both. Eventually, we'll get a rule-based accept-list, but also a rule to test new categories as they are created. Also, I think that this procedure should take place in the problem-solving repo, for maximum audience. When we have that, let's get back here and solve this issue. Would that be acceptable for everyone?

Altai-man commented 3 years ago

Would that be acceptable for everyone?

Returning back to this. I don't think it should be such a bureaucratic process for such a trivial issue.

What is this issue really about?

We have a language. Its design is more or less done over last 20 years. In a language we have things. We document them. When user searches, we want to categorize things for them into some categories to ease the process. So, basically, we have classes, roles, routines, methods, infix ops, etc etc. Things like that.

There was no such list of categories from the start, and people came up with ad-hoc solutions they saw fitting at the time when the decision was made. Alas, it resulted in various mistakes (when totally incorrect things started to count as a category) or just inconsistent styling (e.g. uppercase vs lowercase).

What I suggest is to simply write a list here (there is a suggestion above; if you don't like it, comment on what exactly is wrong there until we are all satisfied). Document it, write a test and gradually update the docs we have now to adhere to the list. That will resolve this issue, once and for all, really.

There is no need for a whole process of acceptance or rules for new categories. New categories - will there be some?

but also a rule to test new categories as they are created

Do we really need one? Now, after 20 years, we have, say, classes and roles. Maybe tomorrow someone will invent croles as well - okay, we will simply add another category to the list, because this is a serious addition. Though, honestly, I doubt this will happen, but nonetheless.

If you really insist, I will initiate a problem solving ticket and be the one who suggests a solution (a copy-paste of what is above, but still), because I agree it is hard to make a decision for everyone to adhere when you feel responsible.

JJ commented 3 years ago

It's probably the best to have something working. But I would still like to have something resembling a "bureaucratic process", probably in a different thread, because I've not been here for a long time, but it's long enough to see how unsolved problems tend to show up again a few months down the line.

Altai-man commented 3 years ago

So we had https://github.com/Raku/problem-solving/issues/250 for more than 2 weeks and no progress.

Altai-man commented 2 years ago

So after 5 years, I believe this can be closed, as we have a standard list of categories set, clarified and reflected in the docs. Thanks to everyone involved.