tliacas closed this issue 1 year ago
This already exists and can be found at https://fr.wiki.lehub.ca/index.php/Spécial:Statistiques and https://en.wiki.lehub.ca/index.php/Special:Statistics
However, all that's missing is the total word count of all content pages.
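For reference, the same figures shown on Special:Statistics can usually be read programmatically through the MediaWiki API (`action=query&meta=siteinfo&siprop=statistics`). The sketch below is only an illustration: the `api.php` endpoint location for these wikis is an assumption, the helper names are hypothetical, and the `cirrussearch-article-words` key is assumed to appear only once CirrusSearch is installed and has indexed the content.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def statistics_url(api_endpoint: str) -> str:
    """Build a siteinfo/statistics query URL for a MediaWiki api.php endpoint."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return f"{api_endpoint}?{urlencode(params)}"


def summarize(response: dict) -> dict:
    """Pick the three figures requested in this issue out of an API response."""
    stats = response["query"]["statistics"]
    return {
        "content_pages": stats.get("articles"),
        "uploaded_files": stats.get("images"),
        # Assumption: this key is only present when CirrusSearch is active.
        "words": stats.get("cirrussearch-article-words"),
    }


def fetch_summary(api_endpoint: str) -> dict:
    """Fetch and summarize live statistics (performs a network request)."""
    with urlopen(statistics_url(api_endpoint)) as resp:
        return summarize(json.load(resp))
```

Calling `fetch_summary("https://en.wiki.lehub.ca/api.php")` would be the intended usage if the API lives at that (assumed) path; a `words` value of `None` would then mean the word count is not being reported.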
I will take a look at the above and report back
So regarding the word count, my research shows that you are right: CirrusSearch, which itself relies on Elasticsearch, seems to be the prescribed way to do this.
The next step will be to install Elasticsearch in its own container, then link it to MediaWiki and install CirrusSearch.
I will be on this next week. If all goes well, this might take a few hours; however, I cannot guarantee at this point that I will not run into unexpected issues.
Quick update on this: I installed Elastica, CirrusSearch and Elasticsearch in a local environment; however, indexing seems to be failing and the word count keeps displaying 0. I'll put in another few hours to try to debug this, then we'll have to see whether we want a plan B. I will report back shortly.
After a bit more tweaking, I managed to create a local wiki instance that correctly counts the words in my content. This is a good proof of concept and makes me fairly confident that we can have this on the live site by next Monday.
I'm having an issue updating the word count, which is described at https://stackoverflow.com/questions/75269346/mediawiki-cirrus-elastica-elasticsearch-how-to-update-words-in-all-content
The initial approach did not work on the production site: it would have required running Elasticsearch, which kept failing even when I doubled the resources allocated to the server. I was also running into a number of other issues with it.
I therefore went back to the drawing board and decided to just code it myself.
I'll close this as the word stats are now good. Regarding the page count issue, I have opened the follow-up https://github.com/LeHubca/lehub-mediawiki/issues/22
Hi Albert, as we promote the wiki and report progress, especially to funders, it would be great to have a special page where we can see overall wiki stats at a glance.
Something like https://en.wikipedia.org/wiki/Special:Statistics but much simpler, with only fields such as:
- total content pages
- total uploaded files
- total word count of all content pages
Was hoping this could be done through a simple plugin but could not find one.
When I researched how Wikipedia does this, here is what I found in a forum:
en.wikipedia.org/wiki/Special:Statistics does indeed rely on CirrusSearch to count the number of words in the content pages.
CirrusSearch relies on Elasticsearch to search content on the wikis, and we use the feature named token-count to count words. The field used is the text field, which is a plain-text version of the HTML output of the wiki page and does indeed contain table values and template content; it is mostly everything that is visible when viewing the page from a web browser. The way the text is tokenized into words is pretty standard.
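To make the idea concrete, here is a rough sketch of counting words over a plain-text version of rendered HTML. It is not CirrusSearch's or Elasticsearch's actual implementation: the tokenizer is a simple `\w+` regular expression, so counts can differ from Elasticsearch's (for example on apostrophes), and the function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return an approximate plain-text version of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def count_words(text: str) -> int:
    """Count runs of word characters; a simplification of real tokenizers."""
    return len(re.findall(r"\w+", text))
```

For instance, `count_words(html_to_text(page_html))` counts table cells and expanded template output along with body text, since all of it is part of the rendered page.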
If you are curious to know what is inside this text field, please append &action=cirrusDump to any wiki page URL. This will display a JSON document in which you will find the field named text (you might want to install a browser extension to display JSON properly, or copy/paste it into a tool that presents it nicely).
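The same inspection could be scripted instead of done in a browser. The sketch below assumes the dump is a JSON list of documents that carry their fields under a `_source` key (it also accepts a document with `text` at the top level); that layout, the `index.php?title=...` URL form, and the helper names are assumptions for illustration, not something confirmed in this thread.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def cirrus_dump_url(index_php: str, title: str) -> str:
    """Build a cirrusDump URL for a page, given the wiki's index.php URL."""
    return f"{index_php}?{urlencode({'title': title, 'action': 'cirrusDump'})}"


def extract_text(dump) -> Optional[str]:
    """Return the first 'text' field found in a parsed cirrusDump payload."""
    docs = dump if isinstance(dump, list) else [dump]
    for doc in docs:
        if not isinstance(doc, dict):
            continue
        # Assumption: fields live under "_source"; fall back to the document.
        source = doc.get("_source", doc)
        if isinstance(source, dict) and "text" in source:
            return source["text"]
    return None


def fetch_text(index_php: str, title: str) -> Optional[str]:
    """Download a page's cirrusDump and return its text field (network call)."""
    with urlopen(cirrus_dump_url(index_php, title)) as resp:
        return extract_text(json.load(resp))
```

If `fetch_text` returns `None` for a page that clearly has content, that would suggest the page has not been indexed, which matches the symptom of a word count stuck at 0.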