fedwiki / wiki-server

Federated Wiki client and server in Node.js

Make sitemap generation protect against overlong synopsis #136

Closed paul90 closed 6 years ago

paul90 commented 6 years ago

Currently the first paragraph is used as the synopsis in the sitemap. Some authors appear to be creating pages with a long first paragraph that contains the entire page.

We should protect against sitemaps becoming overly large by limiting the length of the synopsis.

Long term we probably should remove the synopsis, and move to a different search mechanism.

WardCunningham commented 6 years ago

Good point. I've thought about this on occasion but haven't been annoyed enough to act. Both client and server use the same code. A good place to apply a limit would be in the return statement of synopsis.coffee.

  return synopsis.substring 0, 140*4

I chose 140*4 (560 characters) as a reasonable limit since that is twice Twitter's newly generous upper bound.
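To get a feel for what this limit does to a long first paragraph, here is a minimal shell sketch of the same clipping. It is not the synopsis.coffee code: `clip_synopsis` and `LIMIT` are hypothetical names, awk's `substr` stands in for CoffeeScript's `substring`, and the input is assumed to be plain ASCII (how awk counts multi-byte characters depends on the implementation and locale).

```shell
# Hypothetical helper, not part of wiki-server: clip text to LIMIT characters.
LIMIT=$((140 * 4))   # 560, the limit suggested in this thread

clip_synopsis() {
  # Pass the text through the environment so awk does not interpret backslashes.
  # awk's substr is 1-based, so (1, limit) keeps the first `limit` characters.
  SYNOPSIS="$1" awk -v limit="$LIMIT" \
    'BEGIN { printf "%s", substr(ENVIRON["SYNOPSIS"], 1, limit) }'
}

# Build a 600-character sample paragraph and clip it.
long=$(awk 'BEGIN { for (i = 0; i < 600; i++) printf "x" }')
clipped=$(clip_synopsis "$long")
echo "${#long} -> ${#clipped}"   # prints: 600 -> 560
```

Text shorter than the limit passes through unchanged, so well-behaved pages would not be affected.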

Here is a neat unix command that will compute a distribution of synopsis lengths using 10-character bins. I apply this to my own sites where I try to set a good example.

curl -s $site/system.sitemap.json | jq '.[].synopsis|length/10|floor' | sort -n | uniq -c
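The binning step of the command above can be tried offline. The sketch below assumes the synopsis lengths are already available one per line, and it swaps jq's `length/10|floor` for awk's `int()` so that neither network access nor jq is needed. `bin_lengths` is a hypothetical name and the sample lengths are made up for illustration.

```shell
# Hypothetical helper: read one length per line, print "count bin" pairs,
# where bin N covers lengths N*10 through N*10+9.
bin_lengths() {
  awk '{ print int($1 / 10) }' | sort -n | uniq -c
}

# Made-up sample lengths: two in the 20s, one in the 30s, three in the 140s.
printf '%s\n' 23 27 31 140 145 149 | bin_lengths
```

With these sample values the output should list bin 2 twice, bin 3 once and bin 14 three times, in the same count-then-bin layout as the listings in this thread. The exact padding before each count depends on the local `uniq`.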

For site=ward.asia.wiki.org we find nearly all fit within one half of this limit.

Note: 4 2 means 4 pages had a synopsis length in the range 20-29 characters.

   4 2
   2 3
   5 4
   4 5
   3 6
  10 7
  17 8
  13 9
  12 10
  18 11
  11 12
  19 13
  15 14
  24 15
  18 16
  16 17
  15 18
  13 19
   8 20
  18 21
   8 22
  12 23
   8 24
   7 25
   6 26
   3 27
   8 28
   5 29
   3 30
   3 31
   1 32
   3 33
   1 34
   2 38
   1 39

For site=ward.bay.wiki.org I seem to run on a bit longer on occasion with one synopsis clipped.

   1 2
   3 3
   2 4
   2 5
   4 7
   1 8
  10 9
   6 10
   5 11
   6 12
  11 13
   9 14
  17 15
  24 16
  13 17
  11 18
  21 19
  16 20
  23 21
  14 22
  13 23
  13 24
  15 25
  14 26
  11 27
  14 28
  10 29
   9 30
  15 31
  11 32
  14 33
   6 34
   5 35
   6 36
   7 37
   5 38
   8 39
   1 40
   2 41
   1 42
   2 43
   2 44
   1 45
   5 46
   1 47
   1 48
   1 49
   1 50
   2 51
   1 52
   1 58

paul90 commented 6 years ago

Closed by fedwiki/wiki-client#201