ether / etherpad-lite

Etherpad: A modern really-real-time collaborative document editor.
http://docs.etherpad.org/
Apache License 2.0

Reliable backups for high-activity pad #3250

Closed njoyard closed 4 years ago

njoyard commented 7 years ago

Hi,

We run a production instance of etherpad-lite 1.6.1 for a nonprofit. Apart from normal use across many pads, one specific pad has a lot of activity and history: it serves as a board of things to do, in progress, and recently done. Several members of the org update it many times every day, and it is hardly ever deleted and recreated.

This makes it a pad that comes to have tens of thousands of revisions after a few months. I'm quite sure this is not really what etherpad-lite was designed for, but "unfortunately" the org members like this way of working very much and are very used to it and we've still not found a better tool.

We've already had several catastrophes with this specific pad due to kernel panics, unclean shutdowns, mysql restarts (mostly for security upgrades) without stopping etherpad first, and corrupt changesets that leave the high-activity pad unreadable (it fails client-side with Failed assertion: Invalid changeset (checkRep failed)). Other lower-activity pads on the instance seem to cope with those events quite nicely though. Additionally, some members sometimes have a faulty connection that makes their browser reconnect very often, and I wonder whether each reconnect generates even more revisions to fade their author color. That's a secondary problem, but if so, it makes the pad history grow even faster and, as I perceive it, increases the chances of failure.

Restoration attempts for this pad usually include:

That situation led me to try to set up frequent backups of the whole instance. I'm currently doing hourly mysql dumps (at least for the last 24h). Unfortunately, I discovered that restoring those backups also leads to a nonworking checkRep failed pad, which led me to believe that running mysqldump actually produces a faulty database image unless etherpad is stopped.

I would have used the API to make backups but after a few weeks/months of activity the API calls just take longer than the backup interval. And stopping the instance every hour to run mysqldump would be quite disruptive.

So here are my questions:

Here is a faulty mysqldump for reference. The high-activity pad ID is "affaires-courantes". All our activity and pads are public so there's no risk of disclosing personal/secret information here.

Thanks :)

lpagliari commented 7 years ago

I don't have any pad with that many changes to test with, but my first try would be to export the pad in the "etherpad" format (e.g. https://beta.etherpad.org/p/affaires-courantes/export/etherpad). I'm not sure whether that is what you meant by "use the API to make backups", so forgive me if this is something you've already tried.
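
For a scheduled backup, that export URL could be fetched from a small script. This is only a minimal sketch, assuming a runtime with a global `fetch` (Node 18+) and a publicly readable pad; the function names `etherpadExportUrl` and `downloadPadExport` are made up for illustration and are not part of Etherpad.

```javascript
// Build the export URL for a pad in the native "etherpad" format.
// baseUrl and padId come from the caller; nothing is read from Etherpad's settings.
function etherpadExportUrl(baseUrl, padId) {
  const base = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  return `${base}/p/${encodeURIComponent(padId)}/export/etherpad`;
}

// Download the export and return its body as text.
// Assumption: global fetch is available (Node 18+ or a browser).
async function downloadPadExport(baseUrl, padId) {
  const res = await fetch(etherpadExportUrl(baseUrl, padId));
  if (!res.ok) throw new Error(`export failed with HTTP ${res.status}`);
  return res.text();
}
```

Calling `downloadPadExport('https://beta.etherpad.org', 'affaires-courantes')` and writing the result to a timestamped file would give one backup per run. Whether the export stays fast enough on a pad with tens of thousands of revisions is something to measure.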

njoyard commented 7 years ago

I have not tried that yet; I will give it a go when we reach a few thousand revisions.

njoyard commented 7 years ago

Here is a backup made using the API. Unfortunately, upon restoration, it produces the dreaded Error: Failed assertion: Invalid changeset (checkRep failed) 😢

affaires-courantes.zip

JohnMcLear commented 4 years ago

Sorry about the delay here, things have been busy. Since you posted this, Etherpad has had lots of stability patches. Even so, I'm grabbing your SQL and will write a script to get the last good data out using some regressive checks for pad stability. It will return the HTML of the last stable pad, and ideally I will make it call createPad with that HTML. If I can, I will dump the atext so no data is lost. Watch this space.

JohnMcLear commented 4 years ago

Okay so I imported your broken .etherpad file and bin/extractPadData.js does get the pad object and I can view the atext. Is this the initial atext?


```
[2020-05-07 12:52:21.632] [INFO] console - pad:RJIedIxhEhRgDy-ETGNJ {
  atext: {
    text: 'En cours de traitement à Regards Citoyens\n' +
      '\n' +
      "Si  vous  êtes perdu, demandez de l'aide à fmassot, kerneis, Massiliane, njoyard, Roux, teymour, tarNeFys sur notre canal IRC :   #regardscitoyens@irc.freenode.net https://kiwiirc.com/client/irc.freenode.net/?nick=citoyen&#regardscitoyens\n" +
      '\n' +
      "Merci d'indiquer votre nom ou votre nickname en haut à droite afin que vous apparaissiez dans la liste des contributeurs\n" +
      '\n' +
      'Vous pouvez naviguer plus simplement sur ce pad en utilisant SummarizePad https://bjperson.github.io/summarize-pad/\n' +
      '\n' +
      '__________________________________________________________\n' +
      '\n' +
      'Planning SuperHéro\n' +
      '\n' +
      'Par défaut, le planning se fait par ordre alphabétique des noms. Si vous ne pouvez pas, trouvez une personne pour vous remplacer cette semaine là\n' +
      '\n' +
      'Semaine 11/09 - 17/09 : Benjamin ok\n' +
      'Semaine 18/09 - 24/09 : Nicolas ok\n' +
      'Semaine 25/09 - 01/10 : Tangui ok\n' +
      "Semaine 02/10 - 08/10 : François ok je vais pas être très dispo les 2 prochaines semaines à cause du boulot :'(. Si je peux switcher avec qqn et passer au 23/10 ça serait top \n" +
      'Semaine 09/10 - 15/10 : David\n' +
      'Semaine 16/10 - 22/10 : Suzanne\n' +
      '\n' +
      '___________________________________________________________"\n' +
      '\n' +
      'EN COURS :\n' +
      '\n' +
      '*Ce qui bouge à RC : points à mentionner dans le prochain mail de récap à Membres\n' +
```
JohnMcLear commented 4 years ago

My next step will be to do `while TOTAL_REVS--` and try to build the pad and see:

  1. Which commit broke the pad.

  2. If I can, why.

  3. If I can make it restore that revision reliably.

  4. Good idea? If a pad does start failing, automatically restore it to last working rev?
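
The backwards walk behind points 1 and 3 could be sketched roughly as follows. This is not the actual script: `tryBuild` is a hypothetical callback standing in for "build the pad at this revision and throw if it fails", and `findLastWorkingRev` is a made-up name.

```javascript
// Walk backwards from the head revision until a revision builds cleanly.
// tryBuild(rev) is a hypothetical callback: it must throw when the pad cannot
// be rebuilt at that revision and return normally otherwise.
function findLastWorkingRev(headRev, tryBuild) {
  let lastError = null;
  for (let rev = headRev; rev >= 0; rev--) {
    try {
      tryBuild(rev);
      // rev builds; if anything failed before, rev + 1 is the first broken revision
      // and lastError is the error raised while building it.
      return {rev, firstBroken: rev < headRev ? rev + 1 : null, error: lastError};
    } catch (err) {
      lastError = err;
    }
  }
  // No revision could be built at all.
  return {rev: -1, firstBroken: 0, error: lastError};
}
```

The returned `error` belongs to the revision right after the last working one, which is the place to start looking for the "why" in point 2. A linear scan is slow on tens of thousands of revisions but makes no assumption about where the damage sits.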

JohnMcLear commented 4 years ago

Todo:

JohnMcLear commented 4 years ago

Another thing to note is your table wasn't properly set (as utf8mb4). That will have caused problems.
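
One way this can bite, assuming the table was using MySQL's older three-byte `utf8` character set (the thread does not say which charset it actually had): characters outside the Basic Multilingual Plane, such as emoji, need four bytes in UTF-8 and cannot be stored in it. A quick sketch to check whether some text contains such characters; `needsFourByteUtf8` is a hypothetical helper, not an Etherpad function.

```javascript
// Returns true if the text contains a code point above U+FFFF, i.e. one that
// takes four bytes in UTF-8 and does not fit in a three-byte-per-character charset.
function needsFourByteUtf8(text) {
  for (const ch of text) { // for...of iterates by code point, not by UTF-16 unit
    if (ch.codePointAt(0) > 0xffff) return true;
  }
  return false;
}
```

Accented Latin text like the pad content above is fine either way; it is emoji and other astral characters that would be affected.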

JohnMcLear commented 4 years ago

Note commands I need for ref.

node bin/extractPadDataLastKnownWorking.js affaires-courantes outputnew4 100
select * from `store` where `key` LIKE '%outputnew3%' limit 0,103;
JohnMcLear commented 4 years ago

I fixed up buildPad.js and managed to get an exception:

Error: newline count is wrong in op +; cs:Z:1>16x2*0*1*2+15*0|2+2*0*3+6m*0|2+2*0*3+3c*0|2+2*0*3+37*0|4+1q*0*1+i*0|e+iq*0*1*2+a*0|2+2*0*4*5*6+1*0*1+k*0*3+1o*0|1+1*0*4*7*6+1*0*3+x*0|1+1*0*4*8*6+1*0|1+4*0*4*9*6+1*0|1+25*0*4*9*6+1*0|1+1c*0*4*9*6+1*0|1+q*0*4*8*6+1*0|1+r*0*4*a*6+1*0|1+3w*0*4*a*6+1*0|1+43*0*4*8*6+1*0|2+2i*0*4*5*6+1*0*1+d*0|1+1*0*4*8*6+1*0|2+1a*0*4*5*6+1*0*1+1l*0|1+1*0*4*8*6+1*0|2+12*0*4*5*6+1*0*1+z*0|1+1*0*4*8*6+1*0|2+u*0*4*5*6+1*0*1+h*0|1+1*0*4*8*6+1*0|2+19*0*4*5*6+1*0*1+q*0|1+1*0*4*8*6+1*0|2+1i*0*4*5*6+1*0*1+t*0|1+1*0*4*8*6+1*0|2+1h*0*4*5*6+1*0*1+u*0|1+1*0*4*8*6+1*0|1+1a*0*4*8*6+1*0|1+22*0*4*8*6+1*0|1+1j*0*4*8*6+1*0|1+w*0*4*8*6+1*0|2+14*0*4*5*6+1*0*1+f*0|1+1*0*4*8*6+1*0|1+15*0*4*8*6+1*0|1+1y*0*4*8*6+1*0|2+3x*0*4*5*6+1*0*1+p*0|1+1*0*4*8*6+1*0|2+27*0*4*5*6+1*0*1+i*0|1+1*0*4*8*6+1*0|1+2a*0*4*8*6+1*0|1+1c*0*4*9*6+1*0|1+u*0*4*b*6+1*0|1+o*0*4*b*6+1*0|1+e*0*4*9*6+1*0|1+w*0*4*8*6+1*0|1+k*0*4*9*6+1*0|1+x*0*4*9*6+1*0|1+1n*0*4*9*6+1*0|1+7*0*4*b*6+1*0|1+1t*0*4*b*6+1*0|1+1n*0*4*9*6+1*0|1+16*0*4*b*6+1*0|1+2p*0*4*8*6+1*0|1+2q*0*4*8*6+1*0|1+50*0*4*8*6+1*0|1+u*0*4*9*6+1*0|1+2q*0*4*9*6+1*0|1+6*0*4*b*6+1*0|1+4*0*4*9*6+1*0|1+t*0*4*b*6+1*0|1+15*0*4*b*6+1*0|1+1f*0*4*b*6+1*0|1+2i*0*4*b*6+1*0|1+16*0*4*b*6+1*0|1+14*0*4*9*6+1*0|1+2g*0*4*5*6+1*0|3+55*0*1*2+6*0|2+2*0+x*0*1+d*0|2+2*0*4*5*6+1*0*1+7*0|2+2*0*4*5*6+1*0*1+8*0|v+29z*0*3+12*0|6+j8*0*3+5l*0|1+1*0*1+f*0|1+c3*0*1*2+1u*0|4+gj*0*3+24*0|c+1m2*0*3+5l*0|j+2sj*0*3+1p*0|5+fo*0*3+4i*0|1+1*0*3+25*0|3+bj*0*3+c*0|5+c5*0*3+a*0|3+4s*0*3+a*0|3+4i*0*3+c*0|b+iw*0*1+e*0|1+1*0*3+2a*0|6+3l*0*4*5*6+1*0*1+5*0|3+5b*0+4*0*3+61*0|6+im*0+61*0*3+g*0|2+88*0*3+3e*0|12+2ao*0*4*5*6+1*0*1+a*0|5+bw*0*3+1t*0|4+24*0*1*2+a*0|2+2*0*4*5*6+1*0|1+t*0*4*5*6+1*0*1+17*0|1+1*0*4*5*6+1*0*1+1q*0|1+1*0*4*5*6+1*0*1+s*0|3+1u*0*1*2+c*0|2+2*0*4*5*6+1*0*1+w*0|1+1*0*4*8*6+1*0|1+5*0*4*8*6+1*0|1+1*0*4*5*6+1*0*1+19*0|1+1*0*4*5*6+1*0*1+r*0|1+q*0*4*8*6+1*0|1+1*0*4*5*6+1*0*1+1l*0|1+1*0*4*8*6+1*0|1+h*0*4*8*6+1*0|2+2*0*2+e*0|a+94*0*1*2+7*0|9+c4*0*4*5*6+1*0*1+m*0|1+1*0*3+7v*0|2+cr*0*4*
7*6+1*0|1+7e*0*3+r*0|m+1vj*0*3+oz*0|3+3*0*4*5*6+1*0*1+1z*0|1+1*0*3+84*0|4+39*0*4*5*6+1*0*1+11*0|1+1*0*3+1o*0|a+77*0+2*0*3+8m*0|1+1*0*3+66*0|2+2*0*4*5*6+1*0*1+14*0|1+1*0*3+3z*0|3+3*0*4*5*6+1*0*1+j*0|9+6b*0*4*5*6+1*0*1+17*0|1+1*0*3+w*0|6+f8*0*4*5*6+1*0*1+c*0|1+1*0*3+4z*0|3+3*0*4*5*6+1*0*1+3c*0|1+1*0*3+s*0|1+1*0*3+o*0|4+m*0*3+s*0|1+1*0*3+1y*0|4+12*0*4*5*6+1*0*1+s*0|1+1*0*3+2f*0|3+3*0*4*5*6+1*0*1+12*0|1+1*0*3+2e*0|c+c1*0*4*5*6+1*0*1+x*0|1+1*0*3+x*0|1+1*0*3+b*0|3+m*0*1+m*0|4+67*0*4*5*6+1*0*1+1i*0|4+w*0*4*5*6+1*0*1+r*0|1+1*0*1+1u*0|1+1*0*1*c+i*0*1+a*0|1+1*0*1*c+i*0*1+a*0|1+1*0*1+n*0|3+3*0*3+8s*0|2+2*0*4*5*6+1*0*1+2d*0|1+1*0*3+28*0|2+2*0*4*5*6+1*0*1+21*0|1+1*0*3+83*0|e+kx*0*4*5*6+1*0*1+1e*0|1+1*0*3+1c*0|1+1*0*3+3i*0|e+wt*0*4*5*6+1*0*1+2x*0|1+1*0*3+18*0|f+uh*0*4*5*6+1*0*1+1u*0|e+h6*0*4*5*6+1*0*1+1c*0|1+1*0*3+3d*0|b+h9*0*4*5*6+1*0*1+1h*0|d+gq*0*4*5*6+1*0*1+1t*0|b+go*0*4*5*6+1*0*1+g*0+1*0*1+5*0|1+1*0*3+23*0|g+p6*0*4*5*6+1*0*1+1p*0|d+dc*0*4*5*6+1*0*1+10*0|1+1*0*3+3r*0|6+3p*0+2*0*3+27*0|3+3*0*4*5*6+1*0*1+o*0|5+2q*0*4*5*6+1*0*1+20*0|b+hg*0*4*5*6+1*0*1+1p*0|4+d*0*c+4a*0|1+1*0*c+63*0|8+8p*0*c+4*0+1*0*3+65*0|2+26*0*4*5*6+1*0*1+1q*0|9+3n*0*4*5*6+1*0*1+2d*0|1+1*0*3+s*0|1+1*0*3+7d*0|d+18g*0*4*5*6+1*0*1+1f*0|b+fw*0*4*5*6+1*0*1+q*0|1+1*0*3+u*0|3+3*0*4*5*6+1*0*1+12*0|5+25*0*1+h*0|1e+32a*0*3+56*0|i+kl*0*1*2+c*0|3+4c*0*4*8*6+1*0*1+f*0|1+15*0*4*8*6+1*0|1+1i*0*4*8*6+1*0|1+1o*0*4*8*6+1*0|1+1d*0*4*8*6+1*0|1+3b*0*4*8*6+1*0|1+11*0*4*8*6+1*0|1+1k*0*4*8*6+1*0|1+20*0*4*8*6+1*0|1+1j*0*4*8*6+1*0|1+22*0*4*8*6+1*0|1+2c*0*4*8*6+1*0|1+1f*0*4*8*6+1*0|1+1f*0*4*8*6+1*0|1+1q*0*4*8*6+1*0|1+1i*0*4*8*6+1*0|1+1m*0*4*8*6+1*0|1+2d*0*4*8*6+1*0|1+1q*0*4*8*6+1*0|1+23*0*4*8*6+1*0|1+1c*0*4*8*6+1*0|1+3g*0*4*8*6+1*0|1+1q*0*4*8*6+1*0|1+20*0*4*8*6+1*0|1+1n*0*4*8*6+1*0|1+2e*0*4*8*6+1*0|1+2e*0*4*8*6+1*0|1+2u*0*4*8*6+1*0|1+27*0*4*8*6+1*0|1+3g*0*4*8*6+1*0*1+s*0|1+1h*0*4*8*6+1*0|1+1o*0*4*8*6+1*0*1+17*0|1+17*0*4*8*6+1*0|1+22*0*4*8*6+1*0*1+p*0|1+1i*0*4*8*6+1*0|1+1x*0*4*8*6+1*0|1c+3h$En cours de traitement à Regards Citoyens
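
For reference, the message means that an insert (`+`) op declares a number of newlines (the `|N` part) that does not match the newlines in the characters it takes from the text after the `$`. Below is a rough standalone check of just that rule. It is a simplified sketch written for this thread, not Etherpad's own Changeset code, and `checkInsertNewlines` is a made-up name.

```javascript
// Simplified op pattern: optional attributes (*n), optional newline count (|n),
// an opcode (+, - or =) and a character count. All numbers are base 36.
const OP_RE = /((?:\*[0-9a-z]+)*)(?:\|([0-9a-z]+))?([-+=])([0-9a-z]+)/g;

// Returns a list of insert ops whose declared newline count does not match
// the newlines in the characters they consume from the char bank.
function checkInsertNewlines(cs) {
  const header = /^Z:([0-9a-z]+)([><])([0-9a-z]+)/.exec(cs);
  if (!header) throw new Error('not a changeset');
  // The op section never contains '$', so the first '$' starts the char bank.
  const bankStart = cs.indexOf('$');
  const ops = cs.slice(header[0].length, bankStart < 0 ? cs.length : bankStart);
  const bank = bankStart < 0 ? '' : cs.slice(bankStart + 1);
  const problems = [];
  let bankPos = 0;
  let index = 0;
  for (const m of ops.matchAll(OP_RE)) {
    const [, , linesStr, opcode, charsStr] = m;
    if (opcode === '+') {
      const chars = parseInt(charsStr, 36);
      const declared = linesStr ? parseInt(linesStr, 36) : 0;
      const inserted = bank.slice(bankPos, bankPos + chars);
      const actual = inserted.split('\n').length - 1;
      if (declared !== actual) problems.push({index, declared, actual});
      bankPos += chars; // only inserts consume characters from the bank
    }
    index++;
  }
  return problems;
}
```

The changeset as pasted above appears to be cut off after the first line of its char bank, so this would only give a meaningful answer on the full string from the database.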
JohnMcLear commented 4 years ago

fwiw there is a rebuildPad.js but that's failing for me on your pad. I didn't test it against my intentionally broken pad.

JohnMcLear commented 4 years ago

As per #3991, I think a restoration/rebuild needs these values, otherwise changeset ops won't work. Once the merge is complete I can continue work on my script/branch, but it looks like any pad with revs (@100) edited before the merge is complete and in place won't be able to be rebuilt, rendering both existing methods pointless.

JohnMcLear commented 4 years ago

Now that #3991 is merged, we have an awesome tool for recoveries, so I can go ahead and close this. If we get another report, we should be able to recover upon request, as the tools are now available to debug, diagnose, and recover.