Various attempts to do the right thing have resulted in a mix of character encodings both among and within wiki pages. We have scattered resources for resolving many, if not most, of these issues, which we will catalog here. We will lump in a few other poorly defined losses which might share solutions. This is probably the biggest blocker for #2 and the restoration implied there.
[ ] filesystem errors, fsck failure and raid resync
[x] robot edit wars, weeks of conflict before going read-only
[ ] tar file omissions, tar -t lists 33 more files than tar -x extracts
[x] ruby encoding errors, rescue skips about 1000 files
[ ] unwanted encodings, such as �, =XX, and &XX; in pages
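The unwanted encodings in the last item can be counted mechanically. A minimal sketch in Ruby, assuming pages are available as strings; the helper name and the exact patterns are illustrative, not the project's actual tooling:

```ruby
# Count common encoding debris in a page body:
#   \uFFFD  - the Unicode replacement character (renders as a box/question mark)
#   =XX     - quoted-printable escapes that leaked into plain text
#   &XX;    - HTML character entities left unexpanded
REPLACEMENT_CHAR = /\uFFFD/
QUOTED_PRINTABLE = /=[0-9A-F]{2}/
HTML_ENTITY      = /&#?\w+;/

def suspicious_encodings(text)
  {
    replacement:      text.scan(REPLACEMENT_CHAR).size,
    quoted_printable: text.scan(QUOTED_PRINTABLE).size,
    html_entity:      text.scan(HTML_ENTITY).size
  }
end
```

Running this over every page would give a rough triage list: pages scoring zero are probably clean, the rest need one of the remediations below.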
To this list I will add possible remediations and check them off as they are fully applied. Suggestions are welcome. Don't be concerned if I remove comments once they have been understood and incorporated into this list. I will also remove remediations that don't work out.
[ ] consult filesystem level backups for missing files
[x] consult application level history for missing or abused pages
[x] convert or ignore bad utf-8 characters using ruby's encoding mechanisms
[ ] consult textfiles and other scraped markup archives
[ ] consult archive.org or other scraped html archives
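The checked-off convert-or-ignore remediation can be sketched with Ruby's built-in encoding mechanisms. `force_encoding` and `scrub` are real String methods; the policy of dropping invalid byte sequences outright (rather than replacing them with a marker) is one choice among several, and the helper name is mine:

```ruby
# Reinterpret raw bytes as UTF-8 and remove any byte sequences that are
# not valid UTF-8. Pass a replacement string (e.g. "\uFFFD") instead of
# "" to mark, rather than silently drop, the bad spots.
def clean_utf8(raw, replacement: "")
  raw.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end
```

For pages that are actually well-formed Latin-1 rather than corrupt UTF-8, transcoding with `raw.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)` preserves the accented characters instead of discarding them, so detecting which case applies matters before scrubbing.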
Finally, I will assemble a list of suspicious pages, each illustrating some malfunction, to serve as test cases for improved algorithms and workflows.