ScientificPublishing / SciPub


Proof Reading for book in #11 #12

Open artydont opened 1 year ago

artydont commented 1 year ago

(Sigh) tl;dr version

While reading the work linked in #11, please note the page number of any typo you happen to notice, and also record which range of pages you actually checked.

Don't start editing anything until the page numbers have been fixed in the files I will upload to the repo with the prefix "tpbm".

Original rant below.


Everybody intending to work on Issue #11 should now be reading project "tpbm" from either the original .pdf or the latest .docx links in #11. Don't start any editing at all until I upload files to a repo, using that prefix.

In particular, we will need to fix the page breaks, and possibly the headers and footers, because the pagination used by the publisher that produced the good OCR does not correspond to the scan of the original printed publication:

https://archive.org/details/ShirokovM.ATextbookOfMarxistPhilosophy

I will figure out a rapid way to fix the page breaks, which also affect the paragraph breaks.

While reading, please do note page numbers of any obvious potential typos. There are very few but I did notice some while reading the paperback and did not make notes. But don't fix them, just list them to merge with lists from others later.

I just noticed one on p204.

Accurate original pagination is necessary for linking annotations to their targets. The alternative is special tables linking multiple paginations, which will be necessary anyway: the publisher we are working from did not preserve the original pagination, so there will already be references to page numbers in the paperback version that differ from older references to page numbers in the 1930s version. The eventual output may have entirely different pagination, but it will retain anchor points precisely at the original page breaks for use by annotations, and it will let readers see and go to the original page numbers.

As can be seen on the archive.org page linked above, the archive has maintained full tracking of the "provenance" and documented that the work is available under https://creativecommons.org/publicdomain/mark/1.0/

We don't add further problems but only remove them. So we will have to implement careful documentation of each step taken from input to output, including separate steps for identifying typos and for verifying that the corrections do now correspond to the original. The actual correcting and verifying could be done in batches and should be done by separate individuals. But no actual changes should be made until we have fixed the pagination.

The error itself and its correction need not be recorded in the lists of typos, because typo corrections, like all other changes we make to a file, will be captured by git when each batch of changes is committed.

More precisely, the typo I noted is at 204.4.15 (page 204, paragraph 4, word 15). There is probably no need at this stage to provide more than the page number, as OCR typos are pretty obvious and may not even need to be verified against the original scan, let alone the original printed copy.

But we should also keep track of "provenance" in detail for each step we take. So that typo will end up being compared with the original scan at p251 pdf 232 of 283:

As a result of our efforts, the dash and black square between "proletarians" (word 15) and the next word, "people", in paragraph 4 on page 204 of our input version will be replaced by a long dash after word 15 in paragraph 1 of p251 in our output version. The output will have the same pdf page numbering and visible page numbers as the original scan of the 1930s printed copy, which is still held in many libraries and has many references to it using the original page numbers.

The list of typos in the provenance record would initially record 204.4.15 and subsequently, after both correction and verification, add 251.1.15. If you at least add the paragraph number when recording the page number of the error, it will be quicker both for somebody else to verify the eventual correction and to produce the provenance record.
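A minimal sketch of how such a record could be kept, assuming the page.paragraph.word notation used above. The names `Locator` and `TypoRecord` are hypothetical and not part of any existing tool in this repo; paragraph and word are optional because a bare page number is acceptable at this stage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Locator:
    """A page[.paragraph[.word]] position, e.g. 204.4.15 (hypothetical helper)."""
    page: int
    para: Optional[int] = None  # paragraph on the page; 0 means a continuation
    word: Optional[int] = None  # word within the paragraph

    @classmethod
    def parse(cls, text: str) -> "Locator":
        parts = [int(p) for p in text.strip().split(".")]
        if not 1 <= len(parts) <= 3:
            raise ValueError(f"expected page[.para[.word]], got {text!r}")
        return cls(*parts)

    def __str__(self) -> str:
        fields = (self.page, self.para, self.word)
        return ".".join(str(f) for f in fields if f is not None)


@dataclass
class TypoRecord:
    """One entry in the typo list of the provenance record (assumed layout)."""
    source: Locator                   # position in the input version
    target: Optional[Locator] = None  # position in the output, set after verification

    def mark_verified(self, target: Locator) -> None:
        self.target = target


record = TypoRecord(Locator.parse("204.4.15"))
record.mark_verified(Locator.parse("251.1.15"))
print(record.source, "->", record.target)
```

The record deliberately holds only positions, not the text of the error or the fix, since git captures the actual change when the batch is committed.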

Eventually the process of page renumbering may also produce a table and tool that enables a verifier (and corrector) to go directly from a suspected typo to the original scan for close inspection instead of having to find it. But that probably won't be available until annotations actually need it to find anchor points in publications that are less accessible. We will have to do without it until then, but fortunately there are very few OCR typos in this.
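As a rough illustration of what such a table might look like, here is a sketch that maps a position in the input version to the original pagination. The name `PARA_MAP` and the function `to_original` are hypothetical. The single entry is the example from this issue; a real table would be generated by the page renumbering step.

```python
from typing import Optional, Tuple

# Hypothetical lookup: (input page, input paragraph) -> (original page, original paragraph).
# Only the one pair mentioned in this issue is filled in.
PARA_MAP = {
    (204, 4): (251, 1),
}


def to_original(page: int, para: int, word: int) -> Optional[Tuple[int, int, int]]:
    """Translate an input position to the original pagination, or None if unmapped.

    Assumes the paragraph starts at the same word in both versions, as in the
    204.4.15 -> 251.1.15 example. A paragraph split differently across a page
    break would also need a word offset.
    """
    hit = PARA_MAP.get((page, para))
    if hit is None:
        return None
    orig_page, orig_para = hit
    return (orig_page, orig_para, word)


print(to_original(204, 4, 15))
```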

I am numbering paragraphs starting from 1 if the first line on a page starts a new paragraph, and from 0 if it is a continuation of a paragraph started at the end of the previous page.

I am counting words hyphenated at line breaks as one word, not two. I am not using line numbers at all.
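The two counting rules above can be sketched as follows. This is only an illustration with hypothetical function names, and it assumes each paragraph is available as a list of its printed lines. It also assumes that any line ending in a hyphen is a word split across the line break.

```python
from typing import Dict, List


def count_words(paragraph_lines: List[str]) -> int:
    """Count words in a paragraph; a word hyphenated at a line break counts once."""
    text = ""
    for line in paragraph_lines:
        line = line.strip()
        if text.endswith("-"):
            # Join the two halves of a word split across the line break.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return len(text.split())


def number_paragraphs(paragraphs: List[str], first_is_continuation: bool) -> Dict[int, str]:
    """Number the paragraphs on a page from 0 if the first one continues
    a paragraph from the previous page, otherwise from 1."""
    start = 0 if first_is_continuation else 1
    return {start + i: p for i, p in enumerate(paragraphs)}


print(count_words(["the prole-", "tarians of all", "countries"]))
print(number_paragraphs(["...end of carried-over text.", "A new paragraph."], True))
```

Line numbers are not used anywhere, which matches the convention above: only the page, the paragraph on that page, and the word within that paragraph identify a position.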

I expect we will not be able to maintain the same line breaks easily and will leave that to whoever subsequently wants to add further enhancement. This may also complicate page breaks since they are also line breaks. Suggestions welcome.

But by maintaining a full provenance record we make it easy for anyone to improve things where we left off and without needing to duplicate or even verify what we have already done.

I am hopeful that just fixing page breaks will be sufficient to correlate annotations.

But don't start other work until we HAVE fixed page breaks. Just read! It is a very worthwhile book which will require renewal and improvement after careful study.