artydont opened this issue 1 year ago.
Has sections that need to be explored thoroughly:
That is probably what needs to be explored first, along with the techniques used by Wikimedia, archive.org, gutenberg.org, HathiTrust, Google Books, marxists.org, etc.
GitHub itself directly supports various forms of documentation publishing and hosts numerous projects that can be found under various tags, but the above is where I will start exploring for "know-how" first.
Via browsing the above I have a lot more information on OCR proofreading etc. than I want to know or summarize.
I may attempt to summarize it later.
For now my best guess is that for the short, medium and perhaps long term, the best approach would be to find a librarian who has access to equipment used for high-quality scanning and subsequent OCR, and ask them to process the file that has poor OCR before starting proofreading.
Mass digitization projects such as Google Books are set up for fully automated mass scanning and have the tools to avoid, or at least minimize, the need for manual OCR proofreading. Reference libraries participate in those projects.
The open source tools may work with software we can get hold of, but I am not optimistic about getting them operating until the long term, when there is a much bigger project with a regular flow of OCR work.
They could remove most of the errors automagically, but they are complex to set up and might require skill to operate.
The other possibility is that it would be better to produce a higher-quality scan from a hardcopy of the book, but I doubt it.
The same librarian who can process it should be able to quickly determine the scan resolution that was used and suggest whether a rescan would help (but request that in addition to, not instead of, full OCR processing of the existing file).
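As an aside, if we want a rough estimate of the scan resolution ourselves: poppler's `pdfimages -list` reports each page image's pixel dimensions (and ppi), and DPI is just pixels divided by the physical page size in inches. A toy calculation in Python, with made-up numbers:

```python
def estimate_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Estimate scan DPI from a page image's pixel dimensions and the
    physical page size in inches (DPI = pixels / inches)."""
    return (round(pixel_width / page_width_in),
            round(pixel_height / page_height_in))

# Hypothetical example: a 2550 x 3300 px scan of a US Letter (8.5 x 11 in) page
print(estimate_dpi(2550, 3300, 8.5, 11))  # -> (300, 300)
```

Anything much under 300 DPI is generally considered marginal for OCR, which is one quick way a librarian (or we) could judge whether a rescan is worth asking for.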
If the result is in the same form that a library would send to archive.org, HathiTrust or Google Books, that could be sufficient to do further work on it without wasted effort.
Various Australian libraries would be archiving their entire collections with Hathi as backup storage in case of disaster.
Your average generalist librarian wouldn't be up to this. The only librarians I knew of who did this sort of work were in Special Collections, where they scanned, catalogued and indexed documents of historical significance. They all had a background in cataloguing. I imagine there would be archivists in state and federal archives doing the same thing.
From: artydont, Monday 11 September 2023 2:53 PM, Subject: Re: [ScientificPublishing/SciPub] Markdown Editor (Issue #1)
Hi Tom
Hi Arthur, It's taken me a while but I've finally found out where the messages are hidden. Ted
All good. Receiving emails. Subscribed to notifications. David
David recommended Joplin as both Markdown editor and notes organizer:
It looks really great to me. It works on all platforms and provides the simple minimum required: live preview while editing.
It should be easy to combine with Zotero workflows and GitHub.
I have not read the docs yet, but I intend to start using it instead of my previous recommendations and to study the docs while I do.
https://joplinapp.org/clipper/
https://github.com/joplin/plugins
There will be better advice online than this random example of combined use:
https://klemet.github.io/Workshop-Organization-EN/10-example.html
PS I just posted a new Issue 3 on Extracting Citations
Please verify that you received it via email before clicking the above link. Let me know if you did not, as that would mean you are only subscribed to this Issue and not to new threads.
Hi to @PetrogradXXII, @Ted1307, and @DavidMc1948
I don't know whether Ted and David will see this as it is the first issue in the only repo of ScientificPublishing.
Petro mentioned that he will be busy for a few days. He will then follow up on whether everybody is on board, automatically gets emails from comments on Issues in this repo, and knows how to turn off topics they do not want further emails about:
https://github.com/ScientificPublishing/SciPub
Then it will not be necessary to mention (using @) the usernames of the group as a group for "@all" will be "watching".
Meanwhile I need to correct two bits of bad advice I gave in a previous message to everyone above, from an issue in my own repo.
https://github.com/artydont/sciphil/issues/1#issuecomment-1711712376
The Post Correction Tool provides interactive post-correction of OCRed documents. Using the information obtained by the Text and Error Profiler, the whole correction process adapts to the document being processed. In this way, huge numbers of systematic errors can usually be corrected with just a few keystrokes.
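To make the idea concrete, here is a toy sketch (this is not the Post Correction Tool itself, and the confusion pairs and lexicon below are made up): profile which systematic confusion patterns explain out-of-lexicon tokens, then apply those corrections in bulk, accepting a fix only when it produces a known word.

```python
from collections import Counter

# Hypothetical systematic OCR confusions (error pattern -> correction)
CONFUSIONS = {"vv": "w", "rn": "m", "l1": "ll"}

def profile_errors(text, lexicon):
    """Count confusion patterns whose reversal turns an
    out-of-lexicon token into a lexicon word."""
    counts = Counter()
    for word in text.split():
        if word.lower() in lexicon:
            continue
        for err, fix in CONFUSIONS.items():
            if err in word and word.replace(err, fix).lower() in lexicon:
                counts[(err, fix)] += 1
    return counts

def correct(text, lexicon):
    """Apply a confusion fix only where it yields a lexicon word."""
    out = []
    for word in text.split():
        if word.lower() not in lexicon:
            for err, fix in CONFUSIONS.items():
                candidate = word.replace(err, fix)
                if err in word and candidate.lower() in lexicon:
                    word = candidate
                    break
        out.append(word)
    return " ".join(out)

lexicon = {"the", "word", "was", "scanned", "poorly"}
print(correct("the vvord was scanned poorly", lexicon))  # -> the word was scanned poorly
```

The point of the profiling step is the "few keystrokes" claim: once a pattern like "vv" for "w" is confirmed once, every occurrence in the document can be fixed in one pass.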
Tesseract looks far too complex for us right now, so I am not checking it out; I hope we soon get somebody on board who is better able to set it up for us.
I intend to focus on learning to use Zotero properly, as it will be central to most future librarian-type work that Ted can help with. I am planning to wade, or at least skim, through all the extensive Zotero documentation, including third-party docs.
So far I found "How to create a workflow in Zotero" and will try to write a separate note about that and other material from the same author eventually.