ScientificPublishing / SciPub


Markdown Editor #1

Open artydont opened 1 year ago

artydont commented 1 year ago

Hi to @PetrogradXXII, @Ted1307, and @DavidMc1948

I don't know whether Ted and David will see this as it is the first issue in the only repo of ScientificPublishing.

Petro mentioned that he will be busy for a few days and will then follow up to check that everybody is on board, automatically gets emails from comments on Issues in this repo, and knows how to turn off emails for topics they do not want to follow:

https://github.com/ScientificPublishing/SciPub

Then it will not be necessary to mention (using @) each username in the group, as the whole group (in effect "@all") will be "watching" the repo.

Meanwhile I need to correct two bits of bad advice I gave in a previous message to all of the above, in an issue in my own repo:

https://github.com/artydont/sciphil/issues/1#issuecomment-1711712376

  1. Petro has confirmed that GitHub Desktop should be used by all, so I will start using it.
  2. We need to recommend a project-standard Markdown editor that new users on any platform can easily start using. I have now tried out MarkText and am not happy with it because it does not have "Live Preview" in a separate window pane. I am used to a very simple Markdown editor, Apostrophe, which shows two panes side by side: the Markdown text you type on the left and, simultaneously, how it will look when rendered to a web page or print on the right. Unfortunately Apostrophe is not available on other platforms and I don't know what is.
  3. For more complex features (that will eventually be required for OCR correction, such as dictionary spell checks) I have checked out Sublime Text, but changing its settings is too complex.
  4. Visual Studio Code is offered by GitHub at the same time as it suggests GitHub Desktop, and I now think it is the best choice when more than the minimum is needed.
  5. Just getting familiar with Markdown can simply be done by editing files in your own repo through the browser. But eventually it will be necessary to use Visual Studio Code. It DOES provide far MORE than needed, but this includes everything useful for simple Markdown editing.
  6. What is needed is a clear "tutorial" on how to quickly set up the "document writer profile" before actually doing anything useful, without having to understand either of those links (which assume the reader is a software developer rather than someone just using Visual Studio Code as a word processor).
  7. Meanwhile I did get that far, and others can phone me to be talked through that initial hurdle.
  8. I am pretty sure Ted will end up needing to use it for OCR correction and will not want to have to work out that initial setup unassisted.
  9. BTW the far more sophisticated Tesseract software available for OCR correction has many add-on tools including:

Post Correction Tool is interactive post-correction of OCRed documents. Using the information obtained by the Text and Error Profiler the whole correction process is adaptive to the document being processed. In this way, usually huge numbers of systematic errors can be corrected with just a few keystrokes.

  10. Tesseract looks far too complex for us right now, so I am not checking it out and am hoping that we soon get somebody on board who is better able to set it up for us.

  11. I intend to focus on learning to use Zotero properly, as it will be central to most future librarian-type work that Ted can help with. I am planning to wade, or at least skim, through all the extensive Zotero documentation, including third-party docs.

  12. So far I did find "How to create a workflow in Zotero"; I will try to write a separate note about that and other material from the same author eventually.
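
As an aside on the "systematic errors" idea in the Post Correction Tool description quoted above, here is a minimal Python sketch of what batch correction of recurring OCR misreadings could look like. It is not how the Post Correction Tool or Tesseract works; the function name and the sample corrections table are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def apply_corrections(text, corrections):
    """Replace whole-word OCR misreadings using a hand-built table.

    `corrections` maps a misread word to its intended form. The table is a
    hypothetical example; a real one would be built up while proofreading.
    Returns the corrected text and a count of how often each fix was applied.
    """
    counts = Counter()

    def fix(match):
        word = match.group(0)
        replacement = corrections.get(word)
        if replacement is None:
            return word  # not a known misreading, leave as-is
        counts[word] += 1
        return replacement

    # \w+ matches runs of letters and digits, so punctuation is left untouched.
    fixed = re.sub(r"\w+", fix, text)
    return fixed, counts


if __name__ == "__main__":
    sample = "Tbe rnodern library holds tbe archive. Tbe end."
    table = {"Tbe": "The", "tbe": "the", "rnodern": "modern"}
    fixed, counts = apply_corrections(sample, table)
    print(fixed)
    print(dict(counts))
```

The counts give a rough idea of which misreadings are frequent enough to be worth adding to the table, which is the sense in which one entry can fix many errors at once.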

artydont commented 1 year ago

https://www.digitisation.eu/

Has sections that need to be explored thoroughly:

  1. Resources for translation work
  2. Knowledge including Digitisation Tools workflow
  3. Community has a developer hub including the OCR post correction tool and 22 other software repos likely to be directly relevant to Scientific Publishing.

That is probably what needs to be explored first, along with the techniques used by Wikimedia, archive.org, gutenberg.org, Hathi, Google Books, marxists.org etc.

GitHub itself directly supports various forms of documentation publishing and hosts numerous projects that can be found under various tags, but the above is where I will start exploring for "know how" first.

artydont commented 1 year ago

Via browsing the above I now have a lot more information about OCR proofreading etc. than I want to know or summarize.

May attempt to summarize it later.

For now my best guess is that, for the short, medium and perhaps long term, the best approach would be to find a librarian who has access to equipment used for high-quality scanning and subsequent OCR, and ask them to process the file that has poor OCR before we start proofreading.

The mass digitization projects such as Google Books are set up for fully automated mass scanning and have the tools to avoid the need for manual OCR proofreading, or at least minimize it. Reference libraries participate in those projects.

The open source tools may work with software we can get hold of, but I am not optimistic about getting them operating until the long term, when there is a much bigger project with a regular flow of OCR work.

They could remove most of the errors automagically but are complex to set up and might require skill to operate.

The other possibility is that it would be better to produce a higher quality scan from a hard copy of the book, but I doubt it.

The same librarian who can process it should be able to quickly determine the scan resolution that was used and suggest whether a rescan would help (but request that in addition to, not instead of, full OCR processing of the existing file).
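
A rough way to estimate the scan resolution ourselves, sketched in Python: divide the page image's width in pixels by the physical page width in inches. The function names, the example numbers and the 300 dpi threshold (a commonly quoted rule of thumb for OCR, not a figure from this thread) are all assumptions for illustration.

```python
def estimate_dpi(pixel_width, page_width_mm):
    """Estimate scan resolution (dots per inch) from image and page width."""
    if pixel_width <= 0 or page_width_mm <= 0:
        raise ValueError("widths must be positive")
    page_width_inches = page_width_mm / 25.4  # 25.4 mm per inch
    return pixel_width / page_width_inches


def rescan_suggested(dpi, minimum_dpi=300):
    """Return True when the estimate falls below the assumed threshold."""
    return dpi < minimum_dpi


if __name__ == "__main__":
    # Hypothetical example: a page 152.4 mm (6 inches) wide,
    # scanned as an image 1200 pixels across.
    dpi = estimate_dpi(1200, 152.4)
    print(round(dpi), rescan_suggested(dpi))
```

This only gives a first indication; the librarian's judgement on whether a rescan would actually help should take priority.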

If the result is in the same form that the library would send to archive.org, Hathi or Google Books, that could be sufficient to do further work on it without wasted effort.

Various Australian libraries would be archiving their entire collections with Hathi as backup storage in case of disaster.

Ted1307 commented 1 year ago

Your average generalist librarian wouldn't be up to this. The only librarians I knew of who did this sort of work were in Special Collections, where they scanned, catalogued and indexed documents of historical significance. They all had a background in cataloguing. I imagine there would be archivists in state and federal archives doing the same thing.


(Sent by email in reply to artydont's message of Monday, 11 September 2023, 2:53 PM.)

artydont commented 11 months ago

Hi Tom

Ted1307 commented 11 months ago

Hi Arthur, It's taken me a while but I've finally found out where the messages are hidden. Ted

DavidMc1948 commented 11 months ago

All good. Receiving emails. Subscribed to notifications. David

(Sent by email in reply to Ted1307's message of Fri, Sep 22, 2023, 3:37 PM.)

artydont commented 11 months ago

David recommended Joplin as both Markdown editor and notes organizer:

Looks really great to me. Works on all platforms and provides the simple minimum required of live preview while editing.

Should be easy to combine with Zotero workflows and GitHub.

I have not read the docs but intend to start using it instead of the previous recommendations and to study the docs while I do.

https://joplinapp.org/

https://joplinapp.org/help/

https://joplinapp.org/clipper/

https://github.com/joplin/plugins

There will be better advice online than the random example below of combined use:

https://klemet.github.io/Workshop-Organization-EN/10-example.html

PS I just posted a new Issue 3 on Extracting Citations.

Please verify that you received it via email before clicking the above link. Let me know if you did not, as it means you are only subscribed to this Issue and not to new threads.