ScientificPublishing / SciPub


Extracting Citations #3

Open artydont opened 1 year ago

artydont commented 1 year ago

I just googled the title above and tried out the recommendation to use:

https://anystyle.io/

made in:

https://update.lib.berkeley.edu/2018/02/07/extracting-references-from-an-already-created-bibliography/

and in linked Zotero doc:

https://www.zotero.org/support/kb/importing_formatted_bibliographies

Produced the attached BibTeX .bib file from a sample of the first 8 entries cut and pasted from the Maksakovsky PDF.

Suffix .txt added to filename to attach here.

anystyle.bib.txt

Suggest @DavidMc1948 should try it out on the whole of both bibliographies in Maksakovsky. (Note I modified some entries, which it may now handle automatically, for [no date] and [Moscow no publisher].)

Also suggest @Ted1307 should do as much as feasible from large Gollobin bibliography in small batches to upload here for checking there is no problem.

I used a small batch and had to fix the line breaks so it would recognize there were 8 items.

PS See also new comment re Joplin I am about to add to Markdown Editor Issue

artydont commented 1 year ago

Gollobin bibliography is a good initial choice as it will certainly be useful for many things and will later need to be linked in both directions to citations in Gollobin Notes and to names and topics in Index.

While waiting for a Wordprocessor OCR file that includes full bibliography try out doing next small batch from Maksakovsky. First small batch above will be renamed "mak.bib.01.txt" and next batch should be in same format with suffixes bib.02.txt and fmt.02.txt for the two output files. The small numbered batches can easily be combined later with above name scheme. Two digits used to ensure that large bibliography can have more than 9 batches.
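The naming scheme above could be sketched roughly as follows. This is only an illustration, not an agreed tool: the function names `batch_name` and `combine_batches` are made up here, and the sketch assumes all batch files for one work sit in a single folder. Zero-padded two-digit numbers mean a plain alphabetical sort already gives batch order.

```python
import re
from pathlib import Path


def batch_name(prefix: str, kind: str, number: int) -> str:
    """Build a name such as 'mak.bib.01.txt' (hypothetical helper)."""
    if not 1 <= number <= 99:
        raise ValueError("two digits allow batches 1 to 99")
    return f"{prefix}.{kind}.{number:02d}.txt"


def combine_batches(folder: Path, prefix: str, kind: str) -> str:
    """Concatenate all batches of one kind ('bib' or 'fmt') in batch order."""
    pattern = re.compile(
        rf"^{re.escape(prefix)}\.{re.escape(kind)}\.(\d{{2}})\.txt$"
    )
    # Zero-padded numbers sort correctly as plain strings.
    found = sorted(p for p in folder.iterdir() if pattern.match(p.name))
    return "\n\n".join(p.read_text(encoding="utf-8").strip() for p in found)
```

For example, `batch_name("mak", "fmt", 2)` gives `mak.fmt.02.txt`, and `combine_batches(Path("."), "mak", "bib")` would join `mak.bib.01.txt`, `mak.bib.02.txt` and so on into one text.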

Starting point for all later processing of anything is extracting the references into a standard .bib file plus the additional details format file ("fmt") offered by https://anystyle.io/ as mentioned above, which has full explanations to learn by doing. Eg start with the next batch after the above from Maksakovsky for practice while waiting for Gollobin.

Proper tutorial for anybody to do this should be written after experience with learning how to do it here.

Meanwhile first steps are:

  1. Cut and paste the full bibliography from best available OCR of the work from a wordprocessor OCR output that preserves the boldface and italic and underline formatting into a text editor that also preserves such formatting (eg a Zotero standalone note - the "Aa" button at top gives a "Format Text" menu). I only used plain text with no formatting in above sample but I expect the parsing benefits from being able to recognize that a title is in italics etc.
  2. Ensure each item is a single paragraph with no line breaks and one blank line between items. Errors can be fixed while online but best to do this in a wordprocessor or text editor that can easily wrap line breaks.
  3. Perhaps check file for OCR errors by comparison with original scan and with index of names (since names are likely to not be in dictionary used by OCR software and therefore more likely to have OCR errors).
  4. Follow instructions at anystyle.io - it works well. Best to do all the extraction a small batch at a time and upload the initial .bib.nn.txt and .fmt.nn.txt files here for checking. We want to capture as much detail as possible in case it becomes useful later (eg separation of first and last names of each author, editor and translator when it is available from the automated parsing).
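Step 2 and the "small batch at a time" advice in step 4 could be done with a short script instead of by hand. The following is a minimal sketch only, assuming the pasted text already has at least one blank line between items and that the only problem is stray line breaks inside an item. The names `normalize_entries` and `to_batches` and the default batch size of 8 are chosen for illustration.

```python
import re


def normalize_entries(raw: str) -> list[str]:
    """Return one string per bibliography item, with internal line breaks removed."""
    # Assumption: items are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", raw.replace("\r\n", "\n").strip())
    entries = []
    for block in blocks:
        # Collapse wrapped lines and repeated spaces into single spaces.
        entry = " ".join(block.split())
        if entry:
            entries.append(entry)
    return entries


def to_batches(entries: list[str], size: int = 8) -> list[str]:
    """Group items into batches, one blank line between items in each batch."""
    return [
        "\n\n".join(entries[i:i + size])
        for i in range(0, len(entries), size)
    ]
```

Each string returned by `to_batches` can then be pasted into anystyle.io as one batch. Checking `len(normalize_entries(raw))` against the expected number of items is a quick way to catch items that were merged or split by mistake.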

Can simply attach as comments added to this Issue like above anystyle.bib.txt that will get deleted after we work out a system for proper processing and storing.

Both the person uploading and anybody else can check against originals by using wordprocessor to print the formatted bibliography in the same "style" as was used in original and comparing with that section of original. Also note the style to include in provenance details for finalized result of all batches.

Remaining errors can be corrected after importing batch to a Zotero subcollection or other citation editor. This is better than editing the .bib or detailed format file directly in a normal text editor.

Later it will be used to add missing fields such as LCCN, ISBN, DOI, hdl, md5 etc by actually tracking down each item to a public catalog entry and eventually to a URN that can be used to directly access the item for automated retrieval.

All later steps after initial extraction will have to be worked out as we proceed so that they become simple tutorials plus automated software for others to use.

But initial extraction using anystyle.io online is certainly the first step.

Ted1307 commented 1 year ago

It seems that I did receive this but I didn't see it at the time. I only found it just now by searching for the title "Extracting citations" in Outlook. I don't know if it was in my Inbox (and I missed it) or went to Junk mail. Craig


From: artydont. Sent: Tuesday, 24 October 2023 12:11 AM. To: ScientificPublishing/SciPub. Cc: Ted1307. Subject: Re: [ScientificPublishing/SciPub] Extracting Citations (Issue #3)