gbhl / bhl-segment-definition

Track the definition of articles and chapters within items in the Biodiversity Heritage Library (BHL) collection.

The University of Kansas science bulletin #36

Open swlny opened 4 months ago

swlny commented 4 months ago
| Key | Value |
| --- | --- |
| Title | The University of Kansas science bulletin |
| BHL Title ID | 3179 |
| ISSN | 0022-8850 |
| Segmentation | Many but not all articles defined in BHL. Some articles have old BHL-acquired DOIs. |
| Example DOI | 10.5962/bhl.part.24549 |
| WikiData | Q21385501 |

emnybg commented 2 months ago

I will do this title.

rdmpage commented 2 months ago

@emnybg I did a quick test of my code that uses ChatGPT to try and locate articles and it seemed to do a reasonable job on the one item I tried https://www.biodiversitylibrary.org/item/21977. The RIS output is below (obvs. some tweaks required). Let me know if this is useful, because I could probably run the whole journal through this tool to get a starting point for further cleaning, etc.

TY  - JOUR
TI  - ANTIGENIC AND METABOLIC STUDIES OF BACILLUS TYPHOSUS.*
AU  - Cornelia M. Downs
JO  - Science Bulletin
VL  - XVI
IS  - 1
PY  - 1926-03-01
SN  - 0022-8850
UR  - https://biodiversitylibrary.org/page/2991163
N2  - 2991163,2991164,2991165,2991166,2991167,2991168,2991169,2991170,2991171,2991172,2991173,2991174,2991175,2991176,2991177,2991178,2991179,2991180,2991181,2991182,2991183,2991184,2991185,2991186,2991187,2991188,2991189,2991190,2991191,2991192,2991193,2991194,2991195,2991196,2991197,2991198,2991199,2991200,2991201,2991202,2991203,2991204,2991205,2991206,2991207,2991208,2991209,2991210,2991211,2991212,2991213,2991214,2991215,2991216,2991217,2991218,2991219,2991220,2991221,2991222,2991223,2991224,2991225,2991226,2991227,2991228,2991229,2991230,2991231,2991232,2991233,2991234,2991235,2991236,2991237,2991238,2991239,2991240,2991241,2991242,2991243,2991244,2991245,2991246,2991247
SP  - 5
EP  - 90
ER  - 

TY  - JOUR
TI  - STUDIES ON BACILLUS PYOCYANEUS
AU  - Noble P. Sherwood
AU  - T. L. Johnson
AU  - Ida Radotincky
JO  - The University of Kansas Science Bulletin
VL  - XVI
IS  - 2
PY  - 1926-03-01
SN  - 0022-8850
UR  - https://biodiversitylibrary.org/page/2991249
N2  - 2991249,2991250,2991251,2991252,2991253,2991254,2991255,2991256,2991257
SP  - 91
EP  - 100
ER  - 

TY  - JOUR
TI  - THE GENUS ERYTHRONEURA NORTH OF MEXICO
AU  - William Robinson
JO  - THE UNIVERSITY OF KANSAS SCIENCE BULLETIN
VL  - XVI
IS  - 3
PY  - 1926-03-01
SN  - 0022-8850
UR  - https://biodiversitylibrary.org/page/2991259
N2  - 2991259,2991260,2991261,2991262,2991263,2991264,2991265,2991266,2991267,2991268,2991269,2991270,2991271,2991272,2991273,2991274,2991275,2991276,2991277,2991278,2991279,2991280,2991281,2991282,2991283,2991284,2991285,2991286,2991287,2991288,2991289,2991290,2991291,2991292,2991293,2991294,2991295,2991296,2991297,2991298,2991299,2991300,2991301,2991302,2991303,2991304,2991305,2991306,2991307,2991308,2991309,2991310,2991311,2991312
SP  - 101
EP  - 156
ER  - 

TY  - JOUR
TI  - STUDIES ON THE EGGS OF SOME REDUVIIDÆ (HETEROPTERA)
AU  - P. A. Readio
JO  - THE UNIVERSITY OF KANSAS SCIENCE BULLETIN
VL  - XVI
IS  - 4
PY  - 1926-03-01
SN  - 0022-8850
UR  - https://biodiversitylibrary.org/page/2991315
N2  - 2991315,2991316,2991317,2991318,2991319,2991320,2991321,2991322,2991323,2991324,2991325,2991326,2991327,2991328,2991329,2991330,2991331,2991332,2991333,2991334,2991335,2991336
SP  - 157
EP  - 180
ER  - 

TY  - JOUR
TI  - THE NATURE, ORIGIN AND SIGNIFICANCE OF PIGMENT IN EMBRYOS OF AMBLYSTOMA
AU  - Hervey S. Faris
JO  - The University of Kansas Science Bulletin
VL  - XVI
IS  - 5
PY  - 1926-03-01
SN  - 0022-8850
UR  - https://biodiversitylibrary.org/page/2991339
N2  - 2991339,2991340,2991341,2991342,2991343,2991344,2991345,2991346,2991347,2991348,2991349,2991350,2991351,2991352,2991353,2991354,2991355,2991356,2991357,2991358,2991359,2991360,2991361,2991362,2991363,2991364,2991365,2991366,2991367,2991368,2991369,2991370,2991371,2991372,2991373,2991374,2991376,2991377,2991378,2991379,2991380,2991381,2991382,2991383,2991384
SP  - 181
EP  - 228
ER  - 

TY  - JOUR
TI  - ON A NEARLY COMPLETE LIZARD SKULL FROM THE OLIGOCENE OF NEBRASKA
AU  - Charles W. Gilmore
JO  - THE UNIVERSITY OF KANSAS SCIENCE BULLETIN
VL  - XVI
IS  - 6
PY  - 1926-03-01
SN  - 0022-8850
UR  - https://biodiversitylibrary.org/page/2991387
N2  - 2991387,2991388,2991389,2991390
SP  - 229
EP  - 231
ER  - 
emnybg commented 2 months ago

Rod,

I’m still gobsmacked by what ChatGPT can do (although sometimes it can be a little stubborn). I’ve been feeding it TOC OCR one item at a time. It’s working, and infinitely faster than what I was doing before, but this is another order of magnitude. I just asked ChatGPT to convert yours to table format, change the volume number Roman numerals to Arabic, reverse author names to Last, First, and format the title in sentence case. That is probably all the cleaning up it needs. I estimate about 1/3 of the articles are already defined, via BioStor, which means many of the authors will be an easy lookup as well.
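For anyone who would rather script that clean-up than prompt for it, here is a minimal Python sketch of the same four steps (RIS to rows, Roman to Arabic volume numbers, "Last, First" author names, sentence-case titles). The function names and row keys are made up for illustration, and the sample record is copied from the RIS output above. The name reversal naively treats the last word as the surname, and the sentence-casing lower-cases genus names too (e.g. "bacillus"), so both would still need a manual check.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(text):
    """Convert a Roman numeral such as 'XVI' to an int; Arabic digits pass through."""
    text = text.strip().upper()
    if text.isdigit():
        return int(text)
    total = 0
    for i, ch in enumerate(text):
        value = ROMAN[ch]
        # A smaller numeral before a larger one is subtracted (IV, IX, XL ...).
        if i + 1 < len(text) and ROMAN[text[i + 1]] > value:
            total -= value
        else:
            total += value
    return total

def last_first(name):
    """'Noble P. Sherwood' -> 'Sherwood, Noble P.' (assumes the last word is the surname)."""
    parts = name.split()
    if len(parts) < 2:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"

def sentence_case(title):
    """Drop trailing '.' and '*' markers, then capitalise only the first letter."""
    title = title.strip().rstrip(".*").strip()
    return title[:1].upper() + title[1:].lower()

def parse_ris(text):
    """Split RIS text into records; each record maps a two-character tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z][A-Z0-9])\s+-\s?(.*)$", line)
        if not match:
            continue
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records

def to_row(record):
    """Flatten one parsed RIS record into a spreadsheet-style row."""
    return {
        "title": sentence_case(record["TI"][0]),
        "authors": "; ".join(last_first(a) for a in record.get("AU", [])),
        "volume": roman_to_int(record["VL"][0]),
        "issue": record.get("IS", [""])[0],
        "start_page": record.get("SP", [""])[0],
        "end_page": record.get("EP", [""])[0],
        "url": record.get("UR", [""])[0],
    }

sample = """TY  - JOUR
TI  - STUDIES ON BACILLUS PYOCYANEUS
AU  - Noble P. Sherwood
AU  - T. L. Johnson
AU  - Ida Radotincky
JO  - The University of Kansas Science Bulletin
VL  - XVI
IS  - 2
PY  - 1926-03-01
SN  - 0022-8850
UR  - https://biodiversitylibrary.org/page/2991249
SP  - 91
EP  - 100
ER  - 
"""

rows = [to_row(r) for r in parse_ris(sample)]
print(rows[0])
```

Writing `rows` out with `csv.DictWriter` and `delimiter="\t"` would give a TSV that opens directly in a spreadsheet.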

Curious if you are using the TOC pages to find articles? There are variations between volumes that might cause problems.

Interesting that you’re getting different journal names within the same volume. FWIW, The University of Kansas science bulletin, title id 3179, v16-55, used to be the Kansas University science bulletin, title id 15415, v1-15. They have different ISSNs, but either ISSN pulls down the articles for both titles from Crossref (thanks Nicole for the CoLab code!).

So yes, if you can give me output for both title ids 3179 and 15415 it would be fantastic. Is this code that you could eventually put up on CoLab?

Ellen

On Apr 26, 2024, at 6:46 AM, Roderic Page @.***> wrote:


rdmpage commented 2 months ago

@emnybg

I'll take a look at processing these two titles. My code is still evolving, and isn't in Python (I use PHP) so won't be in CoLab. But I am investigating ways to get this online in a usable form.

My approach depends on the type of item (is it a single title, is it a set of articles in a volume, is it a set of issues with one article per issue, etc.).

If there is a set of articles/issues then I use ChatGPT on the TOC pages to get basic information (title, pages). I then find the corresponding starting pages in the item (based on page number, with string matching to check), then get metadata from those starting pages. This can result in erratic journal names. If I don't get a journal name I defer to the BHL title (which itself can be "wrong").
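The page-matching step could look roughly like the Python sketch below. It is not the PHP code described here, only an illustration of the idea: it assumes the TOC entries have already been extracted, and that each page of the item is available as a dict with a BHL PageID, a printed page number and OCR text. Those field names, the 0.8 threshold and the OCR snippets are all invented for the example; the PageIDs and page numbers come from the RIS output earlier in the thread.

```python
from difflib import SequenceMatcher

def normalise(text):
    """Upper-case and collapse whitespace so OCR line breaks do not block a match."""
    return " ".join(text.upper().split())

def title_score(title, page_text):
    """Best similarity between the title and any title-length window near the top of the page."""
    title, head = normalise(title), normalise(page_text)[:500]
    if not title or not head:
        return 0.0
    if title in head:
        return 1.0
    width = len(title)
    best = 0.0
    for start in range(0, max(1, len(head) - width + 1)):
        ratio = SequenceMatcher(None, title, head[start:start + width]).ratio()
        best = max(best, ratio)
    return best

def locate_articles(toc_entries, pages, threshold=0.8):
    """Find candidate start pages by printed page number, then check them by string matching."""
    results = []
    for entry in toc_entries:
        candidates = [p for p in pages if p["number"] == str(entry["page"])]
        scored = [(title_score(entry["title"], p["text"]), p) for p in candidates]
        score, page = max(scored, key=lambda pair: pair[0], default=(0.0, None))
        results.append({
            "title": entry["title"],
            "page_id": page["page_id"] if page else None,
            "score": round(score, 2),
            # Only trust the page number when the title is also found on that page.
            "confirmed": page is not None and score >= threshold,
        })
    return results

# Invented OCR snippets; the second one contains a typical OCR error ("0F").
pages = [
    {"page_id": 2991249, "number": "91", "text": "STUDIES ON BACILLUS PYOCYANEUS\n(sample text)"},
    {"page_id": 2991250, "number": "92", "text": "(sample text of a continuation page)"},
    {"page_id": 2991259, "number": "101", "text": "THE GENUS ERYTHRONEURA NORTH 0F MEXICO\n(sample text)"},
]
toc = [
    {"title": "Studies on Bacillus pyocyaneus", "page": 91},
    {"title": "The genus Erythroneura north of Mexico", "page": 101},
    {"title": "An entry with a wrong page number", "page": 300},
]

results = locate_articles(toc, pages)
for row in results:
    print(row)
```

Because every page with a matching printed number is kept as a candidate and the best-scoring one wins, the same check gives some help with items where several issues each restart at page 1, though that case would need proper testing.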

I'm using it successfully on a couple of titles I'm working on. I haven't tested it yet on tricky cases such as where we have multiple issues in the same item that each start on page 1. I have code elsewhere that I may have to use to help sort that out.

I'll keep you posted. I need to download a chunk of BHL content and run the scripts. I can give you the results in TSV if that's your preference.

Rod

emnybg commented 2 months ago

The journal name is actually irrelevant, I just wondered why it was inconsistent.

Plain text as in your example is fine. TSV is fine too if you want to reformat it as spreadsheet-friendly data, but I can do that myself.

In the Kansas items that I’ve seen so far, some have a single TOC near the start. Some have additional annotated TOCs that belong to long articles, which we don’t want, but which are easy to remove if there aren’t a lot of them. This is typically true of single-article issues, for which I grab the title from the title page instead. Some volumes (including some that are named as Part 1) have a Part 2 TOC around the middle of the volume; neither page numbers nor issue numbers restart, but the TOC page is buried. So, probably every variation you can think of.

Ellen

On Apr 26, 2024, at 9:09 AM, Roderic Page @.***> wrote:


rdmpage commented 2 months ago

@emnybg OK here is a first stab for 3179 (https://www.biodiversitylibrary.org/bibliography/3179). Attached is a series of TSV files, each labelled with the corresponding BHL item. I've not had much of a look at these, some look at least vaguely sensible. If there are any that are clearly rubbish, let me know and I'll see what I can do. I'll be tweaking the code as I work through some journals that I'm currently working on.

Archive.zip https://github.com/gbhl/bhl-segment-definition/files/15134069/Archive.zip

emnybg commented 2 months ago

This looks really good. Most of the “rubbish” is from volume 38, which has two very long articles with their own TOC, which generated rows that are easily recognizable as bad.

I had collected v16-35. I did a quick comparison using a concatenated volume and start page as the key and got 239 matches against the 257 rows I had collected. If I see any systemic problems as I work through them I’ll let you know, but so far it doesn’t look like it.
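A comparison like this can be sketched in a few lines of Python. The row keys (`volume`, `start_page`) and the sample rows are assumptions for illustration, not the actual spreadsheet columns, and the volume numbers are assumed to be Arabic already.

```python
def key(row):
    """Concatenate volume and start page, e.g. volume 16, page 91 -> '16-91'."""
    return f"{int(row['volume'])}-{int(row['start_page'])}"

def compare(manual_rows, generated_rows):
    """Report which volume/start-page keys appear in both lists or in only one of them."""
    manual = {key(r) for r in manual_rows}
    generated = {key(r) for r in generated_rows}
    return {
        "matched": sorted(manual & generated),
        "manual_only": sorted(manual - generated),
        "generated_only": sorted(generated - manual),
    }

# Hypothetical rows: start pages 5, 91, 101 and 157 appear in the RIS output above,
# but which list each row belongs to is made up for the example.
manual_rows = [
    {"volume": 16, "start_page": 5},
    {"volume": 16, "start_page": 91},
    {"volume": 16, "start_page": 101},
]
generated_rows = [
    {"volume": "16", "start_page": "5"},
    {"volume": "16", "start_page": "91"},
    {"volume": "16", "start_page": "157"},
]

report = compare(manual_rows, generated_rows)
print(len(report["matched"]), "matches;",
      "manual only:", report["manual_only"],
      "generated only:", report["generated_only"])
```

The separator in the key matters: without it, volume 1 page 65 and volume 16 page 5 would both become "165" and count as a match.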

Ellen

On Apr 26, 2024, at 2:21 PM, Roderic Page @.***> wrote:
