gbif / name-parser

The core GBIF scientific name parser library
Apache License 2.0

detect nom refs better #44

Open mdoering opened 5 years ago

mdoering commented 5 years ago

We detect nomenclatural references inside a name, or more precisely at its end, by looking either for common keywords like Journal or for a numbering block for volumes/pages like 8(5): 563.
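For illustration, the two heuristics could be sketched roughly like this in Python. This is not the parser's actual code: the regular expressions, the function name `find_nom_ref_start` and the placeholder name "Aus bus" are assumptions made up for this sketch.

```python
import re

# Only "Journal" is named in this issue; any further keywords would be assumptions.
KEYWORDS = re.compile(r"\bJournal\b")

# Numbering block: volume, optional (issue), colon, first page, e.g. "8(5): 563".
NUMBERING_BLOCK = re.compile(r"\b\d+\s*(?:\(\s*\d+\s*\))?\s*:\s*\d+")


def find_nom_ref_start(name):
    """Return the offset of the earliest keyword or numbering block, else None."""
    matches = (KEYWORDS.search(name), NUMBERING_BLOCK.search(name))
    starts = [m.start() for m in matches if m]
    return min(starts) if starts else None


# "Aus bus" is a placeholder name, not a real taxon.
print(find_nom_ref_start("Aus bus Smith, Some Bulletin 8(5): 563"))
print(find_nom_ref_start("Aus bus Smith"))
```

Note that a numbering block only tells us where the volume starts, not where the journal title before it begins, which is the gap a list of known journals would close.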

We can improve this by actually looking for known journals, thereby also removing arbitrary middle titles, using a procedure Guido has used for years:

Regarding a list of journal names: Maybe it's possible to extract one from all the DwC-As? The journal names should mostly be right before what I've come to call the numbering block, and given a sufficiently high number of references, that might yield a pretty extensive list. You might end up cutting a few short or including a trailing chunk of the title, but it's a good starting point. From there, extracting common phrases from the raw journal names should help get rid of the title chunks. With that, you can then revisit the reference list and see if you find more.
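A rough Python sketch of that first harvesting pass, under loud assumptions: the journal name is taken to be the run of capitalised words (plus a few lowercase connectors) directly before the numbering block. The `CONNECTORS` set, the function names and the sample references are hypothetical and only serve to show the idea.

```python
import re
from collections import Counter

NUMBERING_BLOCK = re.compile(r"\b\d+\s*(?:\(\s*\d+\s*\))?\s*:\s*\d+")

# Hypothetical list of lowercase words allowed inside a journal name.
CONNECTORS = {"of", "the", "and", "van", "de", "der", "et"}


def candidate_journal(reference):
    """Guess the journal name as the capitalised run before the numbering block."""
    m = NUMBERING_BLOCK.search(reference)
    if not m:
        return None
    picked = []
    # Walk backwards from the numbering block while tokens look like name parts.
    for token in reversed(reference[:m.start()].split()):
        word = token.strip(",;")
        if word[:1].isupper() or word.lower() in CONNECTORS:
            picked.append(word)
        else:
            break
    # A journal name should not start with a connector.
    while picked and picked[-1].lower() in CONNECTORS:
        picked.pop()
    return " ".join(reversed(picked)) or None


def count_journal_candidates(references):
    """Count how often each candidate journal name occurs."""
    found = (candidate_journal(r) for r in references)
    return Counter(name for name in found if name)


# Made-up references for illustration only.
refs = [
    "Smith, 1900. a new species. Journal of Foo 8(5): 563",
    "Jones, 1901. notes on things. Journal of Foo 9: 12",
    "no numbering block here",
]
print(count_journal_candidates(refs).most_common(1))
```

As the quote warns, a capitalised title word right before the journal name ends up in the candidate, so the frequency counts are what make the common-phrase cleanup possible afterwards.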

Abbreviations can be handled nicely if you extract the sequence of initial capital letters of the name parts (e.g. "Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen" would become "PKNAW") and index all the journal names by that, in buckets. Then you can go through a matching bucket and see if your input phrase matches the starts of all words of some journal name in there, in ordered sequence, of course. That would match "Proc. Koninkl. Nederl. Akad. Wetensch.", for instance.
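The bucket lookup could look roughly like the following Python sketch. It is one reading of the description above, not Guido's actual procedure: only capitalised words are compared, matching is a plain case-insensitive prefix test, and all function names are made up here.

```python
from collections import defaultdict


def capital_words(phrase):
    """Words that start with an uppercase letter, in order."""
    return [w for w in phrase.split() if w[:1].isupper()]


def capital_key(phrase):
    """Sequence of initial capital letters, used as the bucket key."""
    return "".join(w[0] for w in capital_words(phrase))


def build_index(journals):
    """Group full journal names into buckets by their capital key."""
    index = defaultdict(list)
    for journal in journals:
        index[capital_key(journal)].append(journal)
    return index


def lookup(abbrev, index):
    """Return journals in the matching bucket whose words start with the abbreviated parts."""
    parts = [p.rstrip(".,").lower() for p in capital_words(abbrev)]
    hits = []
    for journal in index.get(capital_key(abbrev), []):
        full = [w.lower() for w in capital_words(journal)]
        if len(full) == len(parts) and all(
            f.startswith(p) for p, f in zip(parts, full)
        ):
            hits.append(journal)
    return hits


index = build_index(
    ["Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen"]
)
print(lookup("Proc. Koninkl. Nederl. Akad. Wetensch.", index))
```

The bucket key keeps the candidate set small, so the prefix comparison only runs against journals sharing the same capital sequence. Contracted abbreviations that are not true prefixes of the full word would not match with this simple test and would need a looser comparison.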