GlobalNamesArchitecture / gnparser

Split scientific names to meaningful elements with meta information
https://parser.globalnames.org/
MIT License

Malformed author strings #474

Closed gdower closed 5 years ago

gdower commented 5 years ago

If an author string is missing a parenthesis, gnparser treats it as an unparseable tail. For example:

Aus bus M., 1870) https://parser.globalnames.org/?q=Aus+bus+M.%2C+1870%29

Aus bus (M., 1870 https://parser.globalnames.org/?q=Aus+bus+%28M.%2C+1870

That might be the desired behavior, although it might also be useful if gnparser could parse malformed author strings with a warning.

I'm planning on writing code to correct the author string so that it can be parsed by gnparser.

Mesibov commented 5 years ago

"I'm planning on writing code to correct the author string"

You mean do a lookup to check whether the string is Aus bus (M., 1870) or Aus bus M., 1870, then correct the string? FYI, I've just audited a dataset which had the following malformed author strings:

1844 Cr.) Dahlborn) ([Den. & Sch.]) Den. & Schiff.) [Den. & Schiff.] Dewitz) Gr. & Rob.) (Grote ((Herr.-Sch.) Hew.) )Holland) Klug) micr(Chaudoir) Olliff) ([Schiff.]) tetra(Gray) )Wlkr.)

gdower commented 5 years ago

I'm fixing the malformed author strings before submitting them to gnparser.

I'm harvesting this dataset with a web crawler, so there's additional HTML formatting that I can use to accurately isolate the author string. I'm just testing for cases where there is either an opening parenthesis and no closing parenthesis, or vice versa, and appending the missing parenthesis.

Thanks for the heads up. So far the web crawler hasn't hit any brackets or nested parentheses yet for this dataset, which would definitely be more challenging to correct automatically. I'm also logging a warning and will manually review that the author strings were corrected properly.
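
The check described above could be sketched roughly as follows. This is not gdower's actual crawler code: the function name `balance_parentheses` and the logger name are hypothetical, and the sketch assumes the author string has already been isolated from the rest of the name. It only touches strings with exactly one unmatched parenthesis and leaves brackets, nested parentheses, and anything else unchanged for manual review.

```python
import logging

logger = logging.getLogger("author_cleanup")  # hypothetical logger name


def balance_parentheses(author: str) -> str:
    """Add one missing parenthesis to an isolated author string.

    Handles only the simple case of exactly one unmatched "(" or ")".
    Any other input is returned unchanged so it can be reviewed by hand.
    """
    opens, closes = author.count("("), author.count(")")
    if opens == 1 and closes == 0:
        fixed = author + ")"  # "(M., 1870" -> "(M., 1870)"
    elif opens == 0 and closes == 1:
        fixed = "(" + author  # "M., 1870)" -> "(M., 1870)"
    else:
        return author
    # Log every correction so it can be checked manually later.
    logger.warning("corrected author string %r -> %r", author, fixed)
    return fixed
```

Placing the added parenthesis at the very start or end of the string is itself an assumption; it is only reasonable when the whole isolated string is the parenthesised authorship.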

Mesibov commented 5 years ago

Ta.

"I'm just testing for cases where there is either an opening parenthesis and no closing parenthesis, or vice versa, and appending the missing parenthesis."

...or removing the single parenthesis, if the correct form (say, in GBIF's backbone) is no parentheses?

gdower commented 5 years ago

I found 2 other issues related to author string parsing:

1) If the author name includes an apostrophe (e.g., O'Donnell), some text editors replace the straight apostrophe with a curly one (O’Donnell), which breaks author parsing:

With curly apostrophe: https://parser.globalnames.org/?q=Ambaeolothrips+pampeanus+Mound%2C+Cavalleri%2C+O%E2%80%99Donnell%2C+Infante%2C+Ortiz+%26+Goldarazena%2C+2016

With regular apostrophe: https://parser.globalnames.org/?q=Ambaeolothrips+pampeanus+Mound%2C+Cavalleri%2C+O%27Donnell%2C+Infante%2C+Ortiz+%26+Goldarazena%2C+2016
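
Until the parser handles this itself, the curly apostrophe can be replaced before the name is submitted. A minimal sketch, assuming a hypothetical helper `normalize_apostrophes`; the set of look-alike characters is my own choice and not something gnparser defines:

```python
# Characters that editors commonly substitute for the ASCII apostrophe.
# This set is an assumption; extend it as a dataset requires.
_APOSTROPHES = str.maketrans({
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (the curly apostrophe)
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
})


def normalize_apostrophes(name: str) -> str:
    """Replace apostrophe look-alikes with the plain ASCII apostrophe."""
    return name.translate(_APOSTROPHES)
```

Applied to the example above, `normalize_apostrophes("O\u2019Donnell")` gives the plain `O'Donnell` form used in the second link.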

2) Author first name initials that include hyphens break authorship parsing:

Xie Y-H, Yuan S-Y, Li Z-Y & Zhang H-R, 2013

Hyphenated: https://parser.globalnames.org/?q=Ctenothrips+yangi+Xie+Y-H%2C+Yuan+S-Y%2C+Li+Z-Y+%26+Zhang+H-R%2C+2013

Hyphens removed: https://parser.globalnames.org/?q=Ctenothrips+yangi+Xie+Y+H%2C+Yuan+S+Y%2C+Li+Z+Y+%26+Zhang+H+R%2C+2013

Removing the hyphens likely is not the proper way of formatting these author strings--I just removed the hyphens to show that the parser isn't handling hyphenated first names correctly.

dimus commented 5 years ago

  1. Missing open parenthesis: I would say this opens a can of worms that I am afraid to deal with.
  2. Curly apostrophe sounds like a safe addition; +1 for adding it to the pre-processing stage.
  3. A hyphen without a period is something I haven't met before. Do you have many names like this, @gdower?

dimus commented 5 years ago

Ta.

"I'm just testing for cases where there is either an opening parenthesis and no closing parenthesis, or vice versa, and appending the missing parenthesis."

...or removing the single parenthesis, if the correct form (say, in GBIF's backbone) is no parentheses?

Authors in parentheses are the original authors. If the opening parenthesis is missing, I guess it is safe to assume that everything from the start of the authorship up to the closing parenthesis is the original authorship. A missing closing parenthesis, however, is more dangerous to guess at.

gdower commented 5 years ago

@dimus, of the author strings in CoL with hyphenated author given name initials, only around 2% don't include periods in the initials. Variants include:

Last A-B.
Last A-B
Last A.-B
A-B. Last
A-B Last
A.-B Last
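
One way to handle these variants before parsing is to rewrite them all to the fully dotted form `A.-B.`. The sketch below is an assumption on my part, not something gnparser does; the function name `dot_hyphenated_initials` is hypothetical, and the result should still be checked against the parser. It only matches single ASCII capital letters on both sides of the hyphen, so hyphenated surnames and abbreviations such as Herr.-Sch. are left alone.

```python
import re

# Two single capital letters joined by a hyphen, each optionally followed
# by a period, and not part of a longer word on either side.
_HYPHENATED_INITIALS = re.compile(r"(?<!\w)([A-Z])\.?-([A-Z])\.?(?!\w)")


def dot_hyphenated_initials(author: str) -> str:
    """Rewrite initials such as "A-B", "A-B." or "A.-B" as "A.-B."."""
    return _HYPHENATED_INITIALS.sub(r"\1.-\2.", author)
```

For the earlier example, `dot_hyphenated_initials("Xie Y-H, Yuan S-Y, Li Z-Y & Zhang H-R, 2013")` returns `"Xie Y.-H., Yuan S.-Y., Li Z.-Y. & Zhang H.-R., 2013"`.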

I notified the data provider of the parentheses typos, and he corrected them.

Thanks for your updates!

dimus commented 5 years ago

I added the missing-parenthesis cases to the Go parser (https://gitlab.com/gogna/gnparser/issues/40); it is part of v0.7.1.

dimus commented 5 years ago

Closing it here; the issues are resolved in https://gitlab.com/gogna/gnparser/issues/28 and https://gitlab.com/gogna/gnparser/issues/40