gdower closed this issue 5 years ago
"I'm planning on writing code to correct the author string"
You mean do a lookup to check whether the string is Aus bus (M., 1870) or Aus bus M., 1870, then correct the string? FYI, I've just audited a dataset which had the following malformed author strings:
1844
Cr.)
Dahlborn)
([Den. & Sch.])
Den. & Schiff.)
[Den. & Schiff.]
Dewitz)
Gr. & Rob.)
(Grote
((Herr.-Sch.)
Hew.)
)Holland)
Klug)
micr(Chaudoir)
Olliff)
([Schiff.])
tetra(Gray)
)Wlkr.)
I'm fixing the malformed author strings before submitting them to gnparser.
I'm harvesting this dataset with a web crawler, so there's additional HTML formatting that I can use to accurately isolate the author string. I'm just testing for cases where there is either an opening parenthesis and no closing parenthesis, or vice versa, and appending the missing parenthesis.
Thanks for the heads up. So far the web crawler hasn't hit any brackets or nested parentheses yet for this dataset, which would definitely be more challenging to correct automatically. I'm also logging a warning and will manually review that the author strings were corrected properly.
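A minimal sketch of the check described above, assuming the author string has already been isolated from the HTML. The function name `balance_parentheses` and the logger name are hypothetical, and this is not the crawler's actual code. It only repairs the simple case of one unmatched parenthesis, leaves strings with brackets or several parentheses for manual review, and logs a warning either way. Strings that are balanced but still malformed, such as `micr(Chaudoir)`, pass through unchanged.

```python
import logging

logger = logging.getLogger("author_cleanup")  # hypothetical logger name


def balance_parentheses(author: str) -> str:
    """Add one missing parenthesis; leave every other case untouched."""
    opens, closes = author.count("("), author.count(")")
    if opens == closes:
        return author  # nothing to balance
    # Brackets or more than one parenthesis on a side are too ambiguous
    # to correct automatically, so only warn and return the input.
    if "[" in author or "]" in author or max(opens, closes) > 1:
        logger.warning("Not auto-correcting %r; needs manual review", author)
        return author
    # Exactly one parenthesis is present: add its missing partner.
    fixed = author + ")" if opens > closes else "(" + author
    logger.warning("Corrected %r to %r; please review", author, fixed)
    return fixed
```

For example, `balance_parentheses("Dahlborn)")` gives `(Dahlborn)` and `balance_parentheses("(Grote")` gives `(Grote)`, while `)Holland)` and `((Herr.-Sch.)` are returned as-is with a warning.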
Ta.
"I'm just testing for cases where there is either an opening parenthesis and no closing parenthesis, or vice versa, and appending the missing parenthesis."
...or removing the single parenthesis, if the correct form (say, in GBIF's backbone) is no parentheses?
I found 2 other issues related to author string parsing:
1) If the author name includes an apostrophe (e.g., O’Donnell) some software editors replace the apostrophe with a curly apostrophe, which breaks author parsing:
With curly apostrophe: https://parser.globalnames.org/?q=Ambaeolothrips+pampeanus+Mound%2C+Cavalleri%2C+O%E2%80%99Donnell%2C+Infante%2C+Ortiz+%26+Goldarazena%2C+2016
With regular apostrophe: https://parser.globalnames.org/?q=Ambaeolothrips+pampeanus+Mound%2C+Cavalleri%2C+O%27Donnell%2C+Infante%2C+Ortiz+%26+Goldarazena%2C+2016
2) Author first name initials that include hyphens break authorship parsing:
Xie Y-H, Yuan S-Y, Li Z-Y & Zhang H-R, 2013
Hyphenated: https://parser.globalnames.org/?q=Ctenothrips+yangi+Xie+Y-H%2C+Yuan+S-Y%2C+Li+Z-Y+%26+Zhang+H-R%2C+2013
Hyphens removed: https://parser.globalnames.org/?q=Ctenothrips+yangi+Xie+Y+H%2C+Yuan+S+Y%2C+Li+Z+Y+%26+Zhang+H+R%2C+2013
Removing the hyphens likely is not the proper way of formatting these author strings--I just removed the hyphens to show that the parser isn't handling hyphenated first names correctly.
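As a workaround sketch for the first issue, curly apostrophes can be replaced with plain ones before a name is sent to the parser. The same snippet builds web-parser links like the ones above, so that variants can be compared. The function names are hypothetical, and handling U+2018 as well as U+2019 is an assumption beyond the example given here.

```python
from urllib.parse import urlencode

PARSER_URL = "https://parser.globalnames.org/"  # base URL taken from the links above

# U+2019 is the curly apostrophe from the O’Donnell example;
# U+2018 (left single quote) is included as an extra assumption.
APOSTROPHES = str.maketrans({"\u2019": "'", "\u2018": "'"})


def normalize_apostrophes(name: str) -> str:
    """Replace curly apostrophes with the plain ASCII apostrophe."""
    return name.translate(APOSTROPHES)


def parser_link(name: str) -> str:
    """Build a web-parser link with the name URL-encoded in the q parameter."""
    return PARSER_URL + "?" + urlencode({"q": name})
```

Calling `parser_link(normalize_apostrophes(name))` on the curly-apostrophe name reproduces the "regular apostrophe" link above.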
Authors in parentheses are the original ones, so I guess that if the opening parenthesis is missing, it is safe to assume that everything from the start of the authorship up to the closing parenthesis is the original authors. A missing closing parenthesis, however, is more dangerous to guess at.
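That asymmetry could be sketched as follows. This is an illustration of the heuristic only, not gnparser's implementation. The function name is hypothetical, and the second return value is an assumed "needs review" flag.

```python
from typing import Tuple


def add_missing_open_paren(authorship: str) -> Tuple[str, bool]:
    """Return (authorship, needs_review).

    A lone closing parenthesis is repaired by opening the parenthesis at
    the start of the authorship. Any other imbalance is left unchanged
    and flagged, because the end of the original authors cannot be
    inferred reliably.
    """
    opens, closes = authorship.count("("), authorship.count(")")
    if opens == closes:
        return authorship, False
    if opens == 0 and closes == 1:
        # Everything before ")" is taken to be the original authorship.
        return "(" + authorship, False
    return authorship, True
```

With the made-up authorship `M.) N., 1880`, this gives `(M.) N., 1880` with no review flag, whereas `(M., 1870` comes back unchanged and flagged.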
@dimus, of the author strings in CoL with hyphenated author given name initials, only around 2% don't include periods in the initials. Variants include:
Last A-B.
Last A-B
Last A.-B
A-B. Last
A-B Last
A.-B Last
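A rough sketch for finding hyphenated initials in an author string and flagging the ones that lack a period. It assumes initials are single ASCII capitals, which will miss non-ASCII initials, and the names used are hypothetical.

```python
import re

# Two single capital letters joined by a hyphen, each optionally followed
# by a period, and not part of a longer word (so "Herr.-Sch." is skipped).
HYPHENATED_INITIALS = re.compile(r"(?<![A-Za-z])[A-Z]\.?-[A-Z]\.?(?![A-Za-z])")


def hyphenated_initials(author: str) -> list:
    """Return every hyphenated pair of initials found in the string."""
    return HYPHENATED_INITIALS.findall(author)


def lacks_periods(initials: str) -> bool:
    """True when at least one of the two initials has no period."""
    return initials.count(".") < 2
```

On `Xie Y-H, Yuan S-Y, Li Z-Y & Zhang H-R, 2013` this finds `Y-H`, `S-Y`, `Z-Y` and `H-R`, all of which lack periods.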
I notified the data provider of the parentheses typos, and he corrected them.
Thanks for your updates!
I added the missing-parenthesis cases to the Go parser (https://gitlab.com/gogna/gnparser/issues/40); the change is part of v0.7.1.
Closing it here: the issues are resolved in https://gitlab.com/gogna/gnparser/issues/28 and https://gitlab.com/gogna/gnparser/issues/40
If an author string is missing a parenthesis, gnparser considers it as an unparseable tail. For example:
Aus bus M., 1870) https://parser.globalnames.org/?q=Aus+bus+M.%2C+1870%29
Aus bus (M., 1870 https://parser.globalnames.org/?q=Aus+bus+%28M.%2C+1870
That might be the desired behavior, although it might also be useful if gnparser could parse malformed author strings with a warning.
I'm planning on writing code to correct the author string so that it can be parsed by gnparser.