bionomia / dwc_agent

Ruby gem to cleanse Darwin Core terms containing people names prior to passing to its dependent parser. Comes with a command-line utility.
MIT License

Improving of name parsing (parse and clean) #18

Open infinite-dao opened 1 year ago

infinite-dao commented 1 year ago

Hej-hej,

I’m aware how difficult it is to parse and clean every kind of name list out there. Here are some more names that are sometimes parsed but then get lost during cleaning (see attachment; dwcagent 3.0.8.0, run through the wrapper https://github.com/infinite-dao/collector-matching/blob/main/bin/agent_parse4tsv.rb).

In our BGBM data, the name-list separator often looks like the regex /(, | & )/, e.g.:

So there are cases where the parsing of the particle could be improved. A difficult case, which I think the parser can only partly solve, is:

The attached files contain names from BGBM and Meise. In the log file, look at the column related_parsed_name; cleaned_index_name_of_empty_result is the index of a cleaned result that comes back empty. Most names are herbarium entries or institutions, but there are also some real names, e.g. in Meise:

dwcagent "A. Charpin, P. Hainard & R. Salanon"  \
 | jq -c '.[] | with_entries(select(.value |.!=null))'

… the 3rd name seems missing:

{"family":"Charpin","given":"A."}
{"family":"Hainard","given":"P."}
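For comparison, a plain split on the separator regex mentioned above, /(, | & )/, does recover all three names from this string. This is a minimal Ruby sketch using only core `String#split`; `split_name_list` is a hypothetical helper for illustration and not how dwc_agent works internally. The group is written without capturing parentheses so that `split` does not return the separators themselves.

```ruby
# Hypothetical helper (not part of dwc_agent): split a collector string
# on ", " or " & ", the separator pattern described for the BGBM data.
SEPARATOR = /, | & /

def split_name_list(text)
  text.split(SEPARATOR).map(&:strip).reject(&:empty?)
end

p split_name_list("A. Charpin, P. Hainard & R. Salanon")
p split_name_list("Classe,P. & Gebauer,R.")
```

A comma without a following space, as in Classe,P., is left alone, so the family/given pairs stay together. Inputs such as Marck,J.W.C.T., von der would still be split in the wrong place, which is one reason a simple separator cannot replace the parser.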

Attached log files from parsing with wrapper agent_parse4tsv.rb (see https://github.com/infinite-dao/collector-matching/tree/main/bin; I hope the column names of the files are self-explanatory):

dshorthouse commented 1 year ago

Thanks for the great work on this. The A. Charpin, P. Hainard & R. Salanon case was caused by an overly greedy use of 'anon' in a BLACKLIST, and that's now fixed. The variable placement of 'von' in the source should also be accommodated in v3.0.9, just pushed. The Álvarez de Zayas,A. example requires some more investigation.
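To illustrate the kind of greedy match described here, the sketch below compares a substring check with a whole-word check. The one-entry `BLACKLIST` and both method names are assumptions made for this example; the actual list and matching code in dwc_agent may look quite different.

```ruby
# Hypothetical one-entry blacklist; the real list in dwc_agent is longer.
BLACKLIST = %w[anon].freeze

# Greedy variant: any substring hit rejects the name.
def rejected_greedy?(name)
  BLACKLIST.any? { |term| name.downcase.include?(term) }
end

# Stricter variant: only whole-word hits count.
def rejected_whole_word?(name)
  BLACKLIST.any? { |term| name.match?(/\b#{Regexp.escape(term)}\b/i) }
end

p rejected_greedy?("R. Salanon")      # true  ("anon" is inside "Salanon")
p rejected_whole_word?("R. Salanon")  # false
p rejected_whole_word?("Anon")        # true
```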

infinite-dao commented 1 year ago

Thank you for taking on this issue.

Here are some more ;-)

| related_parsed_name | after cleaning | source string | comment |
|---|---|---|---|
| L.E. Bureau | at cleaned_0index:0 | Bureau,L.E. | |
| Émil Bureau | at cleaned_0index:0 | Bureau,Émil | |
| E.R. Guaglianone | at cleaned_0index:2 | Burkart A., Troncoso,N.S., Guaglianone,E.R., Rotman,A., Botta,S. & Buck,H. | |
| P. Classe | at cleaned_0index:0 | Classe,P. & Gebauer,R. | |
| G. Classen | at cleaned_0index:0 | Classen,G. | |
| R. Claßen | at cleaned_0index:0 | Claßen,R. & Hagemann,I. | |
| J.B.L. Companyo | at cleaned_0index:0 | Companyo,J.B.L. | |
| Farmer Braun | at cleaned_0index:0 | Farmer Braun | |
| Theodor Magnus Fries | at cleaned_0index:0 | Fries,Theodor (Thore) Magnus & & al. | probably also faulty input |
| Goetzen Graf von | at cleaned_0index:0 | Goetzen Graf von & Maire,C. | |
| E.R. Guaglianone | at cleaned_0index:0 | Guaglianone,E.R. & Múlgura,M.E. | |
| H u. S | at cleaned_0index:0 | H u. S | |
| H. d. D. T. | at cleaned_0index:0 | H. d. D. T. | |
| R. Claßen | at cleaned_0index:1 | Hagemann,I. & Claßen,R. | |
| J. Poel van de | at cleaned_0index:3 | Hammel,B., Chatrou,L., Pérez,I., Poel van de,J. & Wilschut,R. | |
| Heldreich De | at cleaned_0index:0 | Heldreich De | |
| De | at cleaned_0index:1 | Heldreich De & Halácsy,E. von | |
| Theodor Heinrich | at cleaned_0index:1 | Heldreich,Theodor Heinrich von | |
| I. B. | at cleaned_0index:0 | I. B. | |
| L. B. | at cleaned_0index:0 | L. B. | |
| von der | at cleaned_0index:1 | Marck,J.W.C.T., von der | probably also faulty input |
| Martius. C.F.P. von | at cleaned_0index:0 | Martius. C.F.P. von (no. Herb. Fl. Bras. 483) | |
| D.B. Poindexter | at cleaned_0index:1 | Nelson,J.B. & Poindexter,D.B. | |
| G. Classen | at cleaned_0index:1 | Raadts,E. & Classen,G. | |
| August Leopold von | at cleaned_0index:1 | Reuss,August Leopold von & Reuss,A.L. von | |
| Theodor Schube | at cleaned_0index:0 | Schube,Theodor | |
| A. Senoner | at cleaned_0index:0 | Senoner,A. | |
| D. Stafford | at cleaned_0index:0 | Stafford,D. | |
| Theodor Strauss | at cleaned_0index:0 | Strauss,Theodor | |
| Theusch | at cleaned_0index:0 | Theusch | |
| Theusz | at cleaned_0index:0 | Theusz | |
| Theusz | at cleaned_0index:0 | Theusz (no. 517) | |
| Thevenon | at cleaned_0index:0 | Thevenon | |
| Toward | at cleaned_0index:0 | Toward | |
| Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes | |
| Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes de | |
| Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes,B.E.E. de | |
| Winter de | at cleaned_0index:0 | Winter de & Hardy | |
| Theodor Wolf | at cleaned_0index:0 | Wolf,Theodor | |

infinite-dao commented 1 year ago

Here some more names from Meise data:

| related_parsed_name | after cleaning empty | source string | comment |
|---|---|---|---|
| J. B. | at cleaned_0index:0 | ? J. B. | |
| F.M.C. V. | at cleaned_0index:0 | ?F.M.C. V. | |
| Jean Malvaux sc. | at cleaned_0index:0 | ?Jean Malvaux sc. | |
| B. K. | at cleaned_0index:1 | ?P.E.G. & B. K. | |
| A F. | at cleaned_0index:0 | A F. | |
| A W. | at cleaned_0index:0 | A W., Herb. François Crépin | |
| A | at cleaned_0index:0 | A, i J. Kornasiowie | |
| A. B | at cleaned_0index:0 | A. B | |
| A. B | at cleaned_0index:0 | A. B[auy] [= Braun A.?] | |
| P.M. A. | at cleaned_0index:2 | A. Besga, M.A. Domingo, A., P.M. & X. Uribe-Echebarría | |
| A. Blytt in Musée Christiania in F. Crépin | at cleaned_0index:0 | A. Blytt in Musée Christiania in F. Crépin - Herbarium rosarum | OK, difficult: a collection name of some kind |
| A. R | at cleaned_0index:0 | A. R[ac][] | |
| A. R | at cleaned_0index:0 | A. R[aesdonk] | |
| A. R | at cleaned_0index:0 | A. R[oc][] | |
| ABaillot | at cleaned_0index:0 | ABaillot | perhaps very abbreviated, or mistake |
| ABeck | at cleaned_0index:0 | ABeck | perhaps very abbreviated, or mistake |
| ABecke | at cleaned_0index:0 | ABecke | perhaps very abbreviated, or mistake |
| ABecre | at cleaned_0index:0 | ABecre | perhaps very abbreviated, or mistake |
| ABeneschi | at cleaned_0index:0 | ABeneschi | perhaps very abbreviated, or mistake |
| ABlol | at cleaned_0index:0 | ABlol | perhaps very abbreviated, or mistake |
| ABunge | at cleaned_0index:0 | ABunge | perhaps very abbreviated, or mistake |
| AHes | at cleaned_0index:0 | AHes | perhaps very abbreviated, or mistake |
| AHlvay | at cleaned_0index:0 | AHlvay | perhaps very abbreviated, or mistake |
| AKessemantt | at cleaned_0index:0 | AKessemantt | perhaps very abbreviated, or mistake |
| AKranz | at cleaned_0index:0 | AKranz | perhaps very abbreviated, or mistake |
| ALetourneux | at cleaned_0index:0 | ALetourneux | perhaps very abbreviated, or mistake |
| ALinae | at cleaned_0index:0 | ALinae | perhaps very abbreviated, or mistake |
| ALongo | at cleaned_0index:0 | ALongo | perhaps very abbreviated, or mistake |
| ALux | at cleaned_0index:0 | ALux | perhaps very abbreviated, or mistake |
| ANlemtin | at cleaned_0index:0 | ANlemtin | perhaps very abbreviated, or mistake |
| ARothms | at cleaned_0index:0 | ARothms | perhaps very abbreviated, or mistake |
| ARukimo | at cleaned_0index:0 | ARukimo | perhaps very abbreviated, or mistake |
| ARusland | at cleaned_0index:0 | ARusland | perhaps very abbreviated, or mistake |
| ASchumacher | at cleaned_0index:0 | ASchumacher | perhaps very abbreviated, or mistake |
| ASecri | at cleaned_0index:0 | ASecri | perhaps very abbreviated, or mistake |
| ASeegreg | at cleaned_0index:0 | ASeegreg | perhaps very abbreviated, or mistake |
| AVDHangls | at cleaned_0index:0 | AVDHangls | perhaps very abbreviated, or mistake |
| AValon | at cleaned_0index:0 | AValon | perhaps very abbreviated, or mistake |
| AVirans | at cleaned_0index:0 | AVirans | perhaps very abbreviated, or mistake |
| AViranz | at cleaned_0index:0 | AViranz | perhaps very abbreviated, or mistake |
| AVranz | at cleaned_0index:0 | AVranz | perhaps very abbreviated, or mistake |
| AWal | at cleaned_0index:0 | AWal[raven] | perhaps very abbreviated, or mistake |
| AWalcana | at cleaned_0index:0 | AWalcana | perhaps very abbreviated, or mistake |
| AWariouy | at cleaned_0index:0 | AWariouy | perhaps very abbreviated, or mistake |
| AWeatherby | at cleaned_0index:0 | AWeatherby | perhaps very abbreviated, or mistake |
| D. | at cleaned_0index:0 | Abbé D. [Duyany] | |
| P | at cleaned_0index:0 | Abbé P[]ant | |
| Ad de | at cleaned_0index:0 | Ad de [Genssen] | |
| Farmer D. | at cleaned_0index:2 | Akeroyd J., Brookes S., Farmer D. & Jury S. | |
| Arnaud Ch. | at cleaned_0index:0 | Arnaud Ch. | |
| Arnold Cl. | at cleaned_0index:0 | Arnold Cl. | |
| Arnold Fr. | at cleaned_0index:0 | Arnold Fr. | |
| Becker J. Ph. | at cleaned_0index:0 | Becker J.Ph. | |
| Bell C.R | at cleaned_0index:0 | Bell C.R & Bell S.F. | |
| Bequet A.B | at cleaned_0index:0 | Bequet A.B | |
| Bertolani Ch. | at cleaned_0index:0 | Bertolani (o) Ch. | |
| Bidgood S. at al. | at cleaned_0index:0 | Bidgood S. at al. | |
| Binder Em. | at cleaned_0index:0 | Binder Em. | |
| M.L. Blickenstaff | at cleaned_0index:0 | Blickenstaff M.L. | |
| Bouillenne Cl. | at cleaned_0index:0 | Bouillenne Cl. | |
| Granville J.J. De | at cleaned_0index:0 | Granville J.J. De, Acevedo P., Boyer A., Hollenberg L. | |
| Reichgelt Th. | at cleaned_0index:1 | van Ooststroom S.J. & Reichgelt Th. | |

infinite-dao commented 1 year ago

Yes, it is getting better, very good.

With dwcagent version 3.0.11.0 I have parsed names again. Some contain two sets of given names joined by “and” with only one family name, a kind of abbreviated spelling. For example, «R. Mizuno & C. W. and L. B. O'Brien» could be resolved as «R. Mizuno & C. W. O'Brien & L. B. O'Brien». This is difficult, though, because “and” also joins other things, as in «Puerto Rico and the Mona Passage & Desecheo Is.», which looks like mixed input and hence cannot be resolved correctly.

Here are two cases: the first is parsed, cleaned and interpreted as I would expect; in the second, one name is unexpectedly dropped:

  NAMELIST=(
    "M. and C. Jaschhof & Project"
    "S. Yu. and N. V. Kuznetsov"
  )
  # iterate over the array elements directly; no IFS manipulation needed
  for text in "${NAMELIST[@]}"; do
    echo "input: $text"
    dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
  done

… gives:

input: M. and C. Jaschhof & Project

  {"family":"Jaschhof","given":"M."}
  {"family":"Jaschhof","given":"C."}

input: S. Yu. and N. V. Kuznetsov

  {"family":"Kuznetsov","given":"N.V."}

So here are some “and”-concatenations. I hope the comments make sense (by “midst-abbreviated spelling” I mean an abbreviated spelling in the middle of the name list):

| related_parsed_name | after cleaning empty | source string | comment |
|---|---|---|---|
| H. W. | at cleaned_0index:5 | Baltic Amber & Prussian Fm. & C. and H. W. Hoffeins & C. & H. W. | kind of midst-abbreviated spelling |
| M. S. | at cleaned_0index:2 | Blackdown Tableland & Expedition Rg. & M. S. and B. J. Moulds | kind of midst-abbreviated spelling |
| R. H. | at cleaned_0index:2 | C. Liang & W. LaBerge & R. H. and L. D. Beamer | kind of midst-abbreviated spelling |
| C. W. | at cleaned_0index:0 | C. W. and L. B. O’Brien | kind of midst-abbreviated spelling |
| C. W. | at cleaned_0index:0 | C. W. and L. B. O'Brien & R. Mizuno | kind of midst-abbreviated spelling |
| C. W. | at cleaned_0index:0 | C. W. and L. O'Brien | kind of midst-abbreviated spelling |
| C. W. | at cleaned_0index:0 | C. W. and L. O'Brien & G. Wibmer | kind of midst-abbreviated spelling |
| D.T. Le | at cleaned_0index:0 | D.T. Le, D.T. Truong, H.Q. Nguyen, N.H. Nguyen, and A.N. Nguyen | here the name separator seems as regex: /(,\s+\|,\s+and\s+)/ |
| M. V. | at cleaned_0index:1 | E. A. Yagmur & M. V. and S. V. Nabozhenko & B. Keskin & I. Chigr | kind of midst-abbreviated spelling |
| No | at cleaned_0index:2 | F. W. and S. K. Gess & No | kind of midst-abbreviated spelling |
| M. V. | at cleaned_0index:1 | Islahiye District, W & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling |
| team | at cleaned_0index:1 | J. A McGuire and team | correctly cleaned |
| R. J. B. Rockshop field crew | at cleaned_0index:1 | Japh Boyce and R. J. B. Rockshop field crew | correctly cleaned, one could use the non-cleaned parsed result |
| J. Cl. | at cleaned_0index:0 | J. Cl. and P. Gauthier | kind of midst-abbreviated spelling |
| J. H. | at cleaned_0index:0 | J. H., A. M. and A. W. Skevington | kind of midst-abbreviated spelling |
| A. M. | at cleaned_0index:1 | J. H., A. M. and A. W. Skevington | kind of midst-abbreviated spelling |
| A.M. J.H. | at cleaned_0index:0 | J.H., A.M. and A.W. Skevington | kind of midst-abbreviated spelling |
| J. C. R. | at cleaned_0index:2 | J. P. W. Hall & K. R. Willmott & J. C. R. and J. I. R. Willmott | kind of midst-abbreviated spelling |
| team | at cleaned_0index:1 | Lamas, Nihei and team | correctly cleaned |
| L. B. | at cleaned_0index:0 | L. B. and C. W. O' Brien | kind of midst-abbreviated spelling |
| J. H. | at cleaned_0index:0 | Leg. & J. H., A. W. and A. M. Skevington | mixed input, but kind of 3-fold abbreviated spelling |
| A. W. | at cleaned_0index:1 | Leg. & J. H., A. W. and A. M. Skevington | mixed input, but kind of 3-fold abbreviated spelling |
| L. E. L. | at cleaned_0index:0 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling |
| M. F. V. | at cleaned_0index:1 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling |
| M. F. V. Skeleton | at cleaned_0index:3 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling |
| M. B. | at cleaned_0index:2 | Male, WA & Fortescue R. & M. B. and B. J. Moulds | kind of midst-abbreviated spelling |
| Project | at cleaned_0index:2 | M. and C. Jaschhof & Project | correctly parsed and cleaned |
| McIlwrath Ra. | at cleaned_0index:0 | McIlwrath Ra. & G. and A. Daniels. R. Eastwood | kind of midst-abbreviated spelling |
| M. J. | at cleaned_0index:0 | M. J. and M. - L. Penrith | kind of abbreviated spelling |
| M. V. | at cleaned_0index:1 | M. V. Nabozhenko & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling |
| T. H. | at cleaned_0index:1 | New Zealand H. B. Mohi Bush Waimarama & T. H. and J. M. Davies & | kind of midst-abbreviated spelling |
| C. W. | at cleaned_0index:1 | N Metzquititlan & C. W. and L. O'Brien & G. Wibmer | kind of midst-abbreviated spelling |
| A. M. | at cleaned_0index:1 | Nullagine & A. M. and M. J. Douglas | kind of midst-abbreviated spelling |
| Pedro M. Ruiz-Carranza. Adult | at cleaned_0index:0 | Pedro M. Ruiz-Carranza. Adult male (ICN 19727), and adult female | mixed input, hence unable to resolve correctly |
| C. W. | at cleaned_0index:1 | P. N. Braulio Carrillo & C. W. and L. B. O'Brien | kind of midst-abbreviated spelling |
| the Mona Passage | at cleaned_0index:1 | Puerto Rico and the Mona Passage & Desecheo Is. | mixed input, hence unable to resolve correctly |
| Desecheo Is. | at cleaned_0index:2 | Puerto Rico and the Mona Passage & Desecheo Is. | mixed input, hence unable to resolve correctly |
| the Mona Passage | at cleaned_0index:1 | Puerto Rico and the Mona Passage & F. Fisk | mixed input, hence unable to resolve correctly |
| R. L. | at cleaned_0index:0 | R. L., and B. B. Brown | here the name separator seems as regex: /(,\s+\|,\s+and\s+)/ |
| C. W. | at cleaned_0index:1 | R. Mizuno & C. W. and L. B. O'Brien | kind of midst-abbreviated spelling |
| M. S. | at cleaned_0index:1 | Roper R. & M. S. and B. J. Moulds | kind of midst-abbreviated spelling |
| M. E. | at cleaned_0index:1 | S. San Pedro Sula & M. E. and P. D. Perkins | kind of midst-abbreviated spelling |
| S. Yu. | at cleaned_0index:0 | S. Yu. and N. V. Kuznetsov | kind of abbreviated spelling |
| M. V. | at cleaned_0index:2 | W Antakya, S & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling |
| M. S. | at cleaned_0index:1 | Wheeny Ck. & M. S. and B. J. Moulds & G. Williams & W. Newham & | kind of midst-abbreviated spelling |

dshorthouse commented 1 year ago

The example S. Yu. and N. V. Kuznetsov is a tricky one. Is that a misplaced period after a family name Yu or is it a pair of collectors that share the same family name Kuznetsov like [#<Name family="Kuznetsov" given="S. Yu.">, #<Name family="Kuznetsov" given="N. V.">]. I'd assume it's the former and not the latter, but to accommodate this is one of those examples where making an assumption will break accommodation elsewhere.

The example C. W. and L. O'Brien has been fixed in the working code with a spec test and will be released soon. The late Charlie O'Brien and his very much living wife Lois O'Brien would be pleased.

The rest of the mixed uses of &/and alongside family members with interlopers I fear will be too challenging to tackle. However, examples like J. H., A. M. and A. W. Skevington should be easy enough to accommodate (though rare). It's not at all common that three family members collect together, but Jeff Skevington and family might also be pleased if I can do this.
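One conservative way to think about the shared-family cases is sketched below in plain Ruby. It is only an illustration, not the code in dwc_agent: `expand_shared_family` and the three patterns are hypothetical, and the rule is deliberately strict (every leading part must consist solely of single-letter initials), so ambiguous inputs such as S. Yu. and N. V. Kuznetsov are left unresolved rather than guessed.

```ruby
# Hypothetical sketch: expand "J. H., A. M. and A. W. Skevington" into one
# entry per family member. Returns nil when the pattern is not clear-cut.
INITIALS_ONLY = /\A(?:\p{Lu}\.\s*)+\z/
LAST_MEMBER   = /\A((?:\p{Lu}\.\s*)+)(\p{Lu}[\p{L}'’\-]+)\z/
SPLITTER      = /\s*,\s*(?:and\s+)?|\s+and\s+/

def expand_shared_family(text)
  parts = text.strip.split(SPLITTER)
  return nil if parts.size < 2

  last = LAST_MEMBER.match(parts.last)   # initials followed by one family name
  leading = parts[0..-2]                 # everything before the last member
  return nil unless last && leading.all? { |part| INITIALS_ONLY.match?(part) }

  (leading + [last[1]]).map { |given| { given: given.strip, family: last[2] } }
end

p expand_shared_family("C. W. and L. O'Brien")
p expand_shared_family("J. H., A. M. and A. W. Skevington")
p expand_shared_family("S. Yu. and N. V. Kuznetsov")        # nil: "Yu." is not a plain initial
p expand_shared_family("Puerto Rico and the Mona Passage")  # nil: not a name pattern
```

Returning nil for anything that does not fit keeps the mixed locality/name strings out of the expansion, at the cost of missing some genuine cases.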

dshorthouse commented 1 year ago

Prematurely closed via a commit message.

infinite-dao commented 1 year ago

> The example S. Yu. and N. V. Kuznetsov is a tricky one. Is that a misplaced period after a family name Yu or is it a pair of collectors that share the same family name Kuznetsov like [#<Name family="Kuznetsov" given="S. Yu.">, #<Name family="Kuznetsov" given="N. V.">]. I'd assume it's the former and not the latter, but to accommodate this is one of those examples where making an assumption will break accommodation elsewhere.

Yes, true, it’s difficult. I’m not sure whether it’s the same person (because of the botanical data background), but here is one S. Yu. Kuznetsov as an example: https://www.researchgate.net/profile/S-Kuznetsov-3. So it is at least a real-life example.

infinite-dao commented 1 year ago

Here are some from dwcagent 3.0.12.0, with botanical names in the context of BGBM Berlin:

| related_parsed_name | after cleaning empty | source string | comment |
|---|---|---|---|
| F.A. Marschall v. Bieberstein | at cleaned_0index:0 | Bieberstein,F.A. Marschall v. | unexpectedly dropped from cleaning; is it a bug in dwcagent version 3.0.12.0? |
| N.H. Le | at cleaned_0index:2 | Bollendorff,S., Dang,T.T.H., Le,N.H., Nguyen,G.D., Raab-Straube,E.v. & Truong,B.V. | unexpectedly dropped related_parsed_name: «N.H. Le» should not be dropped on cleaning(?) |
| Goetzen Graf von | at cleaned_0index:0 | Goetzen Graf von & Maire,C. | unexpectedly dropped related_parsed_name; is it regarded as too general a name, like Princess of Bavaria or something? |
| J. Poel van de | at cleaned_0index:3 | Hammel,B., Chatrou,L., Pérez,I., Poel van de,J. & Wilschut,R. | unexpectedly dropped related_parsed_name |
| Heldreich De | at cleaned_0index:0 | Heldreich De | better curate source data |
| De | at cleaned_0index:1 | Heldreich De & Halácsy,E. von | better curate source data |
| I. B. | at cleaned_0index:0 | I. B. | better curate source data |
| L. B. | at cleaned_0index:0 | L. B. | better curate source data |
| von der | at cleaned_0index:1 | Marck,J.W.C.T., von der | is probably Johann Wilhelm Carl Theodor von der Marck |
| Martius. C.F.P. von | at cleaned_0index:0 | Martius. C.F.P. von (no. Herb. Fl. Bras. 483) | is probably Carl Friedrich Philipp von Martius but family name should be curated, not being abbreviated |
| D.B. Poindexter | at cleaned_0index:1 | Nelson,J.B. & Poindexter,D.B. | unexpectedly dropped related_parsed_name |
| August Leopold von | at cleaned_0index:1 | Reuss,August Leopold von & Reuss,A.L. von | difficult to guess the only two names concatenated by «&», probably a wrong parser interpretation; it also seems better to curate source data |
| Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes | I guess missing information in source data, better curate source data |
| Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes de | I guess missing information in source data, better curate source data |
| Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes,B.E.E. de | I guess missing information in source data, better curate source data |
| Winter de | at cleaned_0index:0 | Winter de & Hardy | I guess missing information in source data, better curate source data |
| Á E. | at cleaned_0index:7 | Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C. | as discussed previously, wrong input data at «Köhler, E. …», no way to guess it right from parser perspective |

infinite-dao commented 1 year ago

dwcagent "Galán,P. & Montenegro,S.M."
# []

… is unexpectedly parsed as empty in dwcagent 3.0.12.0

Checking it:

NAMELIST=(
  "Galán,P."
  "Montenegro,S.M."
  "Galán,P. & Montenegro,S.M."
)
# iterate over the array elements directly; no IFS manipulation needed
for text in "${NAMELIST[@]}"; do
  printf '\n-----------\ninput: %s\n' "$text"
  dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done

… we get:

-----------
input: Galán,P.
{"family":"Galán","given":"P."}

-----------
input: Montenegro,S.M.

-----------
input: Galán,P. & Montenegro,S.M.

dshorthouse commented 1 year ago

A few new updates now in v 3.0.13.0:

The Bieberstein,F.A. Marschall v. regression is fixed.

Bollendorff,S., Dang,T.T.H., Le,N.H., Nguyen,G.D., Raab-Straube,E.v. & Truong,B.V. now parses. Le was being cleaned too aggressively because of a blacklist entry.

Galán,P. & Montenegro,S.M. now parses. Montenegro was in a country blacklist.

Nelson,J.B. & Poindexter,D.B. now works. A greedy regex matched index inside Poindexter.
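The Montenegro case differs from a substring hit: the family name really is identical to a country name, so even a whole-word check would reject it. One possible guard, sketched here purely as an assumption about how such a list could be applied (the `COUNTRIES` sample and `country_only?` are hypothetical, not the gem's code), is to treat a term as a country only when it is the entire input.

```ruby
# Hypothetical sample; the real country list in dwc_agent is not shown in this thread.
COUNTRIES = %w[Montenegro Peru Chad].freeze

# Reject only when the whole string is a country name, not when the same
# word appears as a family name with initials attached.
def country_only?(text)
  COUNTRIES.any? { |country| text.strip.casecmp?(country) }
end

p country_only?("Montenegro")       # true
p country_only?("Montenegro,S.M.")  # false
p country_only?("Galán,P.")         # false
```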

infinite-dao commented 11 months ago

Hi David,

sometimes there are square brackets in the middle of a name, holding probable letters, e.g. Buto[m]a. When parsed with dwc_agent 3.0.13.0, this is split into parsed:Buto a and cleaned:Buto A, so an artificial first name is created. If the brackets were simply ignored, it would be better for Buto[m]a to come out as Butoa and not Buto A, or for Malaisse Matera;Wa[s]terlain to become Malaisse Matera<SEP>Waterlain, right?

The attachment contains many of these examples from Meise name data with square brackets, where they are in the name string itself:

I think a good strategy would be to remove the [...] when it sits directly inside a string, between letters, and to do so without inserting a space.
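The strategy could be sketched as a small pre-processing step in Ruby. This is an assumption about how it might look, not the change made in dwc_agent; `drop_in_word_brackets` is a hypothetical name. It removes a bracketed group only when a letter stands directly before and after it, and it inserts no space.

```ruby
# Hypothetical pre-processing step: delete "[...]" only when it is
# enclosed by letters on both sides, without inserting a space.
IN_WORD_BRACKETS = /(?<=\p{L})\[[^\[\]]*\](?=\p{L})/

def drop_in_word_brackets(text)
  text.gsub(IN_WORD_BRACKETS, "")
end

p drop_in_word_brackets("Buto[m]a")                      # "Butoa"
p drop_in_word_brackets("Malaisse Matera;Wa[s]terlain")  # "Malaisse Matera;Waterlain"
p drop_in_word_brackets("A. B[auy] [= Braun A.?]")       # unchanged: no letter follows "]"
```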

infinite-dao commented 11 months ago

We are getting there — almost :grin: … here with dwc_agent: 3.0.14.0 I found some unexpected results:

#!/bin/bash
NAMELIST=(
  "C. Be[n]ed. [St]enn"       # would expect Stenn
  "Abb. No[yr]ey, abb. Faure" # would expect Abb. Faure
  "Abbé F[r]. Hy in Ch. Flahault" # would expect dot in Fr.
  "An[s]on"                   # would expect Anson
  "Arm. A[n]spach"            # would expect dot to keep in Arm.
  "Attila Meste[r]há[zy]"     # would expect Mesterházy
  "Conzatti, [Holned] & Ordó[ñ]er" # would expect parsed: Conzatti<SEP>Holned<SEP>Ordóñer
  "Corn. C[oriz]e"            # would expect dot to keep
  "Dr. B[on]bier"             # would expect: Bonbier
  "Dr. [B] [B]oiglaender-[T]ekn[]" # would expect parsed: Dr. B Boiglaender-Tekn
  "G.[Char][][g][] in F. Crépin - Herbarium Rosarum" # would expect: G.Charg in F. Crépin
  "Ga[r]dama[uic][][te]" # would expect: Gardamauicte

  # nothing is parsed so far for the following:
  "R. Ba[renda]/J.Willemsen"  # (this one is parsed all right, the following are not)
  "R. Barendse/[J]. Willemse" # nothing parsed?
  "V.[S]. Nikitin - T.I Zhilenko" # nothing parsed?
  "L. [G][][trofi][]"         # would expect: L. Gtrofi
  "B.E. & J.[G]. Juniper"     # would expect parsed: B.E.<SEP>J.G. Juniper 
      # or interpreted parsed: B.E. Juniper<SEP>J.G. Juniper
      # is dwcagent "B.E." expected to be parsed or not, I guess not?
  "D. Ba[nsk], M. Mathais, R.E.W. jr" # nothing parsed?
  "C[o]l"                     # nothing parsed?
  "C.[R]. Gerard"             # nothing parsed?
  "Comm. K.[j]. Cameron"      # nothing parsed?
  "L.[G]. Ravaud"             # nothing parsed?
  "D.[f]. [G]i[lf]illa[n]"    # nothing parsed?
  "M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal" # nothing parsed?
  "N.S.[q]. in M.R. Clarke"   # nothing parsed?
)
# iterate over the array elements directly; no IFS manipulation needed
for text in "${NAMELIST[@]}"; do
  printf -- '-----------\ninput: %s\n' "$text"
  results=$(dwcagent "${text}")
  if [[ "${results}" == "[]" ]]; then
    echo "output: $results"
  else
    echo "$results" | jq -c '.[] | with_entries(select(.value |.!=null))'
  fi
done

-----------
input: C. Be[n]ed. [St]enn
{"family":"Enn","given":"C. Bened"}
-----------
input: Abb. No[yr]ey, abb. Faure
{"family":"Noyrey","given":"Abb"}
{"family":"Faure","given":"AbB."}
-----------
input: Abbé F[r]. Hy in Ch. Flahault
{"family":"Ch. Flahault","given":"Fr  Hy","particle":"in","title":"Abbé"}
-----------
input: An[s]on
{"family":"Ans"}
-----------
input: Arm. A[n]spach
{"family":"Anspach","given":"Arm"}
-----------
input: Attila Meste[r]há[zy]
{"family":"Mesterh","given":"Attila"}
-----------
input: Conzatti, [Holned] & Ordó[ñ]er
{"family":"Conzatti"}
{"family":"Er","given":"Ord"}
-----------
input: Corn. C[oriz]e
{"family":"Corize","given":"Corn"}
-----------
input: Dr. B[on]bier
{"family":"Bier","given":"B.","title":"Dr."}
-----------
input: Dr. [B] [B]oiglaender-[T]ekn[]
{"family":"Ekn","given":"Oiglaender","title":"Dr."}
-----------
input: G.[Char][][g][] in F. Crépin - Herbarium Rosarum
{"family":"F. Crépin","given":"G.","particle":"in"}
-----------
input: Ga[r]dama[uic][][te]
{"family":"Te","given":"Gardamauic"}
-----------
input: R. Ba[renda]/J.Willemsen
{"family":"Barenda","given":"R."}
{"family":"Willemsen","given":"J."}
-----------
input: R. Barendse/[J]. Willemse
output: []
-----------
input: V.[S]. Nikitin - T.I Zhilenko
output: []
-----------
input: L. [G][][trofi][]
output: []
-----------
input: B.E. & J.[G]. Juniper
output: []
-----------
input: D. Ba[nsk], M. Mathais, R.E.W. jr
output: []
-----------
input: C[o]l
output: []
-----------
input: C.[R]. Gerard
output: []
-----------
input: Comm. K.[j]. Cameron
output: []
-----------
input: L.[G]. Ravaud
output: []
-----------
input: D.[f]. [G]i[lf]illa[n]
output: []
-----------
input: M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal
output: []
-----------
input: N.S.[q]. in M.R. Clarke
output: []

dshorthouse commented 11 months ago

With the v.3.0.15.0 release, I think we're getting pretty close to accommodating many of the issues identified above. Some were intractable & some meant re-evaluating the strict spec tests.
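Many of the expectations in the script comments above amount to dropping the square brackets while keeping their contents. A minimal Ruby sketch of that reading follows; `unwrap_square_brackets` is a hypothetical helper for pre-processing input, and whether v3.0.15.0 does anything like this internally is not stated in this thread.

```ruby
# Hypothetical helper: remove "[" and "]" but keep what is inside them.
# Empty groups "[]" simply vanish.
def unwrap_square_brackets(text)
  text.gsub(/\[([^\[\]]*)\]/, '\1')
end

p unwrap_square_brackets("An[s]on")                          # "Anson"
p unwrap_square_brackets("Attila Meste[r]há[zy]")            # "Attila Mesterházy"
p unwrap_square_brackets("Ga[r]dama[uic][][te]")             # "Gardamauicte"
p unwrap_square_brackets("Dr. [B] [B]oiglaender-[T]ekn[]")   # "Dr. B Boiglaender-Tekn"
```

This would not suit editorial remarks such as [= Braun A.?], which are better removed entirely, so a real implementation would need to tell the two kinds of bracket apart.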

infinite-dao commented 10 months ago

Parsing of name particles

In 3.0.16.0 I parsed a real example with a comma and the same name rearranged without the comma, and the results differ. Should they not be the same? Using the dwcagent wrapper I get:

-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}

Reyna de Aguilar,M.L. is the actual example (e.g. from «Flores,J., Montalvo,E.A., Reyna de Aguilar,M.L. & Calderón,M.»), and M.L. Reyna de Aguilar is just the comma-separated version reversed, like so:

Reyna de Aguilar,M.L.
………………………………………… ____
  ↘             ↙
    ⋅         ↙
      ⋅     ↙
        ⋅ ↙
        ↙ ⋅
      ↙     ⋅
    ↙         ⋅
  ↙             ↘
____ …………………………………………
M.L. Reyna de Aguilar
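The swap in the diagram can be written as a tiny Ruby helper, which makes it easy to feed both forms of a name to the parser and compare the results. `invert_comma_form` is a hypothetical name used only for this sketch.

```ruby
# Hypothetical helper: turn "Family,Given" into "Given Family".
# Strings without a comma are returned unchanged (apart from trimming).
def invert_comma_form(name)
  family, given = name.split(",", 2).map(&:strip)
  return name.strip if given.nil? || given.empty?

  "#{given} #{family}"
end

p invert_comma_form("Reyna de Aguilar,M.L.")  # "M.L. Reyna de Aguilar"
p invert_comma_form("M.L. Reyna de Aguilar")  # unchanged
```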

Name particles are difficult to analyse (see for instance https://en.wikipedia.org/wiki/Nobiliary_particle or https://en.wikipedia.org/wiki/German_nobility#Nobiliary_particles). I think they can denote a place or, so to speak, another family name, e.g.

… so it is difficult to assign the name parts to either …

I think this cannot be solved 100%, because there are too many combinations; I just want to mention it here.

dshorthouse commented 10 months ago

Looks like your variously arranged names, particles and initials:

-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}

confuse the underlying Namae gem (https://github.com/berkmancenter/namae/issues), on which dwc_agent depends.

infinite-dao commented 10 months ago

By the way, brackets again, this time round parentheses (...) :grin:. In dwc_agent (3.0.16.0) all names in parentheses disappear. Is that the general idea, the concept, that all content in parentheses is removed?

Examples with parentheses that I found in WikiData, and that I thought of standardizing with dwc_agent as well, are the following:

-----------
input: Gustav Adolf Ferdinand Eichler (1835-1906)
{"family":"Eichler","given":"Gustav Adolf Ferdinand"}
-----------
input: Søren Sørensen (1873-1926)
{"family":"Sørensen","given":"Søren"}
-----------
input: Georges André (1888–1973)
{"family":"André","given":"Georges"}
-----------
input: Johannes Johannessen (1904-1990)
{"family":"Johannessen","given":"Johannes"}
-----------
input: Helge Buen (1918-2005)
{"family":"Buen","given":"Helge"}
-----------
input: Vlk Valenta (1925-2010)
{"family":"Valenta","given":"Vlk"}
-----------
input: Amalesh Choudhury (bot.)
{"family":"Choudhury","given":"Amalesh"}
-----------
input: Bror Pettersson (botaniker)
{"family":"Pettersson","given":"Bror"}
-----------
input: Robert W. Jones (botanist)
{"family":"Jones","given":"Robert W."}
-----------
input: Thomas Cooper (botanist)
{"family":"Cooper","given":"Thomas"}
-----------
input: Yi Huang (botanist-1)
{"family":"Huang","given":"Yi"}
-----------
input: William Vernon (c. 1666-1711)
{"family":"Vernon","given":"William"}
-----------
input: James Smith (diatomist)
{"family":"Smith","given":"James"}
-----------
input: Josep María Vidal(-Frigola)
{"family":"Vidal","given":"Josep María"}
-----------
input: István Balázs (instruisto)
{"family":"Balázs","given":"István"}
-----------
input: Hildur von Rettig (Lindberg)
{"family":"Rettig","given":"Hildur","particle":"von"}
-----------
input: Inger Kaasa (Magistad)
{"family":"Kaasa","given":"Inger"}
-----------
input: Kai Zhang (mycologist)
{"family":"Zhang","given":"Kai"}
-----------
input: Ting-Ting Zhang (mycologist)
{"family":"Zhang","given":"Ting-Ting"}
-----------
input: Robert J. Ferry (Sr.)
{"family":"Ferry","given":"Robert J."}
-----------
input: Phraya Wanpruekphichan (Thongkham Savetsila)
{"family":"Wanpruekphichan","given":"Phraya"}
-----------
input: Phraya Winitwanandon (To Komet)
{"family":"Winitwanandon","given":"Phraya"}
-----------
input: Maria Pavlovna Nagibina (Tsybulskaya)
{"family":"Nagibina","given":"Maria Pavlovna"}
-----------
input: O. Heylen (-Walraevens)
{"family":"Heylen","given":"O."}
-----------
input: Bill Kasongo (Wa Ngoy Kashiki)
{"family":"Kasongo","given":"Bill"}

I did not quite expect round parentheses to be removed entirely; I expected that some cases would be filtered out and others left in. One possible compromise is to add the content of the last parenthesis to the parsed family field if it does not contain dates or occupation terms like botanist and so on. Or should it be documented that all content in round parentheses is removed?
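The compromise could start with a classifier for the trailing parenthesis, sketched below in Ruby. Everything here is an assumption for illustration: `classify_parenthetical`, the three patterns and the short occupation list are hypothetical and only cover the examples listed above. Only content classified as `:name_candidate` would be considered for the family field.

```ruby
# Hypothetical patterns covering the examples in this comment only.
DATE_RE       = /\A(?:c\.\s*)?\d{4}\s*[-–]\s*\d{4}\z/
SUFFIX_RE     = /\A(?:Jr|Sr)\.?\z/i
OCCUPATION_RE = /\A(?:bot\.|botanist|botaniker|diatomist|instruisto|mycologist)(?:-\d+)?\z/i

# Classify the content of a parenthesis at the very end of the string.
def classify_parenthetical(name)
  content = name[/\(([^()]*)\)\s*\z/, 1]
  return :none if content.nil?

  case content.strip
  when DATE_RE       then :dates
  when SUFFIX_RE     then :suffix
  when OCCUPATION_RE then :occupation
  else :name_candidate
  end
end

p classify_parenthetical("William Vernon (c. 1666-1711)")  # :dates
p classify_parenthetical("Yi Huang (botanist-1)")          # :occupation
p classify_parenthetical("Robert J. Ferry (Sr.)")          # :suffix
p classify_parenthetical("Hildur von Rettig (Lindberg)")   # :name_candidate
```

A fixed occupation list will never be complete across languages, which supports the simpler rule of removing all parenthetical content and documenting it.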

dshorthouse commented 10 months ago

> By the way, brackets again, this time round parentheses (...) 😁. In dwc_agent (3.0.16.0) all names in parentheses disappear. Is that the general idea, the concept, that all content in parentheses is removed?

That was the general idea, yes. Most of your examples seem to support the rationale, though admittedly some appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expressions of family names / identity.

infinite-dao commented 10 months ago

> That was the general idea, yes. Most of your examples seem to support the rationale, though admittedly some appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expressions of family names / identity.

I completely agree with this. So perhaps leave it that way.


I have further analysed the botanist names from WikiData and divided up the cases that appear. Most of them are actually different versions of a name squeezed into one line, so to speak. From a data point of view, these would have to be corrected and the two or more names with the same meaning unravelled. An interesting case is the names with only one letter in the parentheses. Here are some examples of botanical names from WikiData:

Alan (C.) McKay
Carolyn [Caroline] (A.) Young
H.(Kh.) Karis
Kh.(Ch.) G. Kulieva
Yu.(Ju.) E. Petrov
D.(T.) O'Gorman
Robert J. Ferry (Jr.).
Tatiana (Yu.) Gagkaeva

Anton(i) Wróblewski
Dian Min(e) Chang
Jacob Frederic(k) Brenckle
…

Arthur C. (II) Grupe
Bruno E.C. (de) Miranda
Cun Ti (Di) Xiang
Davi Mesquita (de) Macedo
Julian(us) Hendrik Molkenboer
Manuel (de) Assunção Diniz
Marcos Antonio de (Jr) Morais
Marcus (de) Melo Teixeira
Priscila Sanjuan (de) Medeiros
You (Yu) Wen Tsui

Adolph(Adolf) Osterwalder
Chris(toffel) F.J. Spies
Constantine Demetry(Dmitriev) Sherbakoff
Dénes(Dionisie) Pázmány
Peng Yong(Yun) Zhang
Ze(Tse) Xiang Peng
Zu(Tsu) Tang Yin
…

Aloysius (Luigi) Meschinelli
Aldworth William (Tommy) Thompson
Tina (Antje) Hofmann
Wen (Wan) Jia Zhu
Yin Tong (Tang) Xie
…

cat Liste_ursprünglich_vermindert_4.txt | sed -r 's@^[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*$@3 (…) - &@; s@^[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*$@2 (…) - &@; s@^[^()]*\([^()]*\)[^()]*$@1 (…) - &@' | sort

1 (…) - ('Wilson') Sze Wing Wong
1 (…) - (Alexandre Alexis) George Le Monnier
1 (…) - (Bartolomeo Giacomo) Rinaldo Corradi
1 (…) - (Léon Marie Joseph) Gustave Nicolas
1 (…) - (Antonius) Theodoor Wegelin
1 (…) - (B.) Alfred Steinmann
1 (…) - (M.)F. Weyhe
1 (…) - (Mrs.) Leslie Hofer [Mrs. Vincent W.] Lanfear
1 (…) - (Q.)S.Y. Yeung
1 (…) - (Sister) Little Flower
1 (…) - (Wilhelm) William Winkler
1 (…) - A.H.G. ('Bert') Gerrits van den Ende
1 (…) - Albert (Jacob Josef) Vandevelde
1 (…) - August (François Marie Antoine) Tonglet
1 (…) - Amalesh Choudhury (bot.)
1 (…) - Bror Pettersson (botaniker)
1 (…) - István Balázs (instruisto)
1 (…) - James Smith (diatomist)
1 (…) - Robert W. Jones (botanist)
1 (…) - Ting-Ting Zhang (mycologist)
1 (…) - Yi Huang (botanist-1)
1 (…) - Analy Salles (de Azevedo) Melo
1 (…) - Bill Kasongo (Wa Ngoy Kashiki)
1 (…) - Franz August (`Friedrich') Müller
1 (…) - Friedrich (`Franz') Joseph Schelver
1 (…) - Frédéric-Edouard (`Fritz') Kampmann
1 (…) - Félix de Azara (1742-1821)
1 (…) - G. (of Nancy) Gardet
1 (…) - G.(of Bavaria) Gerber
1 (…) - G.I. (H.J.) Sëmina
1 (…) - Geoffrey S. ('Geoff') Hall
1 (…) - Georges André (1888–1973)
1 (…) - Gustav Adolf Ferdinand Eichler (1835-1906)
1 (…) - Giles E. (St J.) Hardy
1 (…) - H.(of Freiburg) Schmidt
1 (…) - H.J. (`Harry') Hudson
1 (…) - Ion(Ioan,Joan) C. Constantineanu
1 (…) - Jac (N.J.) Gelderblom
1 (…) - James Michael ('Jim') Miller
1 (…) - Jennifer Anne ('Jenny') Davidson
1 (…) - Robert J. Ferry (Sr.)
1 (…) - Robert Leroy ('Bob') Hanrahan
1 (…) - Ronald L. ('Ron') Exeter
1 (…) - Russell J. ('Rusty') Rodriguez
1 (…) - Rüdiger Felix (Ruggero Felice) Solla
1 (…) - Sandra L. ('Sandie') Baldauf
1 (…) - Saun-ichirô (Shun-ichirô) Imamura
1 (…) - Stig (Gunnar Anton) Waldheim
1 (…) - Stip (B.R.) Helleman
1 (…) - Søren Sørensen (1873-1926)
1 (…) - Thomas Cooper (botanist)
1 (…) - Vlk Valenta (1925-2010)
1 (…) - William Vernon (c. 1666-1711)
1 (…) - Yin Chan(Ch'An) Wu
1 (…) - Yvette Berenice ('Tivvy') Harvey

2 (…) - (Axel) Helge (Svensson) Stenar
2 (…) - (Carl) Julius (Adolf) Scharlock
2 (…) - Geoff(rey) (S.) Pegg
2 (…) - (Georg) Emil (Carl Christoph) Schuez
2 (…) - (J.A.A.)M.(H.) Goossens-Fontana
2 (…) - (Ludwig) Bernhard (Ehregott) Schmid
2 (…) - Matt(hew) (J.) Trappe
2 (…) - (Philippe) Victoire (Lévêque) de Vilmorin
2 (…) - (Theodor) Julius (Reinhold) von Schröder

3 (…) - (Johan) Fredrik(Friedrich) (Eberhard) Svanlund
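The same grouping by number of parenthesised parts can be sketched in Ruby, which may be easier to combine with the gem than a sed pipeline. This is only an illustration: the helper names `paren_group_count` and `label` are made up here, and the three sample names are taken from the listing above.

```ruby
# Count non-nested "(...)" groups in a name.
def paren_group_count(name)
  name.scan(/\([^()]*\)/).length
end

# Prefix a name with its group count, in the "N (…) - name" style of the listing.
def label(name)
  "#{paren_group_count(name)} (…) - #{name}"
end

names = [
  "(Johan) Fredrik(Friedrich) (Eberhard) Svanlund",
  "Geoffrey S. ('Geoff') Hall",
  "Matt(hew) (J.) Trappe"
]

# Sort by group count first, then alphabetically within each count.
names.sort_by { |n| [paren_group_count(n), n] }.each { |n| puts label(n) }
```

Unlike the sed expressions, this does not cap the count at three groups, and square brackets such as `[Mrs. Vincent W.]` are ignored in both approaches.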

infinite-dao commented 10 months ago

Names of rulers, nobles and so on.

In our botanical data (https://dr.jacq.org/DR014960) there are also names such as Friedrich August II.,König von Sachsen s.n., which are currently not processed quite right. Germany also has a standardised form for such names (see https://explore.gnd.network/en/gnd/118917218). Parsing this name in different variants gives the following results (dwc_agent 3.0.16.0):

-----------
input: Friedrich August II.
{"family":"August","given":"Friedrich","suffix":"II."}
-----------
input: Friedrich August II.,König von Sachsen
{"family":"August","given":"Friedrich","suffix":"II."}
{"family":"Sachsen","given":"König","particle":"von"}
-----------
input: Friedrich August II.,König von Sachsen s.n.
output: []

… the last case should output something, but does not :thinking: …

If you take the name „Friedrich August II,König von Sachsen“ strictly, you could read it like this:

{"family":"", "given": "August Friedrich", "suffix": "II.", "title": "König von Sachsen"}

or (if necessary)

{"family": "König von Sachsen", "given": "August Friedrich", "suffix": "II."}

... since August and Friedrich are actually first names in German

The same naming problem arises with e.g. „August der Starke“: https://explore.gnd.network/en/gnd/118505084 holds the standard data, and the name would be standardised as „August II, Polen, König“, which makes parsing very tricky. So in both cases there is no family name in the strict sense, is there? The same might apply to English names and so on.
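To make the "strict" reading above concrete, here is a minimal Ruby sketch that splits such a regnal name into given names, a Roman-numeral suffix and a title. It is not how dwc_agent works: the pattern `REGNAL`, the function `parse_regnal_name` and the `"title"` key are hypothetical, and it assumes that a trailing `s.n.` is a collector-number placeholder that can be dropped. The given names are kept in input order.

```ruby
# Hypothetical pattern: "<given names> <Roman numeral>[.],<title>[ s.n.]"
REGNAL = /\A(?<given>.+?)\s+(?<suffix>[IVX]+\.?)\s*,\s*(?<title>.+?)(?:\s+s\.n\.)?\z/

# Returns a hash in the shape proposed above, or nil when the input
# does not look like a regnal name.
def parse_regnal_name(input)
  m = REGNAL.match(input.strip)
  return nil unless m

  {
    "family" => "",
    "given"  => m[:given],
    "suffix" => m[:suffix],
    "title"  => m[:title]
  }
end

p parse_regnal_name("Friedrich August II.,König von Sachsen s.n.")
p parse_regnal_name("A. Charpin")
```

A pattern like this only covers the comma-separated form; a name such as „August der Starke“ has no numeral or comma and would still need an authority lookup, for example against GND.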