Open infinite-dao opened 1 year ago
Thanks for the great work on this. The `A. Charpin, P. Hainard & R. Salanon` case was an overly greedy use of 'anon' in a BLACKLIST, and that's now fixed. The variable placement of 'von' in the source should also be accommodated now in v3.0.9, just pushed. The `Álvarez de Zayas,A.` example requires some more investigation.
Thank you for taking on this issue.
Here are some more ;-)
`farmer` actually seems to be a name (perhaps more in the past than in the present)
related_parsed_name | after cleaning | source string | comment |
---|---|---|---|
L.E. Bureau | at cleaned_0index:0 | Bureau,L.E. | |
Émil Bureau | at cleaned_0index:0 | Bureau,Émil | |
E.R. Guaglianone | at cleaned_0index:2 | Burkart A., Troncoso,N.S., Guaglianone,E.R., Rotman,A., Botta,S. & Buck,H. | |
P. Classe | at cleaned_0index:0 | Classe,P. & Gebauer,R. | |
G. Classen | at cleaned_0index:0 | Classen,G. | |
R. Claßen | at cleaned_0index:0 | Claßen,R. & Hagemann,I. | |
J.B.L. Companyo | at cleaned_0index:0 | Companyo,J.B.L. | |
Farmer Braun | at cleaned_0index:0 | Farmer Braun | |
Theodor Magnus Fries | at cleaned_0index:0 | Fries,Theodor (Thore) Magnus & & al. | probably also faulty input |
Goetzen Graf von | at cleaned_0index:0 | Goetzen Graf von & Maire,C. | |
E.R. Guaglianone | at cleaned_0index:0 | Guaglianone,E.R. & Múlgura,M.E. | |
H u. S | at cleaned_0index:0 | H u. S | |
H. d. D. T. | at cleaned_0index:0 | H. d. D. T. | |
R. Claßen | at cleaned_0index:1 | Hagemann,I. & Claßen,R. | |
J. Poel van de | at cleaned_0index:3 | Hammel,B., Chatrou,L., Pérez,I., Poel van de,J. & Wilschut,R. | |
Heldreich De | at cleaned_0index:0 | Heldreich De | |
De | at cleaned_0index:1 | Heldreich De & Halácsy,E. von | |
Theodor Heinrich | at cleaned_0index:1 | Heldreich,Theodor Heinrich von | |
I. B. | at cleaned_0index:0 | I. B. | |
L. B. | at cleaned_0index:0 | L. B. | |
von der | at cleaned_0index:1 | Marck,J.W.C.T., von der | probably also faulty input |
Martius. C.F.P. von | at cleaned_0index:0 | Martius. C.F.P. von (no. Herb. Fl. Bras. 483) | |
D.B. Poindexter | at cleaned_0index:1 | Nelson,J.B. & Poindexter,D.B. | |
G. Classen | at cleaned_0index:1 | Raadts,E. & Classen,G. | |
August Leopold von | at cleaned_0index:1 | Reuss,August Leopold von & Reuss,A.L. von | |
Theodor Schube | at cleaned_0index:0 | Schube,Theodor | |
A. Senoner | at cleaned_0index:0 | Senoner,A. | |
D. Stafford | at cleaned_0index:0 | Stafford,D. | |
Theodor Strauss | at cleaned_0index:0 | Strauss,Theodor | |
Theusch | at cleaned_0index:0 | Theusch | |
Theusz | at cleaned_0index:0 | Theusz | |
Theusz | at cleaned_0index:0 | Theusz (no. 517) | |
Thevenon | at cleaned_0index:0 | Thevenon | |
Toward | at cleaned_0index:0 | Toward | |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes | |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes de | |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes,B.E.E. de | |
Winter de | at cleaned_0index:0 | Winter de & Hardy | |
Theodor Wolf | at cleaned_0index:0 | Wolf,Theodor | |
Here are some more names from Meise data:
related_parsed_name | after cleaning empty | source string | comment |
---|---|---|---|
J. B. | at cleaned_0index:0 | ? J. B. | |
F.M.C. V. | at cleaned_0index:0 | ?F.M.C. V. | |
Jean Malvaux sc. | at cleaned_0index:0 | ?Jean Malvaux sc. | |
B. K. | at cleaned_0index:1 | ?P.E.G. & B. K. | |
A F. | at cleaned_0index:0 | A F. | |
A W. | at cleaned_0index:0 | A W., Herb. François Crépin | |
A | at cleaned_0index:0 | A, i J. Kornasiowie | |
A. B | at cleaned_0index:0 | A. B | |
A. B | at cleaned_0index:0 | A. B[auy] [= Braun A.?] | |
P.M. A. | at cleaned_0index:2 | A. Besga, M.A. Domingo, A., P.M. & X. Uribe-Echebarría | |
A. Blytt in Musée Christiania in F. Crépin | at cleaned_0index:0 | A. Blytt in Musée Christiania in F. Crépin - Herbarium rosarum | OK, difficult: a collection name of some kind |
A. R | at cleaned_0index:0 | A. R[ac][] | |
A. R | at cleaned_0index:0 | A. R[aesdonk] | |
A. R | at cleaned_0index:0 | A. R[oc][] | |
ABaillot | at cleaned_0index:0 | ABaillot | perhaps very abbreviated, or mistake |
ABeck | at cleaned_0index:0 | ABeck | perhaps very abbreviated, or mistake |
ABecke | at cleaned_0index:0 | ABecke | perhaps very abbreviated, or mistake |
ABecre | at cleaned_0index:0 | ABecre | perhaps very abbreviated, or mistake |
ABeneschi | at cleaned_0index:0 | ABeneschi | perhaps very abbreviated, or mistake |
ABlol | at cleaned_0index:0 | ABlol | perhaps very abbreviated, or mistake |
ABunge | at cleaned_0index:0 | ABunge | perhaps very abbreviated, or mistake |
AHes | at cleaned_0index:0 | AHes | perhaps very abbreviated, or mistake |
AHlvay | at cleaned_0index:0 | AHlvay | perhaps very abbreviated, or mistake |
AKessemantt | at cleaned_0index:0 | AKessemantt | perhaps very abbreviated, or mistake |
AKranz | at cleaned_0index:0 | AKranz | perhaps very abbreviated, or mistake |
ALetourneux | at cleaned_0index:0 | ALetourneux | perhaps very abbreviated, or mistake |
ALinae | at cleaned_0index:0 | ALinae | perhaps very abbreviated, or mistake |
ALongo | at cleaned_0index:0 | ALongo | perhaps very abbreviated, or mistake |
ALux | at cleaned_0index:0 | ALux | perhaps very abbreviated, or mistake |
ANlemtin | at cleaned_0index:0 | ANlemtin | perhaps very abbreviated, or mistake |
ARothms | at cleaned_0index:0 | ARothms | perhaps very abbreviated, or mistake |
ARukimo | at cleaned_0index:0 | ARukimo | perhaps very abbreviated, or mistake |
ARusland | at cleaned_0index:0 | ARusland | perhaps very abbreviated, or mistake |
ASchumacher | at cleaned_0index:0 | ASchumacher | perhaps very abbreviated, or mistake |
ASecri | at cleaned_0index:0 | ASecri | perhaps very abbreviated, or mistake |
ASeegreg | at cleaned_0index:0 | ASeegreg | perhaps very abbreviated, or mistake |
AVDHangls | at cleaned_0index:0 | AVDHangls | perhaps very abbreviated, or mistake |
AValon | at cleaned_0index:0 | AValon | perhaps very abbreviated, or mistake |
AVirans | at cleaned_0index:0 | AVirans | perhaps very abbreviated, or mistake |
AViranz | at cleaned_0index:0 | AViranz | perhaps very abbreviated, or mistake |
AVranz | at cleaned_0index:0 | AVranz | perhaps very abbreviated, or mistake |
AWal | at cleaned_0index:0 | AWal[raven] | perhaps very abbreviated, or mistake |
AWalcana | at cleaned_0index:0 | AWalcana | perhaps very abbreviated, or mistake |
AWariouy | at cleaned_0index:0 | AWariouy | perhaps very abbreviated, or mistake |
AWeatherby | at cleaned_0index:0 | AWeatherby | perhaps very abbreviated, or mistake |
D. | at cleaned_0index:0 | Abbé D. [Duyany] | |
P | at cleaned_0index:0 | Abbé P[]ant | |
Ad de | at cleaned_0index:0 | Ad de [Genssen] | |
Farmer D. | at cleaned_0index:2 | Akeroyd J., Brookes S., Farmer D. & Jury S. | |
Arnaud Ch. | at cleaned_0index:0 | Arnaud Ch. | |
Arnold Cl. | at cleaned_0index:0 | Arnold Cl. | |
Arnold Fr. | at cleaned_0index:0 | Arnold Fr. | |
Becker J. Ph. | at cleaned_0index:0 | Becker J.Ph. | |
Bell C.R | at cleaned_0index:0 | Bell C.R & Bell S.F. | |
Bequet A.B | at cleaned_0index:0 | Bequet A.B | |
Bertolani Ch. | at cleaned_0index:0 | Bertolani (o) Ch. | |
Bidgood S. at al. | at cleaned_0index:0 | Bidgood S. at al. | |
Binder Em. | at cleaned_0index:0 | Binder Em. | |
M.L. Blickenstaff | at cleaned_0index:0 | Blickenstaff M.L. | |
Bouillenne Cl. | at cleaned_0index:0 | Bouillenne Cl. | |
Granville J.J. De | at cleaned_0index:0 | Granville J.J. De, Acevedo P., Boyer A., Hollenberg L. | |
Reichgelt Th. | at cleaned_0index:1 | van Ooststroom S.J. & Reichgelt Th. | |
Yes, it is getting better, very good.
With dwcagent version 3.0.11.0 I parsed names again, and some contain two sets of given-name initials joined by “and” but only one family name, a kind of abbreviated spelling. For example, «R. Mizuno & C. W. and L. B. O'Brien» could be resolved as «R. Mizuno & C. W. O'Brien & L. B. O'Brien». This is difficult, though, because there are other “and” concatenations, as in «Puerto Rico and the Mona Passage & Desecheo Is.», which seems to be mixed input and hence cannot be resolved correctly.
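The resolution suggested above can be sketched as a pre-processing step that runs before the string reaches the parser. This is only an illustration: `expand_shared_family` is a hypothetical helper, not part of dwcagent, and it assumes that given names are written as dotted single-letter initials and that names are separated by ` & `.

```shell
# Hypothetical helper (not part of dwcagent): rewrites "X. and Y. Family"
# into "X. Family & Y. Family" inside an "&"-separated name list.
expand_shared_family() {
  # One or more dotted single-letter initials, e.g. "C. W." or "M."
  init='[[:upper:]]\.( ?[[:upper:]]\.)*'
  # A family name without "&" or ",", not starting or ending with a space
  family='[^ &,][^&,]*[^ &,]'
  printf '%s\n' "$1" |
    sed -E "s/(^|& )(${init}) and (${init}) (${family})/\1\2 \6 \& \4 \6/g"
}

expand_shared_family "R. Mizuno & C. W. and L. B. O'Brien"
# prints: R. Mizuno & C. W. O'Brien & L. B. O'Brien
```

Strings such as «Puerto Rico and the Mona Passage & Desecheo Is.» and «S. Yu. and N. V. Kuznetsov» do not match the initials pattern and pass through unchanged, so the sketch deliberately leaves the ambiguous cases alone.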
Here are two cases: the first is parsed, cleaned and interpreted as I would expect; in the second, one name is unexpectedly dropped:
NAMELIST=(
"M. and C. Jaschhof & Project"
"S. Yu. and N. V. Kuznetsov"
)
# Quoting "${NAMELIST[@]}" keeps each name as one word, so IFS can stay untouched.
for text in "${NAMELIST[@]}"; do
echo "input: $text"
# Print every parsed name as compact JSON, omitting null fields.
dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done
… gives:
input: M. and C. Jaschhof & Project
{"family":"Jaschhof","given":"M."}
{"family":"Jaschhof","given":"C."}
input: S. Yu. and N. V. Kuznetsov
{"family":"Kuznetsov","given":"N.V."}
Here are some “and” concatenations. I hope the comments make sense (by “midst-abbreviated spelling” I mean an abbreviated spelling in the middle of the name list):
related_parsed_name | after cleaning empty | source string | comment |
---|---|---|---|
H. W. | at cleaned_0index:5 | Baltic Amber & Prussian Fm. & C. and H. W. Hoffeins & C. & H. W. | kind of midst-abbreviated spelling |
M. S. | at cleaned_0index:2 | Blackdown Tableland & Expedition Rg. & M. S. and B. J. Moulds | kind of midst-abbreviated spelling |
R. H. | at cleaned_0index:2 | C. Liang & W. LaBerge & R. H. and L. D. Beamer | kind of midst-abbreviated spelling |
C. W. | at cleaned_0index:0 | C. W. and L. B. O’Brien | kind of midst-abbreviated spelling |
C. W. | at cleaned_0index:0 | C. W. and L. B. O'Brien & R. Mizuno | kind of midst-abbreviated spelling |
C. W. | at cleaned_0index:0 | C. W. and L. O'Brien | kind of midst-abbreviated spelling |
C. W. | at cleaned_0index:0 | C. W. and L. O'Brien & G. Wibmer | kind of midst-abbreviated spelling |
D.T. Le | at cleaned_0index:0 | D.T. Le, D.T. Truong, H.Q. Nguyen, N.H. Nguyen, and A.N. Nguyen | here the name separator seems to be the regex: /(,\s+\|,\s+and\s+)/ |
M. V. | at cleaned_0index:1 | E. A. Yagmur & M. V. and S. V. Nabozhenko & B. Keskin & I. Chigr | kind of midst-abbreviated spelling |
No | at cleaned_0index:2 | F. W. and S. K. Gess & No | kind of midst-abbreviated spelling |
M. V. | at cleaned_0index:1 | Islahiye District, W & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling |
team | at cleaned_0index:1 | J. A McGuire and team | correctly cleaned |
R. J. B. Rockshop field crew | at cleaned_0index:1 | Japh Boyce and R. J. B. Rockshop field crew | correctly cleaned, one could use the non-cleaned parsed result |
J. Cl. | at cleaned_0index:0 | J. Cl. and P. Gauthier | kind of midst-abbreviated spelling |
J. H. | at cleaned_0index:0 | J. H., A. M. and A. W. Skevington | kind of midst-abbreviated spelling |
A. M. | at cleaned_0index:1 | J. H., A. M. and A. W. Skevington | kind of midst-abbreviated spelling |
A.M. J.H. | at cleaned_0index:0 | J.H., A.M. and A.W. Skevington | kind of midst-abbreviated spelling |
J. C. R. | at cleaned_0index:2 | J. P. W. Hall & K. R. Willmott & J. C. R. and J. I. R. Willmott | kind of midst-abbreviated spelling |
team | at cleaned_0index:1 | Lamas, Nihei and team | correctly cleaned |
L. B. | at cleaned_0index:0 | L. B. and C. W. O' Brien | kind of midst-abbreviated spelling |
J. H. | at cleaned_0index:0 | Leg. & J. H., A. W. and A. M. Skevington | mixed input, but kind of 3-fold abbreviated spelling |
A. W. | at cleaned_0index:1 | Leg. & J. H., A. W. and A. M. Skevington | mixed input, but kind of 3-fold abbreviated spelling |
L. E. L. | at cleaned_0index:0 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling |
M. F. V. | at cleaned_0index:1 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling |
M. F. V. Skeleton | at cleaned_0index:3 | L. E. L., M. F. V. and S. D'Angelo Neto & M. F. V. Skeleton | kind of midst-abbreviated spelling |
M. B. | at cleaned_0index:2 | Male, WA & Fortescue R. & M. B. and B. J. Moulds | kind of midst-abbreviated spelling |
Project | at cleaned_0index:2 | M. and C. Jaschhof & Project | correctly parsed and cleaned |
McIlwrath Ra. | at cleaned_0index:0 | McIlwrath Ra. & G. and A. Daniels. R. Eastwood | kind of midst-abbreviated spelling |
M. J. | at cleaned_0index:0 | M. J. and M. - L. Penrith | kind of abbreviated spelling |
M. V. | at cleaned_0index:1 | M. V. Nabozhenko & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling |
T. H. | at cleaned_0index:1 | New Zealand H. B. Mohi Bush Waimarama & T. H. and J. M. Davies & | kind of midst-abbreviated spelling |
C. W. | at cleaned_0index:1 | N Metzquititlan & C. W. and L. O'Brien & G. Wibmer | kind of midst-abbreviated spelling |
A. M. | at cleaned_0index:1 | Nullagine & A. M. and M. J. Douglas | kind of midst-abbreviated spelling |
Pedro M. Ruiz-Carranza. Adult | at cleaned_0index:0 | Pedro M. Ruiz-Carranza. Adult male (ICN 19727), and adult female | mixed input, hence unable to resolve correctly |
C. W. | at cleaned_0index:1 | P. N. Braulio Carrillo & C. W. and L. B. O'Brien | kind of midst-abbreviated spelling |
the Mona Passage | at cleaned_0index:1 | Puerto Rico and the Mona Passage & Desecheo Is. | mixed input, hence unable to resolve correctly |
Desecheo Is. | at cleaned_0index:2 | Puerto Rico and the Mona Passage & Desecheo Is. | mixed input, hence unable to resolve correctly |
the Mona Passage | at cleaned_0index:1 | Puerto Rico and the Mona Passage & F. Fisk | mixed input, hence unable to resolve correctly |
R. L. | at cleaned_0index:0 | R. L., and B. B. Brown | here the name separator seems to be the regex: /(,\s+\|,\s+and\s+)/ |
C. W. | at cleaned_0index:1 | R. Mizuno & C. W. and L. B. O'Brien | kind of midst-abbreviated spelling |
M. S. | at cleaned_0index:1 | Roper R. & M. S. and B. J. Moulds | kind of midst-abbreviated spelling |
M. E. | at cleaned_0index:1 | S. San Pedro Sula & M. E. and P. D. Perkins | kind of midst-abbreviated spelling |
S. Yu. | at cleaned_0index:0 | S. Yu. and N. V. Kuznetsov | kind of abbreviated spelling |
M. V. | at cleaned_0index:2 | W Antakya, S & M. V. and S. V. Nabozhenko & B. Keskin | kind of midst-abbreviated spelling |
M. S. | at cleaned_0index:1 | Wheeny Ck. & M. S. and B. J. Moulds & G. Williams & W. Newham & | kind of midst-abbreviated spelling |
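For the two rows above whose comment mentions the separator regex (comma plus whitespace, optionally followed by “and”), a normalisation to ` & ` could look like the following. It is a sketch under assumptions: `normalize_separators` is a hypothetical helper, and it is only safe on strings where a comma followed by a space really separates people, which is not true for every row in the table.

```shell
# Hypothetical helper: turns ", " and ", and " into " & ".
# "Family,Given" forms without a space after the comma are left untouched.
normalize_separators() {
  printf '%s\n' "$1" | sed -E 's/,[[:space:]]+(and[[:space:]]+)?/ \& /g'
}

normalize_separators 'D.T. Le, D.T. Truong, H.Q. Nguyen, N.H. Nguyen, and A.N. Nguyen'
# prints: D.T. Le & D.T. Truong & H.Q. Nguyen & N.H. Nguyen & A.N. Nguyen
```

Applied blindly it would also rewrite mixed input such as «Islahiye District, W & …», so it would need to be restricted to sources known to use this list style.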
The example `S. Yu. and N. V. Kuznetsov` is a tricky one. Is that a misplaced period after a family name `Yu`, or is it a pair of collectors who share the family name `Kuznetsov`, like `[#<Name family="Kuznetsov" given="S. Yu.">, #<Name family="Kuznetsov" given="N. V.">]`? I'd assume it's the former and not the latter, but this is one of those examples where making an assumption to accommodate it will break accommodation elsewhere.
The example `C. W. and L. O'Brien` has been fixed in the working code with a spec test and will be released soon. The late Charlie O'Brien and his very much living wife Lois O'Brien would be pleased.
The rest of the mixed uses of &/and alongside family members with interlopers will, I fear, be too challenging to tackle. However, examples like `J. H., A. M. and A. W. Skevington` should be easy enough to accommodate (though rare). It's not at all common that three family members collect together, but Jeff Skevington and family might also be pleased if I can do this.
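To make the three-member case concrete, here is a minimal sketch of how such a string could be expanded before parsing. It does not reflect how dwc_agent handles it; `expand_three_members` is a hypothetical helper that assumes dotted single-letter initials and that the whole string consists of exactly this pattern.

```shell
# Hypothetical helper: "X., Y. and Z. Family" -> three names sharing Family.
expand_three_members() {
  # One or more dotted single-letter initials, with or without spaces
  init='[[:upper:]]\.( ?[[:upper:]]\.)*'
  printf '%s\n' "$1" |
    sed -E "s/^(${init}), (${init}) and (${init}) ([^ &,][^&,]*)\$/\1 \7 \& \3 \7 \& \5 \7/"
}

expand_three_members 'J. H., A. M. and A. W. Skevington'
# prints: J. H. Skevington & A. M. Skevington & A. W. Skevington
```

Because the pattern is anchored to the whole string, mixed input such as «Leg. & J. H., A. W. and A. M. Skevington» is left as it is.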
Prematurely closed via a commit message.
> The example `S. Yu. and N. V. Kuznetsov` is a tricky one. Is that a misplaced period after a family name `Yu` or is it a pair of collectors that share the same family name `Kuznetsov` like `[#<Name family="Kuznetsov" given="S. Yu.">, #<Name family="Kuznetsov" given="N. V.">]`. I'd assume it's the former and not the latter, but to accommodate this is one of those examples where making an assumption will break accommodation elsewhere.
Yes, true, it's difficult. I'm not sure whether it is the same person (given the botanical data background), but there is one S. Yu. Kuznetsov here as an example: https://www.researchgate.net/profile/S-Kuznetsov-3. So there is at least one real-life example.
Here are some from dwcagent 3.0.12.0, with botanical names in the context of BGBM Berlin:
related_parsed_name | after cleaning empty | source string | comment |
---|---|---|---|
F.A. Marschall v. Bieberstein | at cleaned_0index:0 | Bieberstein,F.A. Marschall v. | unexpectedly dropped on cleaning; is it a bug in dwcagent version 3.0.12.0? |
N.H. Le | at cleaned_0index:2 | Bollendorff,S., Dang,T.T.H., Le,N.H., Nguyen,G.D., Raab-Straube,E.v. & Truong,B.V. | unexpectedly dropped related_parsed_name: shouldn't «N.H. Le» be kept on cleaning? |
Goetzen Graf von | at cleaned_0index:0 | Goetzen Graf von & Maire,C. | unexpectedly dropped related_parsed_name; is it regarded as too general a name, like Princess of Bavaria or something? |
J. Poel van de | at cleaned_0index:3 | Hammel,B., Chatrou,L., Pérez,I., Poel van de,J. & Wilschut,R. | unexpectedly dropped related_parsed_name |
Heldreich De | at cleaned_0index:0 | Heldreich De | better curate source data |
De | at cleaned_0index:1 | Heldreich De & Halácsy,E. von | better curate source data |
I. B. | at cleaned_0index:0 | I. B. | better curate source data |
L. B. | at cleaned_0index:0 | L. B. | better curate source data |
von der | at cleaned_0index:1 | Marck,J.W.C.T., von der | is probably Johann Wilhelm Carl Theodor von der Marck |
Martius. C.F.P. von | at cleaned_0index:0 | Martius. C.F.P. von (no. Herb. Fl. Bras. 483) | is probably Carl Friedrich Philipp von Martius, but the family name should be curated so that it is not abbreviated |
D.B. Poindexter | at cleaned_0index:1 | Nelson,J.B. & Poindexter,D.B. | unexpectedly dropped related_parsed_name |
August Leopold von | at cleaned_0index:1 | Reuss,August Leopold von & Reuss,A.L. von | difficult to guess with only two names concatenated by «&», probably a wrong parser interpretation; it also seems better to curate the source data |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes | I guess missing information in source data, better curate source data |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes de | I guess missing information in source data, better curate source data |
Wilde de | at cleaned_0index:0 | Wilde de & Wilde-Duyfjes,B.E.E. de | I guess missing information in source data, better curate source data |
Winter de | at cleaned_0index:0 | Winter de & Hardy | I guess missing information in source data, better curate source data |
Á E. | at cleaned_0index:7 | Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C. | as discussed previously, wrong input data at «Köhler, E. …»; no way to guess it right from the parser's perspective |
dwcagent "Galán,P. & Montenegro,S.M."
# []
… gets parsed to be unexpectedly empty in dwcagent 3.0.12.0
Checking it:
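NAMELIST=(
"Galán,P."
"Montenegro,S.M."
"Galán,P. & Montenegro,S.M."
)
# Quoting "${NAMELIST[@]}" keeps each name as one word, so IFS can stay untouched.
for text in "${NAMELIST[@]}"; do
printf '\n%s\ninput: %s\n' '-----------' "$text"
# Print every parsed name as compact JSON, omitting null fields.
dwcagent "${text}" | jq -c '.[] | with_entries(select(.value |.!=null))'
done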
… we get:
-----------
input: Galán,P.
{"family":"Galán","given":"P."}
-----------
input: Montenegro,S.M.
-----------
input: Galán,P. & Montenegro,S.M.
A few new updates, now in v3.0.13.0:
The `Bieberstein,F.A. Marschall v.` regression is fixed.
`Bollendorff,S., Dang,T.T.H., Le,N.H., Nguyen,G.D., Raab-Straube,E.v. & Truong,B.V.` now parses. The `Le` was aggressively cleaned by a blacklist.
`Galán,P. & Montenegro,S.M.` now parses. `Montenegro` was in a country blacklist.
`Nelson,J.B. & Poindexter,D.B.` now works. The greedy regex found `index` in `Poindexter`.
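The `index` in `Poindexter` case (like the earlier `anon` in `Salanon` one) is the difference between substring and whole-word matching. The sketch below only illustrates that difference with `grep`; the two-term list and the function names are hypothetical, and the gem's real blacklist code is not shown in this thread.

```shell
# Hypothetical two-term blacklist, one term per line.
BLACKLIST='anon
index'

# Substring match: flags any name that merely contains a blacklisted term.
matches_substring() {
  printf '%s\n' "$1" | grep -q -i -F -e "$BLACKLIST"
}

# Whole-word match: flags a name only when a term stands alone as a word.
matches_word() {
  printf '%s\n' "$1" | grep -q -i -w -F -e "$BLACKLIST"
}

matches_substring 'Poindexter' && echo 'substring: Poindexter is flagged'
matches_word 'Poindexter' || echo 'whole word: Poindexter is kept'
```

Whole-word matching would not have helped with `Le` or `Montenegro`, which are complete words; those need the term removed from the list or extra context.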
Hi David,
sometimes there are square brackets in the middle of a name, enclosing probable letters, e.g. `Buto[m]a`. When this is parsed in dwc_agent 3.0.13.0, it is split into parsed: `Buto a` and cleaned: `Buto A`, so an artificial first name is created in this case. It would be better if the brackets were simply ignored, so that `Buto[m]a` came out as `Butoa` and not `Buto A`, and `Malaisse Matera;Wa[s]terlain` as `Malaisse Matera<SEP>Waterlain`, right?
The attachment contains many of these examples from Meise name data with square brackets, where they are in the name string itself:
I think a good strategy would be to remove the `[...]` when it sits directly inside a string, between letters, and to do so without inserting a space.
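As a rough illustration of that strategy, the following pre-processing sketch drops a `[...]` group, including its content, only when it is enclosed by letters on both sides. `drop_inline_brackets` is a hypothetical helper, not a dwc_agent function, and matching accented letters with `[[:alpha:]]` assumes a UTF-8 locale.

```shell
# Hypothetical helper: removes "[...]" groups that sit between two letters,
# joining the neighbouring letters without inserting a space.
drop_inline_brackets() {
  printf '%s\n' "$1" |
    sed -E -e ':a' -e 's/([[:alpha:]])\[[^][]*\]([[:alpha:]])/\1\2/' -e 'ta'
}

drop_inline_brackets 'Buto[m]a'
# prints: Butoa
drop_inline_brackets 'Malaisse Matera;Wa[s]terlain'
# prints: Malaisse Matera;Waterlain
```

The loop (`:a` … `ta`) repeats the substitution so that several bracket groups within one word are all handled; brackets at the start or end of a word are left for the parser.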
We are getting there — almost :grin: … here with dwc_agent: 3.0.14.0 I found some unexpected results:
An empty `[]` is difficult to interpret; perhaps just remove it completely and join the adjacent characters.
#!/bin/bash
NAMELIST=(
"C. Be[n]ed. [St]enn" # would expect Stenn
"Abb. No[yr]ey, abb. Faure" # would expect Abb. Faure
"Abbé F[r]. Hy in Ch. Flahault" # would expect dot in Fr.
"An[s]on" # would expect Anson
"Arm. A[n]spach" # would expect dot to keep in Arm.
"Attila Meste[r]há[zy]" # would expect Mesterházy
"Conzatti, [Holned] & Ordó[ñ]er" # would expect parsed: Conzatti<SEP>Holned<SEP>Ordóñer
"Corn. C[oriz]e" # would expect dot to keep
"Dr. B[on]bier" # would expect: Bonbier
"Dr. [B] [B]oiglaender-[T]ekn[]" # would expect parsed: Dr. B Boiglaender-Tekn
"G.[Char][][g][] in F. Crépin - Herbarium Rosarum" # would expect: G.Charg in F. Crépin
"Ga[r]dama[uic][][te]" # would expect: Gardamauicte
# the following yield no parse result so far:
"R. Ba[renda]/J.Willemsen" # this one is parsed all right, the following are not
"R. Barendse/[J]. Willemse" # nothing parsed?
"V.[S]. Nikitin - T.I Zhilenko" # nothing parsed?
"L. [G][][trofi][]" # would expect: L. Gtrofi
"B.E. & J.[G]. Juniper" # would expect parsed: B.E.<SEP>J.G. Juniper
# or interpreted parsed: B.E. Juniper<SEP>J.G. Juniper
# is dwcagent "B.E." expected to be parsed or not, I guess not?
"D. Ba[nsk], M. Mathais, R.E.W. jr" # nothing parsed?
"C[o]l" # nothing parsed?
"C.[R]. Gerard" # nothing parsed?
"Comm. K.[j]. Cameron" # nothing parsed?
"L.[G]. Ravaud" # nothing parsed?
"D.[f]. [G]i[lf]illa[n]" # nothing parsed?
"M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal" # nothing parsed?
"N.S.[q]. in M.R. Clarke" # nothing parsed?
)
# Quoting "${NAMELIST[@]}" keeps each name as one word, so IFS can stay untouched.
for text in "${NAMELIST[@]}"; do
printf '%s\ninput: %s\n' '-----------' "$text"
results=$(dwcagent "${text}")
# An empty JSON array means that nothing was parsed.
if [[ "${results}" == "[]" ]]; then
echo "output: $results"
else
echo "$results" | jq -c '.[] | with_entries(select(.value |.!=null))'
fi
done
-----------
input: C. Be[n]ed. [St]enn
{"family":"Enn","given":"C. Bened"}
-----------
input: Abb. No[yr]ey, abb. Faure
{"family":"Noyrey","given":"Abb"}
{"family":"Faure","given":"AbB."}
-----------
input: Abbé F[r]. Hy in Ch. Flahault
{"family":"Ch. Flahault","given":"Fr Hy","particle":"in","title":"Abbé"}
-----------
input: An[s]on
{"family":"Ans"}
-----------
input: Arm. A[n]spach
{"family":"Anspach","given":"Arm"}
-----------
input: Attila Meste[r]há[zy]
{"family":"Mesterh","given":"Attila"}
-----------
input: Conzatti, [Holned] & Ordó[ñ]er
{"family":"Conzatti"}
{"family":"Er","given":"Ord"}
-----------
input: Corn. C[oriz]e
{"family":"Corize","given":"Corn"}
-----------
input: Dr. B[on]bier
{"family":"Bier","given":"B.","title":"Dr."}
-----------
input: Dr. [B] [B]oiglaender-[T]ekn[]
{"family":"Ekn","given":"Oiglaender","title":"Dr."}
-----------
input: G.[Char][][g][] in F. Crépin - Herbarium Rosarum
{"family":"F. Crépin","given":"G.","particle":"in"}
-----------
input: Ga[r]dama[uic][][te]
{"family":"Te","given":"Gardamauic"}
-----------
input: R. Ba[renda]/J.Willemsen
{"family":"Barenda","given":"R."}
{"family":"Willemsen","given":"J."}
-----------
input: R. Barendse/[J]. Willemse
output: []
-----------
input: V.[S]. Nikitin - T.I Zhilenko
output: []
-----------
input: L. [G][][trofi][]
output: []
-----------
input: B.E. & J.[G]. Juniper
output: []
-----------
input: D. Ba[nsk], M. Mathais, R.E.W. jr
output: []
-----------
input: C[o]l
output: []
-----------
input: C.[R]. Gerard
output: []
-----------
input: Comm. K.[j]. Cameron
output: []
-----------
input: L.[G]. Ravaud
output: []
-----------
input: D.[f]. [G]i[lf]illa[n]
output: []
-----------
input: M. Mayor, J. A . Fdez. P[rie]to, C. Fdez. Carvujal
output: []
-----------
input: N.S.[q]. in M.R. Clarke
output: []
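The expectations noted in the comments of the name list above (keep the bracketed letters, drop an empty `[]`, join everything without a space) could be sketched as a pre-processing step like this. `join_bracketed_letters` is a hypothetical helper and not how dwc_agent treats brackets; it would also join alternative readings such as `[= Braun A.?]`, which may not be wanted.

```shell
# Hypothetical helper: keeps the content of every "[...]" group, drops the
# brackets themselves (an empty "[]" vanishes) and inserts no spaces.
join_bracketed_letters() {
  printf '%s\n' "$1" | sed -E 's/\[([^][]*)\]/\1/g'
}

join_bracketed_letters 'Ga[r]dama[uic][][te]'
# prints: Gardamauicte
join_bracketed_letters 'Dr. [B] [B]oiglaender-[T]ekn[]'
# prints: Dr. B Boiglaender-Tekn
```

Running the result through `dwcagent` afterwards would show whether inputs that currently return `[]`, such as `C.[R]. Gerard`, parse once the brackets are gone; that has not been verified here.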
With the v.3.0.15.0 release, I think we're getting pretty close to accommodating many of the issues identified above. Some were intractable & some meant re-evaluating the strict spec tests.
In 3.0.16.0 I parsed an example containing a comma (a real, given example) and the same name rearranged without the comma. The results differ; should they not be the same? Using the dwcagent wrapper I get:
-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}
… `Reyna de Aguilar,M.L.` is the actual example (e.g. from «Flores,J., Montalvo,E.A., Reyna de Aguilar,M.L. & Calderón,M.»), and `M.L. Reyna de Aguilar` is just the comma-separated version reversed, like so:
Reyna de Aguilar,M.L.
………………………………………… ____
↘ ↙
⋅ ↙
⋅ ↙
⋅ ↙
↙ ⋅
↙ ⋅
↙ ⋅
↙ ↘
____ …………………………………………
M.L. Reyna de Aguilar
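The reversal drawn above can be written as a one-line transformation. `to_given_first` is a hypothetical helper used here only to show how the second input was derived from the first; it does not change how dwcagent interprets either form.

```shell
# Hypothetical helper: "Family,Given" -> "Given Family" for a single name.
# Strings without exactly one comma are passed through unchanged.
to_given_first() {
  printf '%s\n' "$1" | sed -E 's/^([^,]+),[[:space:]]*([^,]+)$/\2 \1/'
}

to_given_first 'Reyna de Aguilar,M.L.'
# prints: M.L. Reyna de Aguilar
```

After the reversal nothing marks where the family name starts, which is presumably why the parser treats `Reyna` as part of the given name in the second case.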
It is difficult to analyse name particles (see, for instance, https://en.wikipedia.org/wiki/Nobiliary_particle or https://en.wikipedia.org/wiki/German_nobility#Nobiliary_particles); I think they can describe a place or, so to speak, another family name, e.g.
… so it is difficult to assign the name parts to either …
I think this is not 100% solvable because there are too many combinations; I just want to mention it here.
Looks like your variously arranged names, particles and initials:
-----------
input: Reyna de Aguilar,M.L.
{"family":"Aguilar","given":"M.L.","particle":"Reyna de"}
-----------
input: M.L. Reyna de Aguilar
{"family":"Aguilar","given":"M. L. Reyna","particle":"de"}
confuse the underlying Namae gem (https://github.com/berkmancenter/namae/issues), on which dwc_agent depends.
By the way: brackets again, this time round parentheses, `name (...)` :grin:. In dwc_agent (3.0.16.0) everything in parentheses disappears. Is that the general idea, the concept, that all content in parentheses is removed?
Examples with parentheses that I found in WikiData, and that I had thought of standardizing with dwc_agent as well, are the following:
-----------
input: Gustav Adolf Ferdinand Eichler (1835-1906)
{"family":"Eichler","given":"Gustav Adolf Ferdinand"}
-----------
input: Søren Sørensen (1873-1926)
{"family":"Sørensen","given":"Søren"}
-----------
input: Georges André (1888–1973)
{"family":"André","given":"Georges"}
-----------
input: Johannes Johannessen (1904-1990)
{"family":"Johannessen","given":"Johannes"}
-----------
input: Helge Buen (1918-2005)
{"family":"Buen","given":"Helge"}
-----------
input: Vlk Valenta (1925-2010)
{"family":"Valenta","given":"Vlk"}
-----------
input: Amalesh Choudhury (bot.)
{"family":"Choudhury","given":"Amalesh"}
-----------
input: Bror Pettersson (botaniker)
{"family":"Pettersson","given":"Bror"}
-----------
input: Robert W. Jones (botanist)
{"family":"Jones","given":"Robert W."}
-----------
input: Thomas Cooper (botanist)
{"family":"Cooper","given":"Thomas"}
-----------
input: Yi Huang (botanist-1)
{"family":"Huang","given":"Yi"}
-----------
input: William Vernon (c. 1666-1711)
{"family":"Vernon","given":"William"}
-----------
input: James Smith (diatomist)
{"family":"Smith","given":"James"}
-----------
input: Josep María Vidal(-Frigola)
{"family":"Vidal","given":"Josep María"}
-----------
input: István Balázs (instruisto)
{"family":"Balázs","given":"István"}
-----------
input: Hildur von Rettig (Lindberg)
{"family":"Rettig","given":"Hildur","particle":"von"}
-----------
input: Inger Kaasa (Magistad)
{"family":"Kaasa","given":"Inger"}
-----------
input: Kai Zhang (mycologist)
{"family":"Zhang","given":"Kai"}
-----------
input: Ting-Ting Zhang (mycologist)
{"family":"Zhang","given":"Ting-Ting"}
-----------
input: Robert J. Ferry (Sr.)
{"family":"Ferry","given":"Robert J."}
-----------
input: Phraya Wanpruekphichan (Thongkham Savetsila)
{"family":"Wanpruekphichan","given":"Phraya"}
-----------
input: Phraya Winitwanandon (To Komet)
{"family":"Winitwanandon","given":"Phraya"}
-----------
input: Maria Pavlovna Nagibina (Tsybulskaya)
{"family":"Nagibina","given":"Maria Pavlovna"}
-----------
input: O. Heylen (-Walraevens)
{"family":"Heylen","given":"O."}
-----------
input: Bill Kasongo (Wa Ngoy Kashiki)
{"family":"Kasongo","given":"Bill"}
I did not quite expect that round parentheses would be removed entirely; I expected that some cases would be filtered out and others left in. One conceivable compromise is to simply add the content of the last parenthesis to the parsed field `family` if it does not contain dates or occupation terms such as botanist and so on. Or should it be documented that all content in round parentheses is removed?
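To make the proposed compromise more tangible, here is a small sketch that classifies the content of a trailing parenthesis. Everything in it is an assumption: `classify_parenthetical` is a hypothetical helper, and the list of occupation terms and suffixes is just what appears in the examples above, not a complete vocabulary.

```shell
# Hypothetical helper: classifies the content of a trailing "(...)" group as
# "qualifier" (occupation or suffix), "dates", "name", or "none".
classify_parenthetical() {
  # Extract the content of the last "(...)" at the end of the string, if any.
  paren=$(printf '%s\n' "$1" | sed -n -E 's/^.*\(([^()]*)\)[[:space:]]*$/\1/p')
  case $paren in
    '') echo none ;;
    bot.|botanist*|botaniker|mycologist|diatomist|instruisto|Sr.|Jr.)
      echo qualifier ;;
    *[0-9]*) echo dates ;;
    *) echo name ;;
  esac
}

classify_parenthetical 'Hildur von Rettig (Lindberg)'   # prints: name
classify_parenthetical 'Helge Buen (1918-2005)'         # prints: dates
classify_parenthetical 'Thomas Cooper (botanist)'       # prints: qualifier
```

Only the `name` class would be a candidate for the `family` field; whether that is desirable is exactly the open question.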
> By the way: brackets, this time round parentheses `name (...)` 😁. In `dwc_agent` (3.0.16.0) all names in parentheses disappear, is that the general idea, the concept, that all contents with parentheses are removed?
That was the general idea, yes. Most of the examples you have would seem to support the rationale, though admittedly there are some examples that appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expression of family names / identity.
> That was the general idea, yes. Most of the examples you have would seem to support the rationale, though admittedly there are some examples that appear to convey meaning (uncertainty perhaps?) about suffixes, nicknames, or other implicit expression of family names / identity.
I completely agree with this. So perhaps leave it that way.
I have further analysed the botanist names from WikiData and divided up the cases that appear: most of them are actually different versions of a name squeezed into one line, so to speak. From a data point of view, a correction would have to be made, and two or more names with the same meaning would have to be unravelled. An interesting case is the names with only one letter in the parentheses. Here are some examples of botanical names from WikiData:
grep --invert-match --extended-regexp '^[^()]+\(\w{1,3}\W\)[^()]+$' Liste_ursprünglich.txt > Liste_ursprünglich_vermindert_1.txt
grep --extended-regexp '^[^()]+\(\w{1,3}\W\)[^()]+$' Liste_ursprünglich.txt > Liste_ursprünglich_vermindert_1_Auszug.txt
Alan (C.) McKay Carolyn [Caroline] (A.) Young H.(Kh.) Karis Kh.(Ch.) G. Kulieva Yu.(Ju.) E. Petrov D.(T.) O'Gorman Robert J. Ferry (Jr.). Tatiana (Yu.) Gagkaeva
grep --invert-match --extended-regexp '^[^()]+\(\w{1}\)[^()]+$' Liste_ursprünglich_vermindert_1.txt > Liste_ursprünglich_vermindert_2.txt
grep --extended-regexp '^[^()]+\(\w{1}\)[^()]+$' Liste_ursprünglich_vermindert_1.txt > Liste_ursprünglich_vermindert_2_Auszug.txt
Anton(i) Wróblewski Dian Min(e) Chang Jacob Frederic(k) Brenckle …
grep --invert-match --extended-regexp '^[^()]+\(\w{1,2}\)[^()]+$' Liste_ursprünglich_vermindert_2.txt > Liste_ursprünglich_vermindert_3.txt
grep --extended-regexp '^[^()]+\(\w{1,2}\)[^()]+$' Liste_ursprünglich_vermindert_2.txt > Liste_ursprünglich_vermindert_3_Auszug.txt
Arthur C. (II) Grupe Bruno E.C. (de) Miranda Cun Ti (Di) Xiang Davi Mesquita (de) Macedo Julian(us) Hendrik Molkenboer Manuel (de) Assunção Diniz Marcos Antonio de (Jr) Morais Marcus (de) Melo Teixeira Priscila Sanjuan (de) Medeiros You (Yu) Wen Tsui
grep --invert-match --extended-regexp '^[^()]+\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_3.txt > Liste_ursprünglich_vermindert_4.txt
grep --extended-regexp '^[^()]+\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_3.txt > Liste_ursprünglich_vermindert_4_Auszug.txt
grep --extended-regexp '^[^()]+\w\(\w{1,}\)[^()]+$' Liste_ursprünglich_vermindert_4_Auszug.txt
Adolph(Adolf) Osterwalder Chris(toffel) F.J. Spies Constantine Demetry(Dmitriev) Sherbakoff Dénes(Dionisie) Pázmány Peng Yong(Yun) Zhang Ze(Tse) Xiang Peng Zu(Tsu) Tang Yin …
Aloysius (Luigi) Meschinelli Aldworth William (Tommy) Thompson Tina (Antje) Hofmann Wen (Wan) Jia Zhu Yin Tong (Tang) Xie …
cat Liste_ursprünglich_vermindert_4.txt | sed -r 's@^[^()]*\([^()]*\)[^()]*$@1 (…) - &@; s@^[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*$@2 (…) - &@; s@^[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*\([^()]*\)[^()]*$@3 (…) - &@;' | sort
1 (…) - ('Wilson') Sze Wing Wong
1 (…) - (Alexandre Alexis) George Le Monnier
1 (…) - (Bartolomeo Giacomo) Rinaldo Corradi
1 (…) - (Léon Marie Joseph) Gustave Nicolas
1 (…) - (Antonius) Theodoor Wegelin
1 (…) - (B.) Alfred Steinmann
1 (…) - (M.)F. Weyhe
1 (…) - (Mrs.) Leslie Hofer [Mrs. Vincent W.] Lanfear
1 (…) - (Q.)S.Y. Yeung
1 (…) - (Sister) Little Flower
1 (…) - (Wilhelm) William Winkler
1 (…) - A.H.G. ('Bert') Gerrits van den Ende
1 (…) - Albert (Jacob Josef) Vandevelde
1 (…) - August (François Marie Antoine) Tonglet
1 (…) - Amalesh Choudhury (bot.)
1 (…) - Bror Pettersson (botaniker)
1 (…) - István Balázs (instruisto)
1 (…) - James Smith (diatomist)
1 (…) - Robert W. Jones (botanist)
1 (…) - Ting-Ting Zhang (mycologist)
1 (…) - Yi Huang (botanist-1)
1 (…) - Analy Salles (de Azevedo) Melo
1 (…) - Bill Kasongo (Wa Ngoy Kashiki)
1 (…) - Franz August (`Friedrich') Müller
1 (…) - Friedrich (`Franz') Joseph Schelver
1 (…) - Frédéric-Edouard (`Fritz') Kampmann
1 (…) - Félix de Azara (1742-1821)
1 (…) - G. (of Nancy) Gardet
1 (…) - G.(of Bavaria) Gerber
1 (…) - G.I. (H.J.) Sëmina
1 (…) - Geoffrey S. ('Geoff') Hall
1 (…) - Georges André (1888–1973)
1 (…) - Gustav Adolf Ferdinand Eichler (1835-1906)
1 (…) - Giles E. (St J.) Hardy
1 (…) - H.(of Freiburg) Schmidt
1 (…) - H.J. (`Harry') Hudson
1 (…) - Ion(Ioan,Joan) C. Constantineanu
1 (…) - Jac (N.J.) Gelderblom
1 (…) - James Michael ('Jim') Miller
1 (…) - Jennifer Anne ('Jenny') Davidson
1 (…) - Robert J. Ferry (Sr.)
1 (…) - Robert Leroy ('Bob') Hanrahan
1 (…) - Ronald L. ('Ron') Exeter
1 (…) - Russell J. ('Rusty') Rodriguez
1 (…) - Rüdiger Felix (Ruggero Felice) Solla
1 (…) - Sandra L. ('Sandie') Baldauf
1 (…) - Saun-ichirô (Shun-ichirô) Imamura
1 (…) - Stig (Gunnar Anton) Waldheim
1 (…) - Stip (B.R.) Helleman
1 (…) - Søren Sørensen (1873-1926)
1 (…) - Thomas Cooper (botanist)
1 (…) - Vlk Valenta (1925-2010)
1 (…) - William Vernon (c. 1666-1711)
1 (…) - Yin Chan(Ch'An) Wu
1 (…) - Yvette Berenice ('Tivvy') Harvey
2 (…) - (Axel) Helge (Svensson) Stenar
2 (…) - (Carl) Julius (Adolf) Scharlock
2 (…) - Geoff(rey) (S.) Pegg
2 (…) - (Georg) Emil (Carl Christoph) Schuez
2 (…) - (J.A.A.)M.(H.) Goossens-Fontana
2 (…) - (Ludwig) Bernhard (Ehregott) Schmid
2 (…) - Matt(hew) (J.) Trappe
2 (…) - (Philippe) Victoire (Lévêque) de Vilmorin
2 (…) - (Theodor) Julius (Reinhold) von Schröder
3 (…) - (Johan) Fredrik(Friedrich) (Eberhard) Svanlund
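The labelling by number of parenthesised groups can also be done by counting the groups directly instead of writing one expression per count. This is only a sketch in Ruby with made-up helper names. Unlike the sed call it handles any number of groups, drops names without parentheses, and does not check for stray unmatched parentheses:

```ruby
# Count non-nested "(...)" groups in a name string.
def paren_group_count(name)
  name.scan(/\([^()]*\)/).size
end

# Prefix every name that has at least one group with "N (…) - " and sort,
# similar to the sed | sort pipeline above.
def label_by_groups(names)
  names.select { |n| paren_group_count(n).positive? }
       .map { |n| "#{paren_group_count(n)} (…) - #{n}" }
       .sort
end
```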
In our botanical data (https://dr.jacq.org/DR014960) there are also names such as „Friedrich August II.,König von Sachsen s.n.“, which are currently not processed quite right. There is also a German name standardisation for this person (see https://explore.gnd.network/en/gnd/118917218). Parsing this name in different variants gives the following results (dwc_agent 3.0.16.0):
-----------
input: Friedrich August II.
{"family":"August","given":"Friedrich","suffix":"II."}
-----------
input: Friedrich August II.,König von Sachsen
{"family":"August","given":"Friedrich","suffix":"II."}
{"family":"Sachsen","given":"König","particle":"von"}
-----------
input: Friedrich August II.,König von Sachsen s.n.
output: []
… the last case should output something, but does not :thinking: …
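Until this is handled in the parser, one conceivable workaround is to drop the trailing „s.n.“ (sine numero, i.e. no collector number) before the string is parsed. The helper below is hypothetical and not part of dwc_agent, and whether the parser then returns the same two results as for the second input would still need to be checked:

```ruby
# Hypothetical pre-processing step (NOT part of dwc_agent): remove a trailing
# "s.n." or "s. n." from a recordedBy string before handing it to the parser.
def strip_sine_numero(recorded_by)
  recorded_by.sub(/\s*\bs\.\s?n\.\s*\z/i, "")
end
```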
Taken strictly, the name „Friedrich August II,König von Sachsen“ could be understood like this:
{"family":"", "given": "August Friedrich", "suffix": "II.", "title": "König von Sachsen"}
or (if necessary)
{"family": "König von Sachsen", "given": "August Friedrich", "suffix": "II."}
… since August and Friedrich are both given names in German.
The same naming problem arises with e.g. „August der Starke“: the GND record (https://explore.gnd.network/en/gnd/118505084) holds the standard data and standardises the name as „August II, Polen, König“, which makes parsing very tricky. So in both cases there is no family name in the strict sense, is there? The same might apply to English names and so on.
Hej-hej,
I’m aware that good parsing and cleaning for all the name-list cases out there is a difficult task. So here are some more names that sometimes get parsed but are then lost during cleaning (see attachment; dwc_agent 3.0.8.0; I used the wrapper https://github.com/infinite-dao/collector-matching/blob/main/bin/agent_parse4tsv.rb).
In our BGBM data the separator within a name list often corresponds to the regex /(, | & )/, e.g. „Anonymous collector & Humboldt,F.W.H.A. von“.
This gets parsed differently with and without „Anonymous collector“: … Without it, the result is closer to correct, but still not fully right:
… only when the „von“ is placed before the family name does the result come out right.
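As an illustration of the two points above, here is a small Ruby sketch that splits a name list on that separator and moves a trailing particle in front of the family name before parsing. Both helpers and the short particle list are assumptions made for this example; they are not part of dwc_agent or of the wrapper:

```ruby
# Separator as described above: ", " between names, " & " before the last one.
# Written without a capture group, otherwise String#split would also return
# the separators themselves.
NAME_SEPARATOR = /, | & /

# Hypothetical, deliberately short particle list for this sketch only;
# longer particles come first so that "von der" wins over "von".
PARTICLES = ["von der", "van de", "von", "van", "de"].freeze

def split_names(recorded_by)
  recorded_by.split(NAME_SEPARATOR).map(&:strip).reject(&:empty?)
end

# Rewrite "Family,Given particle" as "Given particle Family";
# any other string is returned unchanged.
def particle_before_family(name)
  family, rest = name.split(",", 2)
  return name if rest.nil?

  particle = PARTICLES.find { |p| rest.end_with?(" #{p}") }
  return name if particle.nil?

  given = rest.delete_suffix(" #{particle}").strip
  "#{given} #{particle} #{family.strip}"
end
```

For example, `split_names("Anonymous collector & Humboldt,F.W.H.A. von").map { |n| particle_before_family(n) }` yields „Anonymous collector“ and „F.W.H.A. von Humboldt“. Input such as „Marck,J.W.C.T., von der“ would still be split in the wrong place, because the particle is itself preceded by the separator.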
So there are cases where the parsing of the particle could be improved. A difficult case (and I think the parser alone is not going to solve it completely) is:
Álvarez de Zayas,A., Beurton,C., Díaz,M.A., Dietrich,H., Duharte,Góngora,M.E., Gutiérrez,J., Köhler,E., Á,Leiva,Lepper,L., Rankin,R. & Sánchez,C.
Here „Köhler,E., Á,Leiva,Lepper,L.,“ is actually a mistake in the input (not in the parsing): the source data should be corrected to „Köhler,E., Leiva,Á, Lepper,L.,“. Anyway, it gets parsed to: … I’m not sure whether the particle in „Álvarez de Zayas,A.“ comes out right; the person is Alberto Álvarez de Zayas (wikidata.org/wiki/Q13497940).

The attached files are names from BGBM and Meise. In the log file you can look at the column related_parsed_name; cleaned_index_name_of_empty_result is the index of a cleaned result that comes out empty. Most of the names are herbarium entries or institutions, but there are also some real names, e.g. in Meise: … the 3rd name seems to be missing:
Attached log files from parsing with the wrapper agent_parse4tsv.rb (see https://github.com/infinite-dao/collector-matching/tree/main/bin; I hope the column names of the files are self-explanatory):
VHde_doi-10.15468-dl.tued2e (Virtual Herbarium Germany): occurrence_recordedBy_occurrenceIDs_20230524_parsed.tsv_dwcagent_3.0.8.0.log
Meise_doi-10.15468-dl.ax9zkh: occurrence_recordedBy_eventDate_occurrenceIDs_20230830_parsed.tsv_dwcagent_3.0.8.0.log