colav / Kahi_plugins

Mono Repo for Kahi Plugins
BSD 3-Clause "New" or "Revised" License

Proposal for `split_names` #180

Closed: restrepo closed this issue 4 months ago

restrepo commented 8 months ago
from re import sub, split, UNICODE

def split_names(s, exceptions=['GIL', 'LEW', 'LIZ', 'PAZ', 'REY', 'RIO', 'ROA', 'RUA', 'SUS', 'ZEA',
                              'ANA', 'LUZ', 'SOL', 'EVA', 'EMA'], sep=':'):
    """
    Extract the parts of the full name `s` in the format ([] → optional):

    NAMES [SMALL_CONNECTORS] FIRST_LAST_NAME [SMALL_CONNECTORS] [SECOND_LAST_NAME]

    * If `s` has 2 words → Foreign name assumed, with a single last name
    * If `s` has 3 words → Colombian name assumed, with one first name and two last names

    Add short names and last names (3 letters or fewer) to the `exceptions` list if necessary

    Works with:
    ----
          'DANIEL ANDRES LA ROTTA FORERO',
          'MARIA DEL CONSUELO MONTES RAMIREZ',
          'RICARDO DE LA MERCED CALLEJAS POSADA',
          'MARIA DEL CARMEN DE LA CUESTA BENJUMEA',
          'CARLOS MARTI JARAMILLO OCAMPO NICOLAS',
          'DIEGO ALEJANDRO RESTREPO QUINTERO',
          'JAIRO HUMBERTO RESTREPO ZEA',
          'MARLEN JIMENEZ DEL RIO ',
          'SARA RESTREPO FERNÁNDEZ', # Colombian: NAME two LAST_NAMES
          'ENRICO NARDI', # Foreign
          'ANA ZEA',
          'SOL ANA DE ZEA GIL'
    Fails:
    ----
        s='RANGEL MARTINEZ VILLAL ANDRES MAURICIO' # more than 2 last names
        s='ROMANO ANTONIO ENEA' # Foreign → LAST_NAME NAMES
    """
    s = s.title()
    exceptions = [e.title() for e in exceptions]
    # Fuse every short word (1-3 letters) to the word that follows it using `sep`.
    # `flags` must be passed by keyword: the 4th positional argument of sub() is `count`.
    sl = sub(r'(\s\w{1,3})\s', fr'\1{sep}', s, flags=UNICODE)
    sl = sub(r'(\s\w{1,3}%s\w{1,3})\s' % sep, fr'\1{sep}', sl, flags=UNICODE)
    sl = sub(r'^(\w{1,3})\s', fr'\1{sep}', sl, flags=UNICODE)
    # Clean exceptions
    # Extract the list of short words
    lst = [w for w in split(r'(\w{1,3})%s' % sep, sl) if 1 <= len(w) <= 3]
    # Intersection with the exceptions list
    exc = [value for value in exceptions if value in lst]
    # Un-fuse the exceptions so that they are split off as separate words
    for e in exc:
        sl = sl.replace('{}{}'.format(e, sep), '{} '.format(e))

    sll = sl.split()

    # Pad the second-name slot so that sll[:2] holds the names and sll[2:] the last names
    if len(sll) in (2, 3):
        sll = [sll[0], ''] + sll[1:]

    d = {'NOMBRES':  [x.replace(sep,' ') for x in sll[:2] if x],
         'APELLIDOS': [x.replace(sep,' ') for x in sll[2:] if x],
         }
    d['INICIALES'] = [x[0]+'.' for x in d['NOMBRES']]

    return d

assert split_names('DANIEL ANDRES LA ROTTA FORERO')=={'NOMBRES': ['Daniel', 'Andres'],
  'APELLIDOS': ['La Rotta', 'Forero'],
  'INICIALES': ['D.', 'A.']}
assert split_names('MARIA DEL CARMEN DE LA CUESTA BENJUMEA')=={'NOMBRES': ['Maria', 'Del Carmen'],
  'APELLIDOS': ['De La Cuesta', 'Benjumea'],
  'INICIALES': ['M.', 'D.']}
restrepo commented 8 months ago

For now, accept only an exact match on at least one of the given names and on at least the first surname:

Example:

full_name = 'Óscar Zapata' → d = {'NOMBRES': ['oscar'], 'APELLIDOS': ['zapata']}

Check at least:

  1. any(n in ['oscar', 'alberto'] for n in d['NOMBRES'])
  2. d['APELLIDOS'][0] == ['zapata', 'norena'][0]  # first surname in the list
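
The two checks above could be sketched roughly as follows. This is only an illustration: `normalize` and `is_exact_match` are hypothetical helper names that do not appear in this issue, and stripping accents and lowercasing is an assumption inferred from the 'Óscar' → 'oscar' example.

```python
import unicodedata


def normalize(text):
    """Lowercase and strip accents, e.g. 'Óscar' -> 'oscar' (assumed normalization)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c)).lower()


def is_exact_match(d, known_names, known_surnames):
    """Hypothetical check: at least one given name and the first surname must match."""
    names = {normalize(n) for n in d['NOMBRES']}
    surnames = [normalize(a) for a in d['APELLIDOS']]
    known_names = {normalize(n) for n in known_names}
    known_surnames = [normalize(a) for a in known_surnames]
    if not surnames or not known_surnames:
        return False
    # Check 1: some given name is shared. Check 2: the first surnames are equal.
    return bool(names & known_names) and surnames[0] == known_surnames[0]


d = {'NOMBRES': ['Óscar'], 'APELLIDOS': ['Zapata']}
print(is_exact_match(d, ['oscar', 'alberto'], ['zapata', 'norena']))  # True
```

Comparing normalized strings keeps the match exact while tolerating differences in case and accents between sources.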

Regards, Diego

omazapa commented 8 months ago

Hi @restrepo, the proposed algorithm only takes the initials of the given names and not of the surnames. Is that intended?

restrepo commented 8 months ago

Yes. Initials only for the given names, not for the surnames.

omazapa commented 8 months ago

Implemented in the package https://github.com/colav/kahi_impactu_utils and added to the Kahi plugins in PR #199.

omazapa commented 4 months ago

Hi professor, this proposed algorithm for splitting given names and surnames is failing on very basic names such as

Bob Reynolds was parsed as ['Bob Reynolds'] [] 

because names that have only 3 letters are fused to the following word instead of being split off.
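
The failure can be reproduced in isolation with a minimal sketch that reuses only the leading-word substitution and the final slicing from the proposal (the variable names `names` and `surnames` are chosen here for illustration):

```python
from re import sub

sep = ':'
s = 'Bob Reynolds'

# Substitution from split_names: a leading word of 1-3 letters
# is glued to the next word with `sep`.
sl = sub(r'^(\w{1,3})\s', fr'\1{sep}', s)
sll = sl.split()

# Same slicing as the proposal: the first two tokens are names, the rest surnames.
names = [x.replace(sep, ' ') for x in sll[:2] if x]
surnames = [x.replace(sep, ' ') for x in sll[2:] if x]

print(sl)               # Bob:Reynolds
print(names, surnames)  # ['Bob Reynolds'] []
```

Since 'Bob' is not in the `exceptions` list, nothing separates it again, so the whole string ends up as a single name.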

I think we can remove this part of the code

    exceptions = [e.title() for e in exceptions]
    sl = sub('(\s\w{1,3})\s', fr'\1{sep}', s, UNICODE)  # noqa: W605
    sl = sub('(\s\w{1,3}%s\w{1,3})\s' %sep, fr'\1{sep}', sl, UNICODE)  # noqa: W605
    sl = sub('^(\w{1,3})\s', fr'\1{sep}', sl, UNICODE)  # noqa: W605
    # Clean exceptions
    # Extract short names list
    lst = [s for s in split(
        '(\w{1,3})%s' %sep, sl) if len(s) >= 1 and len(s) <= 3]  # noqa: W605
    # intersection with exceptions list
    exc = [value for value in exceptions if value in lst]
    if exc:
        for e in exc:
            sl = sl.replace('{}{}'.format(e,sep), '{} '.format(e))

What do you think?

omazapa commented 4 months ago

Those exceptions were there from before, when surnames were split first and given names afterwards.

restrepo commented 4 months ago

Hi. Add Bob to the algorithm's list of exceptions for 3-letter words, since there are many words of up to 3 letters that are not names. Regards, Diego

restrepo commented 4 months ago
from re import sub, split, UNICODE

def split_names(s, connectors=['DE', 'DEL', 'LA', 'EL', 'JR', 'JR.'], sep=':'):
    """
    Extract the parts of the full name `s` in the format ([] → optional):

    NAMES [SMALL_CONNECTORS] FIRST_LAST_NAME [SMALL_CONNECTORS] [SECOND_LAST_NAME]

    * If `s` has 2 words → Foreign name assumed, with a single last name
    * If `s` has 3 words → Colombian name assumed, with one first name and two last names

    Add connectors to the `connectors` list if necessary

    Works with:
    ----
          'DANIEL ANDRES LA ROTTA FORERO',
          'MARIA DEL CONSUELO MONTES RAMIREZ',
          'RICARDO DE LA MERCED CALLEJAS POSADA',
          'MARIA DEL CARMEN DE LA CUESTA BENJUMEA',
          'CARLOS MARTI JARAMILLO OCAMPO NICOLAS',
          'DIEGO ALEJANDRO RESTREPO QUINTERO',
          'JAIRO HUMBERTO RESTREPO ZEA',
          'MARLEN JIMENEZ DEL RIO ',
          'SARA RESTREPO FERNÁNDEZ', # Colombian: NAME two LAST_NAMES
          'ENRICO NARDI', # Foreign
          'ANA ZEA',
          'SOL ANA DE ZEA GIL'
    Fails:
    ----
        s='RANGEL MARTINEZ VILLAL ANDRES MAURICIO' # more than 2 last names
        s='ROMANO ANTONIO ENEA' # Foreign → LAST_NAME NAMES

    Parameters:
    ----------
    s:str
        The full name to be processed.
    connectors:list
        A list of short connector words (such as 'DE' or 'LA') that stay joined to the word that follows them.
    sep:str
        The temporary marker used internally to join connectors to the following word.

    Returns:
    -------
    dict
        A dictionary with the keys 'names', 'surenames', 'full_name' and 'initials'.
    """
    s = s.title()
    connectors = [e.title() for e in connectors]
    # Fuse every short word (1-3 letters) to the word that follows it using `sep`.
    # `flags` must be passed by keyword: the 4th positional argument of sub() is `count`.
    sl = sub(r'(\s\w{1,3})\s', fr'\1{sep}', s, flags=UNICODE)
    sl = sub(r'(\s\w{1,3}%s\w{1,3})\s' % sep, fr'\1{sep}', sl, flags=UNICODE)
    sl = sub(r'^(\w{1,3})\s', fr'\1{sep}', sl, flags=UNICODE)
    # Extract the list of short words
    lst = [w for w in split(r'(\w{1,3})%s' % sep, sl) if 1 <= len(w) <= 3]
    # Short words that are not connectors are treated as names or last names
    exc = [value for value in lst if value not in connectors]

    # Un-fuse them so that they are split off as separate words
    for e in exc:
        sl = sl.replace('{}{}'.format(e, sep), '{} '.format(e))

    sll = sl.split()

    # Pad the second-name slot so that sll[:2] holds the names and sll[2:] the last names
    if len(sll) in (2, 3):
        sll = [sll[0], ''] + sll[1:]

    d = {'names': [x.replace(sep, ' ') for x in sll[:2] if x],
         'surenames': [x.replace(sep, ' ') for x in sll[2:] if x],
         }
    d['full_name'] = ' '.join(d['names'] + d['surenames'])
    d['initials'] = [x[0] + '.' for x in d['names']]

    return d

def test_split_names():
    assert split_names('DANIEL ANDRES LA ROTTA FORERO')['surenames'] == ['La Rotta', 'Forero']
    assert split_names('MARIA DEL CARMEN DE LA CUESTA BENJUMEA')['names'] == ['Maria', 'Del Carmen']
    assert split_names('MARIA DEL CARMEN DE LA CUESTA BENJUMEA')['surenames'] == ['De La Cuesta', 'Benjumea']
    assert split_names('CARLOS MARTI JARAMILLO OCAMPO NICOLAS')['surenames'] == ['Jaramillo', 'Ocampo', 'Nicolas']
    assert split_names('DIEGO ALEJANDRO RESTREPO QUINTERO')['surenames'] == ['Restrepo', 'Quintero']
    assert split_names('JAIRO HUMBERTO RESTREPO ZEA')['surenames'] == ['Restrepo', 'Zea']
    assert split_names('MARLEN JIMENEZ DEL RIO ')['surenames'] == ['Jimenez', 'Del Rio']
    assert split_names('SARA RESTREPO FERNÁNDEZ')['names'] == ['Sara']
    assert split_names('ENRICO NARDI')['surenames'] == ['Nardi']
    assert split_names('ANA ZEA')['names'] == ['Ana']
    assert split_names('ANA ZEA')['surenames'] == ['Zea']
    assert split_names('SOL ANA DE ZEA GIL')['names'] == ['Sol', 'Ana']
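    assert split_names('SOL ANA DE ZEA GIL')['surenames'] == ['De Zea', 'Gil']
    # Regression case reported in this issue: a 3-letter first name
    assert split_names('Bob Reynolds')['names'] == ['Bob']
    assert split_names('Bob Reynolds')['surenames'] == ['Reynolds']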

test_split_names()
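
The difference between the two proposals comes down to which short words are separated again after being fused with `sep`. The sketch below isolates that step. The helper `unfuse` and its `keep_fused` predicate are hypothetical names introduced only for this comparison, and the `exceptions` list is shortened for brevity.

```python
SEP = ':'


def unfuse(sl, short_words, keep_fused):
    """Turn 'Word:' back into 'Word ' unless keep_fused(word) is true."""
    for w in short_words:
        if not keep_fused(w):
            sl = sl.replace(f'{w}{SEP}', f'{w} ')
    return sl


exceptions = ['Gil', 'Zea', 'Ana']      # shortened allow-list from the first proposal
connectors = ['De', 'Del', 'La', 'El']  # connector list from the second proposal

fused = 'Bob:Reynolds'
short = ['Bob']

# First proposal: only words listed in `exceptions` are separated again.
old = unfuse(fused, short, keep_fused=lambda w: w not in exceptions)
# Second proposal: every short word is separated unless it is a connector.
new = unfuse(fused, short, keep_fused=lambda w: w in connectors)

print(old)  # Bob:Reynolds
print(new)  # Bob Reynolds
```

Listing connectors instead of exceptions means that unknown short names such as 'Bob' are handled without having to extend a list, at the cost of needing every connector to be listed.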