alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

BUG: Names not extracted as mentions #3785

Closed sgershuny closed 2 months ago

sgershuny commented 3 months ago

Describe the bug: Names/entities are not detected as mentions in the HTML documents I am uploading. When I run spaCy locally, it doesn't detect the names in my documents either. I parse the HTML the same way html.py in the ingest repository on GitHub does, and I use the es_core_news_sm model. The documents are in Spanish and are not structured in full sentences. The flair library with the ner-spanish-large model does extract these names.

To Reproduce: I have created a fake document containing the name Inés Santamaría, and spaCy does not detect it. Apologies that we can't provide our real data, as that would surely be more helpful.

I saved it as a .txt file because I couldn't upload HTML here: fake_form.txt

Expected behavior: Inés Santamaría is extracted as a "PER" entity by spaCy, as it is by flair.

Aleph version: 3.17.0

Additional context: Names are missed consistently, in every single document. spaCy usually doesn't detect these names when I run it locally either.

tillprochaska commented 3 months ago

Hi @sgershuny, thanks for the example file, that’s very helpful!

I had a look at the output of the spaCy model for the example you provided; the extracted entities are listed in the table below. I’m not sure we can do anything about this, as spaCy doesn’t recognize "Inés Santamaría" as a PER entity.

If all of your documents have the same structure, I would recommend manually parsing the relevant names from the documents and creating FollowTheMoney entities for them. For example, it should be relatively easy to extract the names after "Encargado del Proyecto" or "Nombre". The advantage of this approach is that it would be much more reliable than any NER model will ever be. You could even link the extracted entities to the source documents, so you can navigate from documents to extracted entities and vice versa.
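The manual parsing suggested above could look roughly like the following sketch. It assumes each form field sits on its own line as "Label: value", which may not match the real documents, and it builds plain dictionaries shaped like FollowTheMoney Person entities rather than calling the followthemoney library. The names `NAME_LABELS`, `extract_names`, and `to_person_entity` are hypothetical helpers, not part of Aleph.

```python
import hashlib
import re

# Labels assumed to precede a person's name. Both come from the comment
# above; extend the list to match the real forms.
NAME_LABELS = ["Encargado del Proyecto", "Nombre"]

# Match "<label>: <value>" on a single line. [ \t]* is used instead of \s*
# so that an empty field never swallows the following line.
_LABELS = "|".join(re.escape(label) for label in NAME_LABELS)
LABEL_RE = re.compile(
    r"^[ \t]*(?:" + _LABELS + r")[ \t]*:[ \t]*(?P<name>\S.*?)[ \t]*$",
    re.MULTILINE,
)


def extract_names(text):
    """Return the distinct names found after the known labels, in order."""
    names = []
    for match in LABEL_RE.finditer(text):
        name = match.group("name")
        if name not in names:
            names.append(name)
    return names


def to_person_entity(name, source_key):
    """Build a dict shaped like a FollowTheMoney Person entity.

    The ID scheme (SHA-1 of source key and name) is an assumption made
    for this sketch; any stable, unique ID would do.
    """
    digest = hashlib.sha1(f"{source_key}:{name}".encode("utf-8")).hexdigest()
    return {"id": digest, "schema": "Person", "properties": {"name": [name]}}


# Hypothetical sample in the assumed one-field-per-line layout.
SAMPLE = """\
Entidad Solicitante: Energía Renovable S.A.
Encargado del Proyecto: Inés Santamaría
Dirección Postal: Calle Falsa 123
Nombre: Inés Santamaría
Fecha:
"""

entities = [to_person_entity(name, "fake_form.txt") for name in extract_names(SAMPLE)]
```

Because the patterns are anchored to known labels, this only finds names in fields you list explicitly, which is the trade-off against an NER model: it is predictable, but it will not discover names in unexpected places.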

| Label | Text |
| --- | --- |
| ORG | Acciona S.A. - Formulario de Cumplimiento Ambiental Proyecto Denominado |
| PER | Construcción de Parque Eólico |
| LOC | Montaña Ubicación del Terreno |
| LOC | Sierra de Guadarrama |
| LOC | Madrid Descripción del Iniciativa: Instalación |
| MISC | Entidad Solicitante |
| PER | Energía Renovable S.A. Encargado |
| MISC | Proyecto: Inés Santamaría Dirección Postal |
| MISC | Calle Falsa 123 |
| PER | Madrid Teléfono de Contacto |
| MISC | Correo Electrónico de Referencia: juan.perez@energia-renovable.com Medio: Biodiversidad Consecuencia: Alteración Remedio |
| PER | Reforestación Supervisión |
| MISC | Monitoreo Documentos Adicionales: Adjuntos Proclamación |
| MISC | Firma del Responsable: Firma: Inés Santamaría Nombre |
| MISC | Inés Santamaría Fecha |
| LOC | Depto |
| PER | Ambiental Acciona S.A. Calle de Anabel Segura |
| LOC | Alcobendas |
| LOC | Madrid |
| PER | España Consulta |
| PER | Teléfono |

tillprochaska commented 3 months ago

Note to myself: We do some text normalization before passing the text to spaCy. This includes collapsing whitespace. Not sure if that makes a difference.
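To illustrate why collapsing whitespace might matter for form-like documents, here is a minimal sketch. It is not Aleph's actual normalization code; both functions and the sample text are assumptions made for illustration. The first function flattens everything into one line, and the second is just one possible alternative that keeps line breaks.

```python
import re


def collapse_all_whitespace(text):
    """Replace every run of whitespace, including line breaks, with one space."""
    return re.sub(r"\s+", " ", text).strip()


def collapse_within_lines(text):
    """Collapse spaces and tabs inside each line, keep line breaks, drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical two-field excerpt with irregular spacing.
SAMPLE = "Encargado del Proyecto:   Inés Santamaría\n\nDirección Postal:\tCalle Falsa 123\n"

flat = collapse_all_whitespace(SAMPLE)
# flat == "Encargado del Proyecto: Inés Santamaría Dirección Postal: Calle Falsa 123"

kept = collapse_within_lines(SAMPLE)
# kept == "Encargado del Proyecto: Inés Santamaría\nDirección Postal: Calle Falsa 123"
```

In the flattened variant, a value and the next field's label end up adjacent with nothing separating them, which would be consistent with merged spans such as "Proyecto: Inés Santamaría Dirección Postal" in the output above. Whether the model treats line breaks as useful boundaries is a separate question that this sketch does not answer.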

tillprochaska commented 3 months ago

I quickly tried normalizing whitespace differently. While spaCy now does recognize "Inés Santamaría" as a name, the overall results are still quite mixed, and I’m unsure whether this would lead to consistently better results. It might also reduce accuracy for prose, so I’m hesitant to move forward with this.

As mentioned above, for structured data I’d still recommend manually parsing out the data.

| Label | Text |
| --- | --- |
| ORG | Acciona S.A. - Formulario de Cumplimiento Ambiental Proyecto Denominado |
| PER | Construcción de Parque Eólico |
| LOC | Montaña Ubicación del Terreno |
| LOC | Sierra de Guadarrama |
| LOC | Madrid |
| MISC | Descripción del Iniciativa: Instalación |
| MISC | Entidad Solicitante |
| PER | Energía Renovable S.A. Encargado del |
| PER | Inés Santamaría |
| PER | Dirección Postal |
| MISC | Calle Falsa 123 |
| LOC | Madrid |
| PER | Teléfono de Contacto |
| MISC | Correo Electrónico de Referencia: juan.perez@energia-renovable.com Medio: Biodiversidad Consecuencia: Alteración Remedio |
| LOC | Reforestación Supervisión |
| LOC | Monitoreo Documentos Adicionales |
| LOC | Adjuntos Proclamación |
| MISC | Firma del Responsable: Firma: Inés Santamaría Nombre |
| PER | Inés Santamaría Fecha |
| MISC | Enviar: medioambiente@acciona.com Dirección |
| LOC | Depto |
| LOC | Ambiental Acciona S.A. Calle de Anabel Segura |
| LOC | Alcobendas |
| LOC | Madrid |
| LOC | España Consulta |
| PER | Teléfono |
| MISC | Email: medioambiente@acciona.com |

tillprochaska commented 2 months ago

I’m going to close this given that it’s an issue with spaCy, but please feel free to reopen it if you have additional information.