Closed sgershuny closed 2 months ago
Hi @sgershuny, thanks for the example file, that’s very helpful!
I had a look at the output from the spaCy model for the example you provided; the extracted entities are listed in the table below. I’m not sure we can do anything about this, as spaCy doesn’t recognize "Inés Santamaría" as a PER entity.
If all of your documents have the same structure, I would recommend manually parsing the relevant names from the documents and creating FollowTheMoney entities for them. For example, it should be relatively easy to extract the names after "Encargado del Proyecto" or "Nombre". The advantage of this approach is that it would be much more reliable than any NER model will ever be. You could even link the extracted entities to the source documents, so you can navigate from documents to extracted entities and vice versa.
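A minimal sketch of the manual parsing suggested above, assuming each field sits on its own line in the form `Label: value`. The label list, the `extract_names` helper, and the sample text are illustrative assumptions, not part of Aleph or of the real documents; turning the results into FollowTheMoney entities is left out here.

```python
import re

# Hypothetical labels; adjust to the fields that actually appear in your forms.
NAME_LABELS = ("Encargado del Proyecto", "Nombre")

# Match "<label>: <value>" on a single line; the value runs to the end of the line.
LABEL_PATTERN = re.compile(
    r"^[ \t]*(?P<label>"
    + "|".join(re.escape(label) for label in NAME_LABELS)
    + r")[ \t]*:[ \t]*(?P<value>.+?)[ \t]*$",
    re.MULTILINE,
)


def extract_names(text):
    """Return unique names found after the known labels, in order of first appearance."""
    names = []
    for match in LABEL_PATTERN.finditer(text):
        name = match.group("value")
        if name not in names:
            names.append(name)
    return names


# Made-up sample loosely modelled on the fake form (assumed layout).
SAMPLE = """\
Entidad Solicitante: Energía Renovable S.A.
Encargado del Proyecto: Inés Santamaría
Dirección Postal: Calle Falsa 123, Madrid
Firma: Inés Santamaría
Nombre: Inés Santamaría
"""

if __name__ == "__main__":
    print(extract_names(SAMPLE))
```

Because the match is anchored on known labels rather than on a statistical model, it only works if the line structure of the document is still intact when the text is parsed.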
Label | Text |
---|---|
ORG | Acciona S.A. - Formulario de Cumplimiento Ambiental Proyecto Denominado |
PER | Construcción de Parque Eólico |
LOC | Montaña Ubicación del Terreno |
LOC | Sierra de Guadarrama |
LOC | Madrid Descripción del Iniciativa: Instalación |
MISC | Entidad Solicitante |
PER | Energía Renovable S.A. Encargado |
MISC | Proyecto: Inés Santamaría Dirección Postal |
MISC | Calle Falsa 123 |
PER | Madrid Teléfono de Contacto |
MISC | Correo Electrónico de Referencia: juan.perez@energia-renovable.com Medio: Biodiversidad Consecuencia: Alteración Remedio |
PER | Reforestación Supervisión |
MISC | Monitoreo Documentos Adicionales: Adjuntos Proclamación |
MISC | Firma del Responsable: Firma: Inés Santamaría Nombre |
MISC | Inés Santamaría Fecha |
LOC | Depto |
PER | Ambiental Acciona S.A. Calle de Anabel Segura |
LOC | Alcobendas |
LOC | Madrid |
PER | España Consulta |
PER | Teléfono |
Note to myself: We do some text normalization before passing the text to spaCy. This includes collapsing whitespace. Not sure if that makes a difference.
I quickly tried normalizing whitespace differently. While spaCy now does recognize "Inés Santamaría" as a name, the results overall are still quite mixed, and I’m unsure whether this would lead to consistently better results. It might also reduce accuracy for prose, so I’m hesitant to move forward with this.
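To illustrate what "normalizing differently" can mean, here is a rough sketch of two strategies: collapsing all whitespace into single spaces, and collapsing only spaces and tabs while keeping line breaks. Both helpers and the sample string are assumptions for illustration; this is not the actual normalization code, and it says nothing about which variant spaCy handles better.

```python
import re


def collapse_all_whitespace(text):
    """Replace every run of whitespace, including newlines, with one space."""
    return re.sub(r"\s+", " ", text).strip()


def collapse_keep_newlines(text):
    """Collapse spaces and tabs within each line, drop empty lines, keep line breaks."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample with irregular spacing (assumed layout).
SAMPLE = "Encargado del  Proyecto:\tInés Santamaría\n\nDirección Postal:   Calle Falsa 123\n"

if __name__ == "__main__":
    print(repr(collapse_all_whitespace(SAMPLE)))
    print(repr(collapse_keep_newlines(SAMPLE)))
```

With the first variant, field labels and values from neighbouring lines end up in one long run of text, which matches the merged spans visible in the entity tables. The second variant keeps each field on its own line.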
As mentioned above, for structured data I’d still recommend manually parsing out the data.
Label | Text |
---|---|
ORG | Acciona S.A. - Formulario de Cumplimiento Ambiental Proyecto Denominado |
PER | Construcción de Parque Eólico |
LOC | Montaña Ubicación del Terreno |
LOC | Sierra de Guadarrama |
LOC | Madrid |
MISC | Descripción del Iniciativa: Instalación |
MISC | Entidad Solicitante |
PER | Energía Renovable S.A. Encargado del |
PER | Inés Santamaría |
PER | Dirección Postal |
MISC | Calle Falsa 123 |
LOC | Madrid |
PER | Teléfono de Contacto |
MISC | Correo Electrónico de Referencia: juan.perez@energia-renovable.com Medio: Biodiversidad Consecuencia: Alteración Remedio |
LOC | Reforestación Supervisión |
LOC | Monitoreo Documentos Adicionales |
LOC | Adjuntos Proclamación |
MISC | Firma del Responsable: Firma: Inés Santamaría Nombre |
PER | Inés Santamaría Fecha |
MISC | Enviar: medioambiente@acciona.com Dirección |
LOC | Depto |
LOC | Ambiental Acciona S.A. Calle de Anabel Segura |
LOC | Alcobendas |
LOC | Madrid |
LOC | España Consulta |
PER | Teléfono |
MISC | Email: medioambiente@acciona.com |
I’m going to close this given it’s an issue with spaCy, but please feel free to reopen in case you have additional information.
Describe the bug: Names/entities are not detected as mentions in the HTML documents I am uploading. When I run spaCy locally, it doesn't detect the names in my documents either. I am parsing the HTML the way I see it parsed in html.py in the ingest repository on GitHub, and I am using the es_core_news_sm model. The documents are in Spanish and are not structured in full sentences. The flair library with the ner-spanish-large model does extract these names.
To Reproduce: I have created a fake document containing the name Inés Santamaría, and it is not detected by spaCy. Apologies that we can't provide our real data, as that would surely be more helpful.
I saved it as a .txt file because I couldn't upload HTML here: fake_form.txt
Expected behavior: Inés Santamaría is extracted as a "PER" entity by spaCy, as it is by flair.
Aleph version: 3.17.0
Additional context: Names are consistently missed in every single document. They usually aren't detected by spaCy in a local run either.