DemocracyClub / yournextrepresentative

👥 A website for crowd-sourcing structured election candidate data
https://candidates.democracyclub.org.uk
GNU Affero General Public License v3.0

Improve sopn parsing party matching #2355

Closed: symroe closed 1 month ago

symroe commented 1 month ago

This change does a couple of things:

  1. Makes independent matching a little better, meaning we should avoid matching "Independent Dickens Heath Residents Action Group" when the party is "Independent".
  2. Uses "Levenshtein" distance queries before "starts with" queries, to avoid matching longer descriptions that happen to start with the search text.
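
To illustrate the ordering described in the two points above, here is a minimal, self-contained sketch. It is not the code in this PR: `match_party`, `levenshtein`, and the `max_distance` threshold are hypothetical names and values chosen for illustration, and the real change presumably runs against the database rather than an in-memory list.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def match_party(
    search_text: str, candidates: Iterable[str], max_distance: int = 2
) -> Optional[str]:
    """Hypothetical matcher: exact, then Levenshtein, then starts-with."""
    names = list(candidates)
    needle = search_text.strip().lower()

    # 1. An exact (case-insensitive) match always wins.
    for name in names:
        if name.lower() == needle:
            return name

    # 2. Otherwise take the closest name, if it is within the threshold.
    scored = [(levenshtein(needle, name.lower()), name) for name in names]
    if scored:
        distance, best = min(scored)
        if distance <= max_distance:
            return best

    # 3. Only then fall back to a "starts with" match.
    for name in names:
        if name.lower().startswith(needle):
            return name
    return None


parties = [
    "Independent Dickens Heath Residents Action Group",
    "Independent",
    "Labour Party",
]
print(match_party("Independent", parties))
print(match_party("Independant", parties))
print(match_party("Labour", parties))
```

Because the distance check runs before the prefix check, a near-miss spelling resolves to the short name rather than to a longer description that merely begins with similar text. The prefix fallback still catches truncated names such as "Labour".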

Hopefully this will prevent a couple of the most common problems with SOPN parsing.

symroe commented 1 month ago

Just checking, did you run parse_tables on the extracted data? Was there a raw_data object in the database?

VirginiaDooley commented 1 month ago

There was no raw people data, but I don't think I actually ran parse_tables. I was following this workflow to test: https://github.com/DemocracyClub/yournextrepresentative/pull/2264#issuecomment-2009702008

In addition, I've now run python manage.py sopn_parsing_parse_tables.

"We couldn't find a header for local.runnymede.chertsey-st-anns.2022-05-05".

https://s3.eu-west-2.amazonaws.com/static-candidates.democracyclub.org.uk/media/official_documents/local.runnymede.chertsey-st-anns.2022-05-05/sopn/2024-04-05T13%3A57%3A45.088502%2B00%3A00/sopn-local.runnymede.chertsey-st-anns.2022-05-05.pdf

Parsing Status

  * Pages matched: No
  * Camelot tables extracted: Yes
  * Raw Person Data: No
  * AWS Textract Data: Yes
  * AWS Textract Parsed? Yes

So perhaps not a great test case.

VirginiaDooley commented 1 month ago

Unrelated to this but related to SOPNs: the click-to-add-file feature only works on the second attempt on my machine. Not a big deal, but FYI in case it comes up later.