DemocracyClub / yournextrepresentative

👥 A website for crowd-sourcing structured election candidate data
https://candidates.democracyclub.org.uk
GNU Affero General Public License v3.0

SOPN Parsing: Page Extraction Errors #1726

Open VirginiaDooley opened 2 years ago

VirginiaDooley commented 2 years ago

This issue exclusively tracks problems with SOPN page extraction.
For SOPN Parsing: Table Parsing Errors, go here: https://github.com/DemocracyClub/yournextrepresentative/issues/1728
For SOPN Parsing: Table Extraction Errors, go here: https://github.com/DemocracyClub/yournextrepresentative/issues/1727

Page extraction errors typically occur when uploading a SOPN. The most common errors include:

Please add these types of issues in the comments below with a

VirginiaDooley commented 2 years ago

Page matching error: https://github.com/DemocracyClub/yournextrepresentative/issues/1426#issuecomment-1025952251

symroe commented 2 years ago

https://candidates.democracyclub.org.uk/elections/local.west-lothian.livingston-south.2022-05-05/sopn/ (and other SOPNs for that election) don't match pages. Chances are this is because the ward names are in the table header.

boothym commented 2 years ago

Hi, it seems the Fife Council one has problems as each table is spread over two pages in the PDF. https://candidates.democracyclub.org.uk/elections/local.fife.burntisland-kinghorn-and-western-kirkcaldy.2022-05-05/sopn/
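
A minimal sketch of how tables that spill onto a second page could be kept with their ward, assuming each page has already been labelled with the ward heading found on it (or `None` when no heading was found). `group_pages_by_ward` is a hypothetical helper for illustration, not part of the yournextrepresentative parser.

```python
from typing import Dict, List, Optional


def group_pages_by_ward(page_wards: List[Optional[str]]) -> Dict[str, List[int]]:
    """Group 1-based page numbers by ward.

    A page without a ward heading is treated as a continuation of the
    most recent page that had one (an assumption of this sketch).
    """
    groups: Dict[str, List[int]] = {}
    current: Optional[str] = None
    for page_number, ward in enumerate(page_wards, start=1):
        if ward is not None:
            current = ward
        if current is None:
            # Pages before the first heading (e.g. a cover sheet) are skipped.
            continue
        groups.setdefault(current, []).append(page_number)
    return groups


pages = [None, "Ward A", None, "Ward B", None]
print(group_pages_by_ward(pages))
```

This only helps when continuation pages carry no heading of their own; a document that repeats the heading on every page would already group correctly.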

jf1 commented 2 years ago

Wigan strangeness - the correct pages have been used by the parser for all the LA (so far) but the link in the Ashton ward goes to another ward's SoPN https://candidates.democracyclub.org.uk/elections/local.wigan.ashton.2022-05-05/sopn/

jf1 commented 2 years ago

It's also joined the Hindley and Hindley Green wards, suggesting it's not strict enough when considering if a ward stretches onto two pages of a SoPN. https://candidates.democracyclub.org.uk/bulk_adding/sopn/local.wigan.hindley.2022-05-05/?edit=1 I wonder if page splitting was offset by one as a result.

...it then processed the Hindley Green page (again) for that ward without issue
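
One way to make the matching stricter, as suggested above, is to require a whole-phrase match and prefer the longest matching ward name, so that a page headed "Hindley Green" is never claimed by "Hindley". The following is a minimal sketch under that assumption; `normalise` and `best_ward_for_page` are hypothetical names for illustration, not functions from the yournextrepresentative parser.

```python
import re
from typing import List, Optional


def normalise(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def best_ward_for_page(page_text: str, ward_names: List[str]) -> Optional[str]:
    """Return the longest ward name that appears as a whole phrase in the page."""
    haystack = normalise(page_text)
    matches = []
    for name in ward_names:
        # (?<!\w) and (?!\w) stop the name matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(normalise(name)) + r"(?!\w)"
        if re.search(pattern, haystack):
            matches.append(name)
    # "Hindley" and "Hindley Green" both match a Hindley Green page;
    # choosing the longest name resolves the ambiguity.
    return max(matches, key=len) if matches else None


wards = ["Hindley", "Hindley Green"]
print(best_ward_for_page("Election of a Councillor for Hindley Green Ward", wards))
print(best_ward_for_page("Election of a Councillor for Hindley Ward", wards))
```

The longest-match rule is a heuristic: a Hindley page that mentions Hindley Green elsewhere (for example in a candidate's address) would still be misassigned, so restricting the search to the page heading would be a sensible refinement.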

jf1 commented 2 years ago

Wigan Winstanley ward - it offered the wrong candidate names and linked to the wrong (page of the) SoPN https://candidates.democracyclub.org.uk/bulk_adding/sopn/local.wigan.winstanley.2022-05-05/

gregorywilliams commented 2 years ago

https://candidates.democracyclub.org.uk/elections/local.oxford.cowley.2022-05-05/ Should have been Cowley ward, but extracted page was for Littlemore ward. The correct ward is available in the linked https://www.oxford.gov.uk/download/downloads/id/7948/statement_as_to_persons_nominated_-_city_elections_on_5_may_2022.pdf

gregorywilliams commented 2 years ago

https://candidates.democracyclub.org.uk/elections/local.wigan.ashton.2022-05-05/ Should have been Ashton ward, but extracted page was for Bryn ward. The correct ward is available in the linked https://www.wigan.gov.uk/Docs/PDF/Council/Voting-and-Elections/2022/Statement-of-Persons-Nominated-for-Local-Election-5-May-2022.pdf

jf1 commented 2 years ago

This 4-page single ward PDF incorrectly generated a "Watch out! The original document contains candidate info for 2 areas." warning https://candidates.democracyclub.org.uk/elections/local.tower-hamlets.bethnal-green-west.2022-05-05/sopn/

jf1 commented 2 years ago

Same with https://candidates.democracyclub.org.uk/elections/local.tower-hamlets.bethnal-green-east.2022-05-05/sopn/ Both were .docx on their website and initially DC had PDFs with different formatting so I re-did these two, and got the "2 areas" message after uploading each one.

sjorford commented 1 year ago

local.lichfield.boney-hay-central.2023-05-04 - the pages for Boney Hay & Central and Bourne Vale wards have been combined

it3986 commented 1 year ago

Exeter SOPNs don't appear to have been parsed by the bot - I've looked at the first 3 so far. https://candidates.democracyclub.org.uk/elections/local.exeter.alphington.2023-05-04/

it3986 commented 1 year ago

DocX file for Torbay Council doesn't appear to have been understood by the bot. Again I've checked the first 3 wards and they all have the same symptoms. Pages are matched but tables not extracted and no bot suggestions on the bulk adding screen.

https://candidates.democracyclub.org.uk/bulk_adding/sopn/local.torbay.churston-with-galmpton.2023-05-04/

[Edit] Later wards within this SOPN document have not been page-matched by the bot, and I had to search manually (Ctrl + F) even to find the correct page of the SOPN before adding the candidates by hand.

Bekabyx commented 1 year ago

Parser fail for Mapperley in Nottingham. Haven't checked the other wards yet but it seems to have picked up the wrong page when parsing.

VirginiaDooley commented 1 year ago

2023 examples of the SOPN parser pulling the wrong ward from a combined PDF/DOCX file:
https://candidates.democracyclub.org.uk/elections/local.vale-of-white-horse.sutton-courtenay.2023-05-04/sopn/
https://candidates.democracyclub.org.uk/elections/local.redcar-and-cleveland.guisborough.2023-05-04/sopn/
https://candidates.democracyclub.org.uk/elections/local.wyre.garstang.2023-05-04/sopn/
https://candidates.democracyclub.org.uk/elections/local.ribble-valley.ribchester.2023-05-04/sopn/
https://candidates.democracyclub.org.uk/elections/local.bedford.shortstown.2023-05-04/sopn/
https://candidates.democracyclub.org.uk/elections/local.chelmsford.moulsham-lodge.2023-05-04/sopn/
https://candidates.democracyclub.org.uk/elections/local.nottingham.mapperley.2023-05-04/sopn/
https://candidates.democracyclub.org.uk/elections/local.gateshead.2023-05-04/
https://candidates.democracyclub.org.uk/elections/local.bradford.2023-05-04/

VirginiaDooley commented 1 year ago

Sandwell St. Paul’s is in a limbo, half-broken state. The page extraction failed but the table parsing succeeded (albeit in a slightly janky format). The SOPN uploaded is the entire combined PDF file. The suspected cause of this strange breakage was the backtick in the ward name, although Virginia has checked this and can’t see a problem with it. https://candidates.democracyclub.org.uk/elections/local.sandwell.st-pauls.2023-05-04/sopn/
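
The punctuation theory was not confirmed above, but if apostrophe variants in ward names ever need ruling out, a comparison like the following could help. It is a sketch only: `normalise_ward_name` is a hypothetical helper, and the set of characters treated as apostrophes is an assumption.

```python
# Map curly quotes and the backtick to a plain apostrophe (assumed variants).
APOSTROPHE_VARIANTS = str.maketrans({"\u2019": "'", "\u2018": "'", "`": "'"})


def normalise_ward_name(name: str) -> str:
    """Unify apostrophe variants, collapse whitespace, and ignore case."""
    unified = name.translate(APOSTROPHE_VARIANTS)
    return " ".join(unified.split()).casefold()


# The same ward name typed three different ways compares equal.
variants = ["St. Paul\u2019s", "St. Paul`s", "St.  Paul's"]
print({normalise_ward_name(v) for v in variants})
```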