biglocalnews / warn-scraper

Command-line interface for downloading WARN Act notices of qualified plant closings and mass layoffs from state government websites
https://warn-scraper.readthedocs.io
Apache License 2.0

LA -- multi-line mess-up #533

Closed stucka closed 1 year ago

stucka commented 1 year ago

Louisiana's scraper is confused by a single entry split across multiple lines.

The error is from https://www.laworks.net/Downloads/WFD/WarnNotices2023.pdf

AMETEK/Orion Instruments,"2015 Oak Villa Blvd. Baton Rouge, LA 70815",,4/10/23 (Updated 7/12/23),9/15/23,20,Manufacture>
,,,,10/2/23,44,
,,,,11/15/23,9,

I'm trying to at least get it functional with a patch in warn-transformer: https://github.com/biglocalnews/warn-transformer/releases/tag/1.3.54

stucka commented 1 year ago

Possible approach in this section:

                for index, row in enumerate(table.rows):
                    cells = row.cells
                    row = [_extract_cell_chars(page, cell) for cell in cells]

                    # If the first row in a table is mostly empty,
                    # append its contents to the previous row
                    if (
                        _is_first(index)
                        and _is_mostly_empty(row)
                        and _has_rows(output_rows)
                    ):
                        output_rows = _append_contents_to_cells_in_row_above(
                            output_rows, index, row
                        )
                    # Otherwise, append the row
                    else:
                        output_rows.append(row)

If a row is mostly empty (_is_mostly_empty) and its first cell (company name) is blank, append its contents to the last good row. The last good row is defined as the most recent row that is not _is_mostly_empty.
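A minimal sketch of that idea, for illustration only. The helper names `is_mostly_empty` and `merge_continuation_rows`, the 50% blank threshold, and the space separator are all assumptions made here; they are not the scraper's actual `_is_mostly_empty` or `_append_contents_to_cells_in_row_above` implementations. It also assumes every row has the same number of cells.

```python
from typing import List, Optional

Row = List[str]


def is_mostly_empty(row: Row, threshold: float = 0.5) -> bool:
    """Hypothetical stand-in: True when more than `threshold` of the cells are blank."""
    if not row:
        return True
    blanks = sum(1 for cell in row if not cell.strip())
    return blanks / len(row) > threshold


def merge_continuation_rows(rows: List[Row], sep: str = " ") -> List[Row]:
    """Fold mostly-empty rows with a blank first cell into the last good row."""
    output: List[Row] = []
    last_good: Optional[int] = None  # index in `output` of the last row that is not mostly empty
    for row in rows:
        first_cell_blank = not row or not row[0].strip()
        if is_mostly_empty(row) and first_cell_blank and last_good is not None:
            target = output[last_good]
            # Join cell by cell, skipping blank pieces so no stray separators appear.
            output[last_good] = [
                sep.join(part for part in (above.strip(), below.strip()) if part)
                for above, below in zip(target, row)
            ]
        else:
            output.append(list(row))
            if not is_mostly_empty(row):
                last_good = len(output) - 1
    return output


# The three rows from the PDF excerpt above, with cell values copied as they appear there.
example_rows = [
    ["AMETEK/Orion Instruments", "2015 Oak Villa Blvd. Baton Rouge, LA 70815", "",
     "4/10/23 (Updated 7/12/23)", "9/15/23", "20", "Manufacture>"],
    ["", "", "", "", "10/2/23", "44", " "],
    ["", "", "", "", "11/15/23", "9", ""],
]
merged = merge_continuation_rows(example_rows)
```

With this sketch the three rows collapse into one, and the layoff date and count cells end up holding several values each, so anything downstream would still need to know how to split them.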

stucka commented 1 year ago

Louisiana scraper is disabled in warn-transformer==1.3.55 until we get this fixed.

https://github.com/biglocalnews/warn-transformer/commit/a348e5506be4dc6830ff30584a7c093f38b0b244

zstumgoren commented 1 year ago

@stucka Looks like there are subsequent rows with no metadata other than date and layoff count which are associated with a prior record.

Sounds like you're hoping to consolidate those rows into a single row. That could work, though I think it would also be fine to keep these as three separate rows and just sprinkle on the metadata from the "parent" row to the subsequent rows containing only layoff date and count values.
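A rough sketch of the second option, keeping separate rows and copying the parent's metadata down. The function name `fill_down_metadata` is hypothetical, and this is not necessarily how the scraper ended up handling it. It assumes the first cell is the company name, that a blank first cell marks a continuation row, and that all rows have the same number of cells.

```python
from typing import List, Optional

Row = List[str]


def fill_down_metadata(rows: List[Row]) -> List[Row]:
    """Fill blank cells in continuation rows with values from the parent row."""
    output: List[Row] = []
    parent: Optional[Row] = None  # most recent row with a non-blank first cell
    for row in rows:
        if row and row[0].strip():
            parent = row
            output.append(list(row))
        elif parent is not None:
            # Keep the row's own values (layoff date, count); inherit everything blank.
            output.append([
                cell if cell.strip() else inherited
                for cell, inherited in zip(row, parent)
            ])
        else:
            # No parent seen yet, so leave the row untouched.
            output.append(list(row))
    return output


# Cell values copied from the PDF excerpt earlier in the thread.
example_rows = [
    ["AMETEK/Orion Instruments", "2015 Oak Villa Blvd. Baton Rouge, LA 70815", "",
     "4/10/23 (Updated 7/12/23)", "9/15/23", "20", "Manufacture>"],
    ["", "", "", "", "10/2/23", "44", " "],
    ["", "", "", "", "11/15/23", "9", ""],
]
filled = fill_down_metadata(example_rows)
```

Here the output still has three rows, each with its own layoff date and count, and each carrying the company name, address, and notice date of the first row.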

Either way, since the source of the data is a PDF to begin with, correcting those records in the CSV generated during the scraping stage seems reasonable.

stucka commented 1 year ago

@zstumgoren, I think your approach is easier to implement than what I had pictured, and keeps any nuance offered by that state. Thank you.

I think the approach I was considering would likely error out in most applications, but maybe that review wouldn't be terrible; e.g., a layoff count of 20449 would trigger a different transform error requiring human review. I'll plan to go with your approach.

stucka commented 1 year ago

Fixed with https://github.com/biglocalnews/warn-scraper/releases/tag/1.2.36