Code4HR / open-health-inspection-scraper

Scraper for the open-health-inspector app.
Apache License 2.0

Need better parsing of violations #6

Closed: ttavenner closed this issue 10 years ago

ttavenner commented 10 years ago

Each violation consists of four parts:

  1. the code violated,
  2. whether it was a core or priority item,
  3. a description of the violation, and
  4. a corrective action to be taken.

These are being correctly parsed into separate components; however, there is no standard as to which index each element ends up in. For example, it could be [0] code [1] [2] core [3] description [4] action

or

[0] [1] code [2] core [3] description [4] action

and so on; any combination is possible. It would be easier to parse if we could standardize or label the indexes. Looking at the HTML, this could be done by identifying the elements by their markup.

The core/priority label is always in a red font tag with bold tags, the corrective action is always in a green font tag with italics, and the description sits between these with no markup. Keying on this markup could also prevent the table header from being included as its own violation, which is currently happening.
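A rough sketch of that idea, using only Python's standard-library `html.parser`, might look like the following. The sample row, the `ViolationParser` and `label_violation` names, the literal `color="red"` / `color="green"` attribute values, and the rule that unmarked text before the red tag is the code are all assumptions for illustration; the real inspection pages may differ.

```python
from html.parser import HTMLParser


class ViolationParser(HTMLParser):
    """Label the text of one violation row by the markup that wraps it."""

    def __init__(self):
        super().__init__()
        self.color = None            # colour of the enclosing <font>, if any
        self.seen_severity = False   # True once the red <font> has closed
        self.parts = {"code": [], "severity": [], "description": [], "action": []}

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            # Assumption: the colour is given as a plain name ("red", "green").
            self.color = (dict(attrs).get("color") or "").lower()

    def handle_endtag(self, tag):
        if tag == "font":
            if self.color == "red":
                self.seen_severity = True
            self.color = None

    def handle_data(self, data):
        if self.color == "red":
            key = "severity"         # core/priority label
        elif self.color == "green":
            key = "action"           # corrective action
        elif self.seen_severity:
            key = "description"      # unmarked text after the red tag
        else:
            key = "code"             # assumption: unmarked text before the red tag
        self.parts[key].append(data)


def label_violation(fragment):
    """Return a dict of labelled fields, or None for rows such as the header."""
    parser = ViolationParser()
    parser.feed(fragment)
    parser.close()
    fields = {k: " ".join("".join(v).split()) for k, v in parser.parts.items()}
    # A header row carries no red or green markup, so it is skipped.
    if not fields["severity"] and not fields["action"]:
        return None
    return fields


# Hypothetical rows, made up for this example.
row = ('<td>3-501.16</td>'
       '<td><font color="red"><b>Priority</b></font> Cold food was held above 41F. '
       '<font color="green"><i>Keep cold food at 41F or below.</i></font></td>')
header = '<th>Code</th><th>Observation</th>'

violation = label_violation(row)
skipped = label_violation(header)
```

Because each field is looked up by name rather than by position, empty components no longer shift the indexes, and a row without any red or green markup is dropped instead of being stored as a violation.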

bschoenfeld commented 10 years ago

:thumbsup: This is coming

On Mon, Mar 3, 2014 at 5:46 PM, ttavenner notifications@github.com wrote:
