DemocracyOS / bill-scraper

Bill scraper for feeding the DemocracyOS platform

Regex discards important text #1

Open gvilarino opened 11 years ago

gvilarino commented 11 years ago

The scraper misdetects content and discards important data. For instance, scraping Cedom's bill 400 yields the following result:

{
"sancion": "01/06/2000",
"publicacion": "BOCBA N° 989 del 21/07/2000",
"promulgacion": "De Hecho del 03/07/2000",
"_id": {
  "$oid": "520a8ec68be1e20000000002"
},
"articulos": [
  {
    "articulo": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "_id": {
      "$oid": "520a8ec68be1e20000000008"
    }
  },
  {
    "articulo": "</b> Ning&uacute;n alumno, con motivo de mora en el ",
    "_id": {
      "$oid": "520a8ec68be1e20000000007"
    }
  },
  {
    "articulo": " </b>los alumnos de los establecimientos citados ",
    "_id": {
      "$oid": "520a8ec68be1e20000000006"
    }
  },
  {
    "articulo": "</b> De verse configurados los extremos descriptos en ",
    "_id": {
      "$oid": "520a8ec68be1e20000000005"
    }
  },
  {
    "articulo": "</b> La Secretar&iacute;a de Educaci&oacute;n podr&aacute; ",
    "_id": {
      "$oid": "520a8ec68be1e20000000004"
    }
  },
  {
    "articulo": "</b> Comun&iacute;quese, etc</P>",
    "_id": {
      "$oid": "520a8ec68be1e20000000003"
    }
  }
],
"__v": 0
}

As you can see, each article's text is incomplete.

In other cases, such as when part of the text contains double quotes (e.g., "Some text"), all of the article's text up to that point is also discarded.

As a general rule, ALL text between two consecutive article titles should be included as part of the first of those two articles.
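
To make the rule concrete, here is a minimal sketch in Python (not the scraper's actual code). It assumes article titles appear as bold runs such as `<b>Artículo 1º.-</b>`, which is only a guess based on the stray `</b>` in the output above. The names `ARTICLE_TITLE` and `split_articles` and the sample text are made up for illustration. The idea is to match only the titles and then slice everything between consecutive titles, so quotes or inline references in the body cannot cut the text short.

```python
import re

# ASSUMPTION: titles look like "<b>Artículo 1º.-</b>" or "<b>Art. 2°</b>".
# Only the title is matched; the body is never described by the regex.
ARTICLE_TITLE = re.compile(r"<b>\s*Art(?:ículo|\.)\s*(\d+)[^<]*</b>", re.IGNORECASE)


def split_articles(text):
    """Return one entry per article, keeping ALL text between consecutive titles."""
    matches = list(ARTICLE_TITLE.finditer(text))
    articles = []
    for i, match in enumerate(matches):
        start = match.end()
        # The body runs until the next title, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        articles.append({
            "numero": int(match.group(1)),
            "articulo": text[start:end].strip(),
        })
    return articles


# Made-up sample: double quotes, a line break, and an inline reference to
# another article, none of which should truncate the body.
sample = (
    '<b>Artículo 1º.-</b> Prohíbese la "retención" de documentación.\n'
    '<b>Artículo 2º.-</b> Lo dispuesto en el artículo 1º rige\n'
    'para todos los niveles.'
)

for article in split_articles(sample):
    print(article["numero"], repr(article["articulo"]))
```

Because the body is taken by slicing rather than by a capturing pattern, whatever sits between two titles ends up in the article, whatever characters it contains.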

ultraklon commented 11 years ago

Guido, can you check whether this is solved? The data is dirty with HTML, but at least it is no longer lost.
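
If the leftover HTML needs cleaning afterwards, something along these lines might work. It is a rough Python sketch, not part of the scraper, and `clean_article` is a hypothetical helper name. It strips tags first, then decodes entities such as `&iacute;` with the standard library's `html.unescape`, and finally collapses whitespace.

```python
import html
import re

# Matches any HTML tag, e.g. "</b>" or "</P>".
TAG = re.compile(r"<[^>]+>")


def clean_article(raw):
    """Strip tags, decode HTML entities, and normalize whitespace."""
    text = TAG.sub(" ", raw)       # remove tags before decoding entities
    text = html.unescape(text)     # "&iacute;" becomes "í"
    return " ".join(text.split())  # collapse runs of whitespace


print(clean_article("</b> Comun&iacute;quese, etc</P>"))
```

Tags are removed before entities are decoded so that an escaped `&lt;` in the text is not mistaken for the start of a tag.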