DemocracyOS / bill-scraper

Bill scraper for feeding the DemocracyOS platform

Preserve article number #4

Open gvilarino opened 11 years ago

gvilarino commented 11 years ago

Right now we're scraping articles like this:

{
  "articulo": "</b> Proh&iacute;bese a los establecimientos educativos ",
  "_id": {
    "$oid": "520a8ec68be1e20000000008"
  }
}

Add a 'number' property to an article's JSON representation and preserve its value so we can display it in the app.

No, we can't rely on order to determine the article numbers.

cristiandouce commented 11 years ago

:+1:

Instead of number as the field name, I would use something like order or, better yet, enumeration. I'm not sure whether mongoose reserves the number key.

gvilarino commented 11 years ago

Good call @cristiandouce. Let's use articleNumber, as that's how congressmen and the laws themselves refer to them.

That is, if there isn't a better name for it under the Akoma Ntoso schema. That should take priority.

cristiandouce commented 11 years ago

I was thinking something like this:

{
  ...
  "articles": [{
    "_id": { "$oid": "520a8ec68be1e20000000008" },
    "text": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "enumeration": "14",
    "sub_enumeration": "bis"
  }, ...]
}

But I'm open to debate.

cristiandouce commented 11 years ago

Or even better:

An order field would give each article's position in the series of articles. We could also have a header field inside each element of the articles array, for example "header": "Artículo 39.b".

{
  ...
  "articles": [{
    "_id": { "$oid": "520a8ec68be1e20000000008" },
    "text": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "order": 1,
    "header": "Artículo 14 bis"
  }, {
    "_id": { "$oid": "520a8ec68be1e20000000008" },
    "text": "</b> Proh&iacute;bese a los establecimientos educativos ",
    "order": 0,
    "header": "Artículo 14"
  }, ...]
}

Instead of header we could use title. At this point it doesn't matter much which one we use. But for me it is important to keep the display order independent of the rendered "title".
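To illustrate why a separate order field helps, here is a minimal Python sketch (the scraper itself is not necessarily written in Python). It assumes article objects shaped like the example above; the helper name in_display_order is hypothetical. The stored array order is ignored and only order decides the display sequence, while header is used purely as a label.

```python
# Articles as they might be stored, deliberately out of sequence.
articles = [
    {"order": 1, "header": "Artículo 14 bis"},
    {"order": 0, "header": "Artículo 14"},
]


def in_display_order(items):
    """Return the articles sorted by their explicit 'order' field."""
    return sorted(items, key=lambda article: article["order"])


# Print each header in the sequence given by 'order'.
for article in in_display_order(articles):
    print(article["header"])
```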

gvilarino commented 11 years ago

Good call again, @cristiandouce. I still think we should not keep "Artículo XX" as a value, though. Maybe keep the number (14, 14b, etc.) but not the word "Artículo", since we've seen that sometimes it's "Artículo", other times "Art.", and so forth.

What do you think about:

order -> position within the series of articles (base 0)

articleId -> "14", "14bis", etc.

?
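A rough Python sketch of how this could work, for discussion only. It assumes the scraper can hand over each article as a (heading, text) pair in document order, and that the heading string holds nothing but the heading. HEADING_RE, parse_article_id and build_articles are hypothetical names, and the pattern only covers the variants mentioned in this thread ("Artículo", "Art.", a number, an optional suffix such as "bis" or "b").

```python
import re

# Hypothetical pattern: "Artículo" / "Articulo" / "Art." / "Art", a number,
# an optional degree sign or dot, and an optional letter suffix ("bis", "b").
HEADING_RE = re.compile(
    r"\s*art(?:[íi]culo|\.)?\s*(\d+)\s*[°º]?\s*\.?\s*([a-z]*)\s*[.:\-]?\s*",
    re.IGNORECASE,
)


def parse_article_id(heading):
    """Return e.g. '14' or '14bis', or None if the heading is not recognised."""
    match = HEADING_RE.fullmatch(heading)
    if match is None:
        return None
    number, suffix = match.groups()
    return number + suffix.lower()


def build_articles(scraped):
    """Turn (heading, text) pairs, given in document order, into article dicts."""
    return [
        {"articleId": parse_article_id(heading), "order": order, "text": text}
        for order, (heading, text) in enumerate(scraped)
    ]
```

Here order comes only from the position in the scraped sequence and articleId only from the heading, so neither depends on the other.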

cristiandouce commented 11 years ago

:+1:

cristiandouce commented 11 years ago

To sum up:

{
  ...
  "articles": [{
    "_id": { "$oid": "520a8ec68be1e20000000008" },
    "articleId": "14bis",
    "order": 14,
    "text": "Procúrese derogar la ley..."
  }, ... ]
}

It is still up for debate whether text should be plain text, Markdown, or raw HTML.

My bet is on Markdown, as I wouldn't like to save stuff like:

<p class="texto-articulo" id="articulo24"> Ténganse en cuenta los siguientes vehículos: </p>
<ul class="sarasa" id="joraca" style="this is full of props">
  <li id="more stuf" class="and moooorrreee">a. Autos</li>
  <li id="more stuf" class="and moooorrreee">b. Motos</li>
  ...
</ul>

And this is just one example.
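As a sketch of what the Markdown option could look like, the Python snippet below uses only the standard library's html.parser to drop every tag and attribute and keep paragraphs and list items as Markdown-style lines. MarkdownishExtractor and html_to_markdown are hypothetical names, and only p and li are handled. It illustrates the idea and is not a full HTML-to-Markdown converter.

```python
from html.parser import HTMLParser


class MarkdownishExtractor(HTMLParser):
    """Keeps the text of <p> and <li> elements and discards all attributes."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished output lines
        self.current = []  # text pieces of the element being read
        self.prefix = ""   # "- " while reading an <li>

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "li"):
            # Start a fresh buffer; text seen between elements is discarded.
            self.current = []
            self.prefix = "- " if tag == "li" else ""

    def handle_data(self, data):
        self.current.append(data)

    def handle_endtag(self, tag):
        if tag in ("p", "li"):
            # Collapse runs of whitespace and emit one line per element.
            text = " ".join("".join(self.current).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.current = []


def html_to_markdown(html):
    """Return one Markdown-style line per paragraph or list item."""
    parser = MarkdownishExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)
```

Fed the snippet above, this would keep the paragraph sentence and turn each li into a "- " bullet, with none of the class, id or style attributes surviving.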

gvilarino commented 11 years ago

I'm ok with this

:shipit: