ManoSeimas / manoseimas.lt

ManoSeimas.lt website source code.
http://manoseimas.lt/
GNU Affero General Public License v3.0

Scrape titles and URLs of Pagrindinio komiteto išvados #113

Closed: laureenas closed this issue 9 years ago

laureenas commented 9 years ago

Scrape and save titles and source URL.

We'll be listing the titles and linking from each title to the source document.

[image]

mgedmin commented 9 years ago

HTML sources have

<caption class="pav">PAGRINDINIO KOMITETO IŠVADA Transporto lengvatų įstatymo 5 straipsnio pakeitimo įstatymo projektui<br><br></caption>

which corresponds to [image: screenshot from 2015-10-27 14-29-12] and is easy to parse.

Unfortunately, this title doesn't necessarily match the one in the original bug description, which for the above document is [image: screenshot from 2015-10-27 14-30-13]

and is produced by terrible HTML like

<p class=Komitetas>LIETUVOS RESPUBLIKOS SEIMO</p>

<p class=Komitetas><span style='text-transform:uppercase'>SOCIALINIŲ REIKALŲ IR
DARBO komitetas </span></p>

<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'>&nbsp;</span></p>

<h2 style='line-height:115%'>PAGRINDINIO KOMITETO</h2>

<h2 style='line-height:normal'>I&nbsp;Š&nbsp;V&nbsp;A&nbsp;D&nbsp;O&nbsp;S</h2>

<p class=MsoNormal style='line-height:150%'><span style='font-family:"Times New Roman","serif"'>&nbsp;</span></p>

<p class=MsoNormal align=center style='margin-top:0cm;margin-right:7.05pt;
margin-bottom:0cm;margin-left:14.2pt;margin-bottom:.0001pt;text-align:center;
line-height:115%'><b>DĖL </b><b><span style='font-family:"Times New Roman","serif"'>LIETUVOS
RESPUBLIKOS TRANSPORTO LENGVATŲ ĮSTATYMO 5 STRAIPSNIO PAKEITIMO </span></b></p>

<p class=Projektas style='line-height:115%'>ĮSTATYMO PROJEKTO (Nr. XIP-3410)</p>

<p class=MsoNormal align=center style='text-align:center'><b><span
style='font-family:"Times New Roman","serif"'>&nbsp;</span></b></p>

<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'>2013 m. spalio 16 d. 103-P-40</span></p>

<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'>Vilnius</span></p>

<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'>&nbsp;</span></p>

<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'>&nbsp;</span></p>

<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'>&nbsp;</span></p>

Note how we can't rely on <p class=Projektas> because it covers only the second half of the title; the first half ("DĖL LIETUVOS RESPUBLIKOS ...") sits in a generic <p class=MsoNormal> paragraph.
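To illustrate the problem, here is a minimal sketch that collects the text of every <p class=Projektas> from a trimmed copy of the markup above. It uses only Python's standard html.parser; the class name ProjektasTextParser and the sample string are made up for this example and are not part of the scraper.

```python
from html.parser import HTMLParser


class ProjektasTextParser(HTMLParser):
    """Hypothetical helper: collect the text of each <p class=Projektas>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._inside = False
        self._chunks = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Unquoted attribute values (class=Projektas) are parsed like quoted ones.
        if tag == "p" and dict(attrs).get("class") == "Projektas":
            self._inside = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self._inside = False
            # Collapse runs of whitespace (including &nbsp;) into single spaces.
            self.texts.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


# Trimmed copy of the two title paragraphs from the document above.
sample = """
<p class=MsoNormal align=center><b>DĖL </b><b><span>LIETUVOS
RESPUBLIKOS TRANSPORTO LENGVATŲ ĮSTATYMO 5 STRAIPSNIO PAKEITIMO </span></b></p>
<p class=Projektas style='line-height:115%'>ĮSTATYMO PROJEKTO (Nr. XIP-3410)</p>
"""

parser = ProjektasTextParser()
parser.feed(sample)
parser.close()
print(parser.texts)  # only the tail of the title is captured
```

The "DĖL LIETUVOS RESPUBLIKOS ..." half never shows up, because it lives in a paragraph that shares its class with all the spacer paragraphs.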

I'm going to parse the <caption>.
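A rough sketch of what parsing the caption could look like, using only the standard library. This is not the code that went into the scraper; the names CaptionTitleParser and extract_caption_title are hypothetical, and the sketch assumes the first <caption class="pav"> on the page is the one that holds the title.

```python
from html.parser import HTMLParser


class CaptionTitleParser(HTMLParser):
    """Hypothetical helper: collect the text of the first <caption class="pav">."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_caption = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "caption" and dict(attrs).get("class") == "pav":
            self._in_caption = True
        elif tag == "br" and self._in_caption:
            # Treat <br> as whitespace so adjacent words don't run together.
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag == "caption" and self._in_caption:
            self._in_caption = False
            self._done = True  # ignore any later captions

    def handle_data(self, data):
        if self._in_caption:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse whitespace and drop the trailing <br><br> padding.
        text = " ".join("".join(self._chunks).split())
        return text or None


def extract_caption_title(html):
    """Return the caption text, or None if no <caption class="pav"> is found."""
    parser = CaptionTitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


sample = (
    '<table><caption class="pav">PAGRINDINIO KOMITETO IŠVADA Transporto lengvatų '
    'įstatymo 5 straipsnio pakeitimo įstatymo projektui<br><br></caption></table>'
)
print(extract_caption_title(sample))
```

Returning None when the caption is missing lets the caller decide whether to skip the document or fall back to something else.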