Closed by laureenas 9 years ago
HTML sources have
<caption class="pav">PAGRINDINIO KOMITETO IŠVADA Transporto lengvatų įstatymo 5 straipsnio pakeitimo įstatymo projektui<br><br></caption>
which holds the document title and is easy to parse.
Unfortunately this title doesn't necessarily match the one in the original bug description, which for the above document is produced by terrible HTML like:
<p class=Komitetas>LIETUVOS RESPUBLIKOS SEIMO</p>
<p class=Komitetas><span style='text-transform:uppercase'>SOCIALINIŲ REIKALŲ IR
DARBO komitetas </span></p>
<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'> </span></p>
<h2 style='line-height:115%'>PAGRINDINIO KOMITETO</h2>
<h2 style='line-height:normal'>I Š V A D O S</h2>
<p class=MsoNormal style='line-height:150%'><span style='font-family:"Times New Roman","serif"'> </span></p>
<p class=MsoNormal align=center style='margin-top:0cm;margin-right:7.05pt;
margin-bottom:0cm;margin-left:14.2pt;margin-bottom:.0001pt;text-align:center;
line-height:115%'><b>DĖL </b><b><span style='font-family:"Times New Roman","serif"'>LIETUVOS
RESPUBLIKOS TRANSPORTO LENGVATŲ ĮSTATYMO 5 STRAIPSNIO PAKEITIMO </span></b></p>
<p class=Projektas style='line-height:115%'>ĮSTATYMO PROJEKTO (Nr. XIP-3410)</p>
<p class=MsoNormal align=center style='text-align:center'><b><span
style='font-family:"Times New Roman","serif"'> </span></b></p>
<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'>2013 m. spalio 16 d. 103-P-40</span></p>
<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'>Vilnius</span></p>
<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'> </span></p>
<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'> </span></p>
<p class=MsoNormal align=center style='text-align:center'><span
style='font-family:"Times New Roman","serif"'> </span></p>
Note how we can't rely on <p class="Projektas"> because it covers only the second half of the title.
I'm going to parse the <caption>.
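A minimal sketch of what parsing the <caption> could look like, using only Python's standard-library html.parser. The names CaptionTitleParser and extract_caption_title are hypothetical and not taken from the project. The sketch assumes the title always sits in a <caption> carrying the class "pav", as in the sample above.

```python
from html.parser import HTMLParser


class CaptionTitleParser(HTMLParser):
    """Collect the text inside the first <caption class="pav"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_caption = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "caption" and "pav" in classes:
            self._in_caption = True
        elif tag == "br" and self._in_caption:
            # Treat <br> as whitespace so adjacent words do not run together.
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag == "caption" and self._in_caption:
            self._in_caption = False
            self._done = True  # only the first matching caption is used

    def handle_data(self, data):
        if self._in_caption:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace (including newlines) into single spaces.
        text = " ".join("".join(self._chunks).split())
        return text or None


def extract_caption_title(html):
    """Return the caption title, or None if no matching caption is found."""
    parser = CaptionTitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

Called on the page source, extract_caption_title returns the caption text with the trailing <br> tags dropped, or None when the page has no such caption, so those documents can be flagged instead of silently getting an empty title.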
Scrape and save titles and source URLs.
We'll be listing the titles and linking from each title to the source document.