dijs / infobox-parser

Parse Wikipedia Infoboxes

Website and some other attributes get corrupted sometimes #11

Closed roelvanhintum closed 6 years ago

roelvanhintum commented 6 years ago

This seems like a good one for the test cases. I'm trying to debug it myself, but no luck so far. https://en.wikipedia.org/wiki/Cedar_Bluff_State_Park

dijs commented 6 years ago

I will check it out, thanks.

dijs commented 6 years ago

Hmmm... I just started to make a test spec for it, and I am seeing no errors. The properties are parsed just fine.

Here they are:

{ name: 'Cedar Bluff State Park',
  category: 'List of Kansas state parks',
  image: 'Cedarblufflimestone.jpg',
  imageCaption: 'Limestone on the edge of Cedar Bluff',
  imageSize: '280',
  country: 'United States',
  state: 'Kansas',
  regionType: 'County',
  region: 'Trego County, Kansas',
  elevationImperial: '2185',
  elevationRound: '0',
  coordinates: '38|48|41|N|99|43|57|W|region:US-KS_dim:3000,title',
  areaUnit: 'acre',
  areaImperial: '850',
  areaRound: '1',
  establishedType: 'Established',
  established: '1962',
  managementBody: 'Kansas Department of Wildlife, Parks and Tourism',
  mapLocator: 'Kansas',
  map: 'Kansas Locator Map.PNG',
  mapCaption: 'Location in Kansas',
  mapSize: '280',
  website: [ 'Trego County, Kansas', 'Kansas', 'United States' ],
  refs:
   [ [ '[http://ksoutdoors.com/State-Parks/Locations/Cedar-Bluff Cedar Bluff State Park] Kansas Department of Wildlife, Parks and Tourism',
       '[http://ksoutdoors.com/State-Parks/Locations/Cedar-Bluff/Cedar-Bluff-Gallery/Cedar-Bluff-Reservoir-Map Cedar Bluff Reservoir Map] Kansas Department of Wildlife, Parks and Tourism\n\n\n{{Protected Areas of Kansas}}\n\nCategory:State parks of Kansas\nCategory:Protected areas of Trego County, Kansas' ] ] }

dijs commented 6 years ago

Are you using the latest version 2.1.0?

roelvanhintum commented 6 years ago

Yes, I'm using the latest version. Looking at the raw data and doing some testing, it looks like the regex is mismatching: the website attribute has a completely different value than the original data.

{{Geobox|Protected area
| name = Cedar Bluff State Park
| category = [[List of Kansas state parks|Kansas State Park]]
| image = Cedarblufflimestone.jpg
| image_caption = Limestone on the edge of Cedar Bluff
| image_size = 280
| country = {{flag|United States}}
| state = {{flag|Kansas}}
| region_type = County
| region = Trego County, Kansas
| location = 
| elevation_imperial = 2185
| elevation_round = 0
| elevation_note = <ref name=gnis>{{cite gnis|2625190|Cedar Bluff State Park Office}}</ref>
| coordinates = {{coord|38|48|41|N|99|43|57|W|region:US-KS_dim:3000|display=inline,title}}
| coordinates_note = <ref name=gnis/>
| highest_coordinates = 
| lowest_coordinates = 
| management_coordinates = 
| government_coordinates = 
| area_unit = acre
| area_imperial = 850
| area_round = 1
| established_type = Established
| established = 1962
| management_body = Kansas Department of Wildlife, Parks and Tourism
| map_locator = Kansas
| map = Kansas Locator Map.PNG | <!-- for valid images, see Template:Geobox locator Kansas -->
| map_caption = Location in Kansas
| map_size = 280
| website = [http://ksoutdoors.com/State-Parks/Locations/Cedar-Bluff Cedar Bluff State Park]
}}

dijs commented 6 years ago

Oh yeah, that website value is very incorrect... haha. I completely missed that.

roelvanhintum commented 6 years ago

Haha nice. I did some regex tests in http://rubular.com/ (it gives some visual feedback), and I couldn't figure out how to fix it, but the regex does match at some weird points. It also looks like more than just the infobox part is passed through. Is the rest of the content relevant for the infobox?
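
For illustration only, here is a minimal sketch of one way to avoid this kind of regex mismatch: split the template on top-level pipes while tracking `{{ }}` and `[[ ]]` nesting, after stripping HTML comments. The function name `splitTopLevelParams` is hypothetical, and this is not how infobox-parser is implemented; it assumes the input is a single trimmed template that starts with `{{` and ends with `}}`.

```javascript
// Hypothetical sketch; NOT the infobox-parser implementation.
// Assumes `template` is one trimmed template: starts with "{{", ends with "}}".
function splitTopLevelParams(template) {
  // Drop HTML comments, then the outer "{{" and "}}".
  const body = template.replace(/<!--[\s\S]*?-->/g, '').slice(2, -2);

  const parts = [];
  let depth = 0; // nesting level of {{ }} and [[ ]]
  let current = '';
  for (let i = 0; i < body.length; i++) {
    const pair = body.slice(i, i + 2);
    if (pair === '{{' || pair === '[[') {
      depth++;
      current += pair;
      i++; // consume the second character of the pair
    } else if (pair === '}}' || pair === ']]') {
      depth--;
      current += pair;
      i++;
    } else if (body[i] === '|' && depth === 0) {
      // Only a pipe outside any nested template or link ends a parameter.
      parts.push(current);
      current = '';
    } else {
      current += body[i];
    }
  }
  parts.push(current);

  const params = {};
  for (const part of parts) {
    const eq = part.indexOf('=');
    if (eq === -1) continue; // positional pieces, e.g. the template name
    const key = part.slice(0, eq).trim();
    if (key) params[key] = part.slice(eq + 1).trim();
  }
  return params;
}

// Usage with a shortened copy of the Geobox source from this thread.
const sample = [
  '{{Geobox|Protected area',
  '| country = {{flag|United States}}',
  '| map = Kansas Locator Map.PNG | <!-- for valid images, see Template:Geobox locator Kansas -->',
  '| website = [http://ksoutdoors.com/State-Parks/Locations/Cedar-Bluff Cedar Bluff State Park]',
  '}}',
].join('\n');
console.log(splitTopLevelParams(sample).website);
```

Because the pipe inside `{{flag|United States}}` sits at depth 1, it does not end the `country` parameter, and the stray pipe before the comment on the `map` line only produces an empty piece that is skipped.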

dijs commented 6 years ago

No, it is not. I am actually checking right now if wikijs ever passes in more than the infobox source.

If not, some of these tests are working too hard.
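
As a rough sketch of the idea of handing the parser only the infobox source, the snippet below cuts the first brace-balanced `{{ ... }}` block out of a larger page, so trailing references, navboxes, and category lines never reach the parser. The helper name `extractFirstTemplate` is hypothetical; it is not part of infobox-parser or wikijs, and it assumes the infobox is the first template on the page.

```javascript
// Hypothetical helper; not part of infobox-parser or wikijs.
// Returns the first brace-balanced "{{ ... }}" block, or null if none is found.
function extractFirstTemplate(source) {
  const start = source.indexOf('{{');
  if (start === -1) return null;

  let depth = 0;
  for (let i = start; i < source.length - 1; i++) {
    const pair = source.slice(i, i + 2);
    if (pair === '{{') {
      depth++;
      i++; // skip the second brace
    } else if (pair === '}}') {
      depth--;
      i++;
      // Depth 0 means the template that opened at `start` has just closed.
      if (depth === 0) return source.slice(start, i + 1);
    }
  }
  return null; // braces never balanced
}

// Usage: everything after the closing "}}" is dropped.
const page = '{{Geobox|Protected area\n| state = {{flag|Kansas}}\n}}\n\n{{Protected Areas of Kansas}}\n\nCategory:State parks of Kansas';
console.log(extractFirstTemplate(page));
```

Counting brace pairs instead of searching for the first `}}` keeps nested templates such as `{{flag|Kansas}}` from ending the block early.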

dijs commented 6 years ago

Okay. We should be good now.

Big refactor. Check out the changes in v4.5.0 of wikijs. If you are using this library directly, the new version is v2.2.1.