derekantrican / MountainProject

A scraper and reddit bot for the website MountainProject.com

[DBBuilder] Clean up area names #38

Closed derekantrican closed 5 years ago

derekantrican commented 5 years ago

We get the names of areas from the "left side panel" of the parent area. We do this because the area name in the large header often contains the words "_ Rock Climbing" or "___ Climbing", which the left side panel does not. But the left side panel can sometimes contain other stray strings. For instance, on this page the areas are listed with the prefixes "A:", "B:", etc. We should either clean these strings out of the left side panel names, or take the header string and remove "Rock Climbing", "Climbing", and similar suffixes.

Here's another example of random strings in area names. And here's another one (note the "14 -" in the header).
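As a rough sketch of the first option (cleaning the left side panel strings), the "A:", "B:" prefixes could be stripped with a small regex. The pattern, the function name `strip_letter_prefix`, and the sample names are all assumptions for illustration, not what DBBuilder actually does:

```python
import re

# Hypothetical pattern: a single capital letter, a colon, then optional whitespace.
LETTER_PREFIX = re.compile(r"^[A-Z]:\s*")


def strip_letter_prefix(name: str) -> str:
    """Remove a leading 'A:'-style label from a side-panel area name."""
    return LETTER_PREFIX.sub("", name.strip())


print(strip_letter_prefix("A: Lower Wall"))
print(strip_letter_prefix("Bishop"))
```

A name that legitimately starts with a letter and a colon would also be trimmed, so the output would need checking against real data.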

derekantrican commented 5 years ago

When doing this, I should keep the old XML file and compare it to the new one before committing.
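One way to do that comparison, as a minimal sketch: diff the two files line by line and keep only the lines that changed. The function name `changed_lines` and the `<Area>` element in the sample are made up here; the real file layout may differ.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return only the removed ('- ') and added ('+ ') lines between two texts."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits '  ' (unchanged) and '? ' (hint) lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical sample content standing in for the old and new XML files.
old_xml = "<Area>Bishop Area</Area>\n<Area>Red Rock</Area>"
new_xml = "<Area>Bishop</Area>\n<Area>Red Rock</Area>"

for line in changed_lines(old_xml, new_xml):
    print(line)
```

Reading the changed lines before committing makes it easy to spot a name that was shortened too aggressively.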

derekantrican commented 5 years ago

We should also remove "Area" from the end of area names. For instance: Bishop Area
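A minimal sketch of that suffix removal, assuming "Area" should only be dropped when it is the last word and something precedes it (the function name `strip_area_suffix` is hypothetical):

```python
import re

# Hypothetical pattern: whitespace followed by 'Area' at the very end of the name.
AREA_SUFFIX = re.compile(r"\s+Area$")


def strip_area_suffix(name: str) -> str:
    """Drop a trailing ' Area' from an area name, leaving other names untouched."""
    return AREA_SUFFIX.sub("", name.strip())


print(strip_area_suffix("Bishop Area"))
print(strip_area_suffix("Area 51"))
```

Requiring whitespace before "Area" means a name that starts with the word, or consists only of it, is left alone.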

derekantrican commented 5 years ago

Reopening this because the location "Red Rock" is listed in the header as "Red Rock Climbing". Because of our regexes, we're shortening this to "Red".

Instead, we should get the name from somewhere else. For instance, here:

image
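The "Red Rock" problem can be reproduced with a small sketch. The patterns below are assumptions standing in for the project's actual regexes, and "Bishop Rock Climbing" is a made-up header used only for contrast:

```python
import re

# Hypothetical "aggressive" pattern: strips 'Climbing' and an optional 'Rock' before it.
AGGRESSIVE = re.compile(r"\s+(Rock\s+)?Climbing$")
# Hypothetical "conservative" pattern: strips only the trailing 'Climbing'.
CONSERVATIVE = re.compile(r"\s+Climbing$")


def clean(header: str, pattern: re.Pattern) -> str:
    """Remove the suffix matched by the given pattern from a header string."""
    return pattern.sub("", header)


print(clean("Red Rock Climbing", AGGRESSIVE))       # loses 'Rock'
print(clean("Red Rock Climbing", CONSERVATIVE))     # keeps 'Red Rock'
print(clean("Bishop Rock Climbing", CONSERVATIVE))  # leaves a stray 'Rock'
```

Neither pattern is right for both headers, because the header text alone does not say whether "Rock" belongs to the name. That is the argument for taking the name from a different part of the page.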

We should also investigate (again) whether we can use a regex such as ^\d+\s?-\s? to remove things such as "05-" from the beginning of an area name.
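A quick way to try that regex is sketched below. The pattern is the one proposed above; the function name `strip_numeric_prefix` and the sample area names are invented for illustration:

```python
import re

# The pattern proposed in the issue: digits, optional space, hyphen, optional space.
NUMERIC_PREFIX = re.compile(r"^\d+\s?-\s?")


def strip_numeric_prefix(name: str) -> str:
    """Remove a leading '05-' or '14 - ' style ordering prefix from an area name."""
    return NUMERIC_PREFIX.sub("", name)


print(strip_numeric_prefix("05-North Face"))
print(strip_numeric_prefix("14 - South Buttress"))
print(strip_numeric_prefix("100-Foot Wall"))  # false positive: the number is part of the name
```

The last case shows the risk to investigate: a name that genuinely begins with a number and a hyphen is trimmed too, so the results would need to be checked against the existing data.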