codeforsanjose / heartofthevalley

Map visualization of murals and public art in South Bay (Bay Area, California)
https://codeforsanjose.github.io/heartofthevalley/
MIT License

Update data to include SJC Public Art #33

Closed ychoy closed 5 years ago

ychoy commented 6 years ago

There's a collection of SJC sponsored public art (about 200+) that we need to include in the map.

ychoy commented 6 years ago

Got 60 records from the spreadsheet, researched and found additional information on the SJC website, and added it to the spreadsheet.

ychoy commented 6 years ago

@amygcho will work on scraping and parsing data about City sponsored public art from http://sanjoseca.gov/facilities, and adding data to art.js

ctram commented 6 years ago

@JMStudiosJoe @amygcho added starter code on a branch:

https://github.com/codeforsanjose/heartofthevalley/commit/79bd87bef613c053b493755c22e79fcba5034262

ctram commented 6 years ago

@ychoy @JMStudiosJoe

The scraper is working and is able to write to a file within the project. Its flow is as follows:

  1. Make a POST request to http://sanjoseca.gov/Facilities/Facility/Search, with the complete query being something like http://sanjoseca.gov/Facilities/Facility/Search?featureIDs=&categoryIDs=15&occupants=null&keywords=&pageSize=100&pageNumber=1&sortBy=3&currentLatitude=null&currentLongitude=null&isReservableOnly=false. This returns HTML listing artworks.
  2. Have the scraper follow the link to each individual artwork.
  3. Have the scraper sort the scraped data into fields such as "artist", "description", etc.
  4. Iterate through the data, cleaning and formatting it.
  5. Save to a file.
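As a rough sketch of step 1 (illustrative only — the function name and defaults are mine, not the actual scraper code; the parameter names come straight from the query string above):

```javascript
// Build the facility-search URL for a given page of results.
// categoryIDs=15 appears to be the public-art category on sanjoseca.gov.
function buildSearchUrl(pageNumber = 1, pageSize = 100) {
  const params = new URLSearchParams({
    featureIDs: '',
    categoryIDs: '15',
    occupants: 'null',
    keywords: '',
    pageSize: String(pageSize),
    pageNumber: String(pageNumber),
    sortBy: '3',
    currentLatitude: 'null',
    currentLongitude: 'null',
    isReservableOnly: 'false',
  });
  return `http://sanjoseca.gov/Facilities/Facility/Search?${params}`;
}

// Steps 2-5 would then POST to this URL, follow each artwork link in the
// returned HTML, extract fields such as "artist" and "description", clean
// them up, and write the result to a JSON file.
```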

The cleanup is not so straightforward because the HTML structure is inconsistent from page to page. For example:

<div class="editorContent">
  <font class="Subhead1">1737 Trees</font>

  <font class="Subhead2">
     Artist: Angela Buenning Filo<br>
     <font class="Normal">2006</font><br>
  </font>
</div>
<div class="editorContent">
    <div class="Normal" style="text-align: left;">
        <font class="Subhead1">
            8 Minutes<em><br></em>
            <font class="Subhead2">
                  Artists: Merge Conceptual Design (Franka Diehnelt and Claudia Reisenberger) 
            </font><br>
       </font>
       2013
</div>
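One way to cope with this (a sketch, not the project's scraper): since both variants tag the title with `class="Subhead1"` and the artist line with `class="Subhead2"`, match on the class names rather than the surrounding structure. Plain regexes here for illustration; the same idea works with a DOM library like cheerio.

```javascript
// Pull the text of the first <font class="..."> of the given class,
// stopping at the next nested <font> or the closing </font>, then
// strip any leftover tags (<br>, <em>, ...) and collapse whitespace.
function extractField(html, className) {
  const re = new RegExp(
    `<font class="${className}">([\\s\\S]*?)(?=<font|</font>)`, 'i');
  const m = html.match(re);
  if (!m) return null;
  return m[1].replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}
```

Against the two snippets above, `extractField(html, 'Subhead1')` yields the title ("1737 Trees" / "8 Minutes") and `extractField(html, 'Subhead2')` yields the "Artist(s): ..." line, despite the different nesting.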

I'll think some more on how to grab this data without too much hassle.

Btw, how were the geolocations in art.js computed?

For the data being scraped, is the idea to use the postal address to determine the lat and long coordinates?

JMStudiosJoe commented 6 years ago

@ctram awesome job. I would assume taking the postal address and converting it to lon/lat. From what I could tell, every public art link title started with "Public Art:", and it looks to be the same with "Artist:"? Please let me know if you need more help on this and I'll do what I can.


ctram commented 6 years ago

@JMStudiosJoe I am able to save the address, but I'm having trouble neatly getting the details (artist, title, etc.) under their proper labels because of the inconsistent HTML structure. Please take a look if you have time. I'll scrape all 200 or so pages later; if the number of exceptions is reasonable, it might be worth it to just manually clean up the oddballs.

I'm currently writing the data as JSON, so a future task is to inject that data into the map.

Where did the current data come from, how did it get into JS object format in art.js file?

JMStudiosJoe commented 6 years ago

@ctram The current data came from a spreadsheet (outdated) and from @ychoy manually entering data. I have not gone into the art.js file; I've mainly been working on the web scraper.

ctram commented 6 years ago

@JMStudiosJoe @ychoy To check, am I OK to use the MapBox API key to generate the geolocation based on postal address?

ctram commented 6 years ago

@JMStudiosJoe @ychoy I believe we can make X amount of API requests per month before they start charging someone's card? : ]

JMStudiosJoe commented 6 years ago

@ctram yes, the MapBox API should be good to use, and this won't be making that many requests per month.

ychoy commented 6 years ago

@ctram, thanks for working on the scraper! Once you start inputting the data from the scrape into art.js, there may be duplicate information — I think we got about 60 records from the City's website into art.js. It's okay to overwrite what I have and just take the information you get from the City's website.

For geocoding lat and long — we've been trying to use OpenStreetMap for everything on this project. Maybe consider using nominatim-browser (https://www.npmjs.com/package/nominatim-browser)? It won't be entirely accurate, because sometimes the position of the public art/mural will not be at the lat and long of the postal address. But until all of this information is entered into OSM and able to be queried, this will work for now.
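For reference, a minimal sketch of the underlying OSM Nominatim search request that the nominatim-browser package wraps (the endpoint and `q`/`format`/`limit` parameters are Nominatim's documented search API; the function name is my own, illustrative):

```javascript
// Build a Nominatim free-form search URL for a postal address.
function nominatimSearchUrl(address) {
  const params = new URLSearchParams({
    q: address,       // free-form address query
    format: 'json',   // JSON results with lat/lon fields
    limit: '1',       // best match only
  });
  return `https://nominatim.openstreetmap.org/search?${params}`;
}

// Usage (first result carries string lat/lon):
// fetch(nominatimSearchUrl('200 E Santa Clara St, San Jose, CA'))
//   .then(res => res.json())
//   .then(([hit]) => console.log(hit.lat, hit.lon));
```

Note that the public Nominatim instance rate-limits requests, so the scraper should throttle its geocoding calls.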

This is the general format of each JS object in art.js. We have a separate issue for cleaning up art.js (I injected a lot of HTML tags, since some pieces have multiple artists and thus multiple websites, etc.). So I propose that we add additional attributes to look out for: sourceOfInformation would be the City of San Jose Public Art Program, and sourceURL is the specific webpage with the details about the public art/mural piece. If information about an artist website exists, include it.

                "geometry": {
                    "type": "Point",
                    "coordinates": []
                },
                "properties": {
                    "title": "",
                    "artist1": "",
                    "artist2": "",
                    "artist3": "",
                    "artist1website": "",
                    "artist2website": "",
                    "artist3website": "",
                    "description": "",
                    "sourceOfInformation": "",
                    "sourceURL": "",
                    "address": "",
                    "city": "",
                    "country": "",
                    "postalCode": "",
                    "state": ""
                }
            }

I realized I hadn't updated the API key. I have a key from CFA, which should allow for more API requests each month. I'll update it today.

Let me know if you have any more questions.

ctram commented 6 years ago

@ychoy Thanks! To be clear, the art.js data came from the city website, but did not come from http://sanjoseca.gov/Facilities, is that correct?

Yes, I will be working to consolidate all the data into a single JSON file.

I have Nominatim up and running, thanks for the suggestion!

@ychoy @JMStudiosJoe might you know how to get JSON data to the client without necessitating a call to a server? I am saving the scraped data as JSON, but I'm not familiar with how to include JSON data with the index.html file download; for example, would you include a <script> tag with a source pointing to the JSON?
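One no-server option, mirroring how art.js itself is already loaded (a sketch — the file and variable names here are illustrative, not project conventions): write the scraped JSON into a small .js file that assigns it to a global, then load that file with a plain `<script>` tag in index.html.

```javascript
// Turn a parsed JSON value into a loadable script that exposes it globally.
function toScriptSource(varName, data) {
  return `var ${varName} = ${JSON.stringify(data, null, 2)};`;
}

// e.g. fs.writeFileSync('scraped-art.js', toScriptSource('scrapedArt', records));
// and in index.html: <script src="scraped-art.js"></script>
// after which the page's map code can read window.scrapedArt directly.
```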

JMStudiosJoe commented 6 years ago

@ctram likely we will have a frontend framework such as React or Angular that will serve the file as needed; at least that would be a part of the plan.