Closed ychoy closed 5 years ago
Got 60 records from the spreadsheet, researched and found additional information from the SJC website, and added it to the spreadsheet.
@amygcho will work on scraping and parsing data about City-sponsored public art from http://sanjoseca.gov/facilities and adding the data to art.js
@JMStudiosJoe @amygcho added starter code on a branch:
https://github.com/codeforsanjose/heartofthevalley/commit/79bd87bef613c053b493755c22e79fcba5034262
@ychoy @JMStudiosJoe
The scraper is working and is able to write to a file within the project. The flow of the scraper is as follows: it queries
http://sanjoseca.gov/Facilities/Facility/Search
with the complete query being something like http://sanjoseca.gov/Facilities/Facility/Search?featureIDs=&categoryIDs=15&occupants=null&keywords=&pageSize=100&pageNumber=1&sortBy=3&currentLatitude=null&currentLongitude=null&isReservableOnly=false
which returns HTML listing the artworks. The cleanup is not so straightforward because the HTML is inconsistent from individual page to page. For example:
<div class="editorContent">
<font class="Subhead1">1737 Trees</font>
<font class="Subhead2">
Artist: Angela Buenning Filo<br>
<font class="Normal">2006</font><br>
</font>
</div>
<div class="editorContent">
<div class="Normal" style="text-align: left;">
<font class="Subhead1">
8 Minutes<em><br></em>
<font class="Subhead2">
Artists: Merge Conceptual Design (Franka Diehnelt and Claudia Reisenberger)
</font><br>
</font>
2013
</div>
I'll think some more on how to grab this data without too much hassle.
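One tolerant approach: match on the Subhead1/Subhead2 class names, which both samples share even though their nesting differs. A regex-based sketch (a hypothetical helper, not the branch code):

```javascript
// Pull title, artist(s), and year out of an "editorContent" fragment.
// Keys off the Subhead1/Subhead2 font classes rather than the nesting,
// since the nesting varies from page to page.
function parseArtwork(html) {
  const grab = (re) => {
    const m = html.match(re);
    return m ? m[1].replace(/<[^>]+>/g, "").trim() : null;
  };
  return {
    title: grab(/<font class="Subhead1">([\s\S]*?)<font class="Subhead2">/),
    artists: grab(/Artists?:\s*([^<]+)/),
    year: grab(/(\b(?:19|20)\d{2}\b)/),
  };
}
```

Records where any field comes back null could be logged for manual cleanup.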
Btw, how were the geolocations in art.js computed?
For the data being scraped, is the idea to use the postal address to determine the lat and long coordinates?
@ychoy @JMStudiosJoe ...
@ctram awesome job. I would assume taking the postal address and converting it to lon/lat. From what I could tell, every public art link title started with "Public Art:", and it looks to be the same with "Artist:". Please let me know if you need more help on this and I'll do what I can.
@JMStudiosJoe I am able to save the address, but I'm having trouble neatly getting the details (artist, title, etc.) under their proper labels because of the inconsistent HTML structure. Please take a look if you have time. I'll scrape all 200 or so pages later; if the number of exceptions is reasonable, it might be worth it to just manually clean up the oddballs.
I'm currently writing the data as JSON, so a future task is to inject that data into the map.
Where did the current data come from, and how did it get into JS object format in the art.js file?
@ctram The current data came from an (outdated) spreadsheet and from @ychoy manually entering data. I have not gone into the art.js file; I've mainly been working on the web scraper.
@JMStudiosJoe @ychoy To check, am I OK to use the MapBox API key to generate the geolocation based on postal address?
@JMStudiosJoe @ychoy I believe we can make X API requests per month before they start charging someone's card? : ]
@ctram Yes, the MapBox API should be good to use, and this won't be making that many requests per month.
@ctram, thanks for working on the scraper! Once you start inputting the data from the scrape into art.js, there may be duplicate information - I think we got about 60 records from the City's website into art.js. It's okay to overwrite what I have and just take the information you get from the City's website.
For geocoding lat and long - we've been trying to use OpenStreetMap for everything on this project. Maybe consider using nominatim-browser (https://www.npmjs.com/package/nominatim-browser)? It won't be entirely accurate, because sometimes the position of the public art/mural will not be at the lat and long of the postal address. But until all of this information is inputted into OSM and able to be queried, this will work for now.
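For orientation, Nominatim's public search endpoint takes a free-form query and a response format; a sketch of building such a request URL (the helper name is my own, and the usage-policy note reflects Nominatim's published limits):

```javascript
// Build a Nominatim search URL for free-form address geocoding.
// Nominatim's usage policy asks for a descriptive User-Agent and at most
// one request per second, so throttle when geocoding ~200 records.
function nominatimUrl(address) {
  const params = new URLSearchParams({
    q: address,
    format: "json",
    limit: "1",
  });
  return `https://nominatim.openstreetmap.org/search?${params}`;
}
```

The JSON response items carry `lat` and `lon` fields as strings, which would need parsing before going into the GeoJSON coordinates.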
This is the general format of each JS object in art.js. We have a separate issue for cleaning up art.js (because I injected a lot of HTML tags, since some pieces have multiple artists and thus multiple websites, etc.). So I propose that we add additional attributes to look out for: sourceOfInformation would be the City of San Jose Public Art Program, and sourceURL is the specific webpage with the details about the public art/mural piece. If information about an artist website exists, include it.
"geometry": {
"type": "Point",
"coordinates": [
]
},
"properties": {
"title": "",
**"artist1": "",
"artist2": "",
"artist3": "",
"artist1website": "",
"artist2website": "",
"artist3website": "",**
"description": "",
"**sourceOfInformation": "",
"sourceURL": "",**
"address": "",
"city": "",
"country": "",
"postalCode": "",
"state": ""
}
}
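A small helper could map one scraped record into that template. A sketch, where the `record` field names and the hardcoded San Jose city/state defaults are my assumptions:

```javascript
// Convert one scraped record into the art.js feature template above.
// `record` is assumed to look like { title, artists: [...], description?,
// address, sourceURL }; lonLat is [longitude, latitude] from geocoding.
function toFeature(record, lonLat) {
  const [a1 = "", a2 = "", a3 = ""] = record.artists || [];
  return {
    geometry: { type: "Point", coordinates: lonLat || [] },
    properties: {
      title: record.title || "",
      artist1: a1,
      artist2: a2,
      artist3: a3,
      artist1website: "",
      artist2website: "",
      artist3website: "",
      description: record.description || "",
      sourceOfInformation: "City of San Jose Public Art Program",
      sourceURL: record.sourceURL || "",
      address: record.address || "",
      city: "San Jose",   // assumption: all scraped pieces are in San Jose
      country: "USA",
      postalCode: "",
      state: "CA",
    },
  };
}
```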
I realized I hadn't updated the API key. I have a key from CFA, which should allow for more API requests each month. I'll update it today.
Let me know if you have any more questions.
@ychoy Thanks! To be clear, the art.js data came from the city website, but did not come from http://sanjoseca.gov/Facilities, is that correct?
Yes, I will be working to consolidate all the data into a single JSON file.
I have Nominatim up and running, thanks for the suggestion!
@ychoy @JMStudiosJoe Might you know how to get JSON data to the client without necessitating a call to a server? I am saving the scraped data as JSON, but I'm not familiar with how to include JSON data with the index.html file download; for example, would you include a <script> tag with a source pointing to the JSON?
@ctram Likely we will have a frontend site, such as React or Angular, that will serve the file as needed; at least that would be a part of the plan.
There's a collection of SJC-sponsored public art (about 200+ pieces) that we need to include in the map.