Closed matkoniecz closed 6 months ago
This one is a bit of a weird one, well a pain. The data is only in javascript. We have chompjs that can parse it, but it's not ideal, in this case it turns Latitude : 50.08325
into "Latitude ": 50.08325
(trailing space). We can work with it, or replace it, but it may cause us issues if there are :
s in the addresses, so it may be better for us to tolerate it... Your call.
Anyway, here is where I got to. Hope you can finish it off.
import re
from typing import Any
import chompjs
from scrapy import Spider
from scrapy.http import Response
from locations.items import Feature
class TestSpider(Spider):
name = "test"
item_attributes = {"brand": "KKM", "brand_wikidata": "Q57515549"}
start_urls = ["https://kkm.krakow.pl/pl/punkty-sprzedazy-biletow/"]
def parse(self, response: Response, **kwargs: Any) -> Any:
for location_raw in re.findall(r"items\.push\(\s+({.+?})\)", response.text, re.DOTALL):
location = chompjs.parse_js_object(re.sub(r"\s:", ":", location_raw))
item = Feature()
item["ref"] = location["Id"]
item["lat"] = location["Latitude"]
item["lon"] = location["Longitude"]
print(location) # TODO: address? TypeId/TypeName
yield item
Hello, chompjs author here.
Fixed this issue (https://github.com/Nykakin/chompjs/pull/57), released in 1.2.3 version.
>>> import chompjs
>>>
>>> chompjs.parse_js_object("{Latitude : 50.08325}")
{'Latitude': 50.08325}
Please don't hesitate to open issues if you encounter any problems with the library.
we are using old version for now, see https://github.com/alltheplaces/alltheplaces/blob/master/Pipfile.lock#L403
I guess it waits for dependabot to notice new release ( https://github.com/alltheplaces/alltheplaces/commits/master/Pipfile.lock ) ? Or maybe it should be upgraded manually....
https://github.com/alltheplaces/alltheplaces/pulls/app%2Fdependabot has no open issues
Feel free to update chompjs in the same PR. $ pipenv update chompjs
hmmm, I tried this - and it fails with
pipenv.patched.pip._vendor.packaging.specifiers.InvalidSpecifier: Invalid specifier: '22.10.0'
likely referring to twisted
https://gist.github.com/matkoniecz/cbb5f231bf6c27166b2fb026cc59beea has logs
According to https://stackoverflow.com/questions/52855310/pipenv-packaging-specifiers-invalidspecifier-invalid-specifier#52855311 this will happen when specifier is not matching expected format.
No idea what should be done if package actually has such version.
The upgrade to chompjs 1.2.3 happened in https://github.com/alltheplaces/alltheplaces/pull/7341.
Twisted is pinned because of https://github.com/alltheplaces/alltheplaces/pull/6430. I'll remove that pin because there's a new version of scrapy now.
If it's only some of the POIs, set a value, we typically use item["nsi_id"] = "N/A"
. If it's the entire spider, disable the pipeline:
custom_settings = {
"ITEM_PIPELINES": ITEM_PIPELINES | {"locations.pipelines.apply_nsi_categories.ApplyNSICategoriesPipeline": None}
}
Would it make sense to mention it at https://github.com/alltheplaces/alltheplaces/blob/master/docs/WIKIDATA.md ?
There is a tantalizing hint "In the cases where it fails for some reason then the automation can be disabled for the spider" that I found but was later stuck.
I'm gonna write some docs for all pipelines.
https://github.com/alltheplaces/alltheplaces/pull/7610 should have implement that
Brand name
KKM
Wikidata ID
Q57515549
Store finder url(s)
https://kkm.krakow.pl/pl/punkty-sprzedazy-biletow/
see blue one mapped already https://www.openstreetmap.org/node/1907768673
not in ATP: https://www.alltheplaces.xyz/map/#19.29/50.0816335/19.8819302
(I can trying adding spider)