alltheplaces / alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet.
https://www.alltheplaces.xyz

vending=public_transport_tickets in Kraków #6907

Closed matkoniecz closed 6 months ago

matkoniecz commented 8 months ago

Brand name

KKM

Wikidata ID

Q57515549

Store finder url(s)

https://kkm.krakow.pl/pl/punkty-sprzedazy-biletow/

see the blue one already mapped: https://www.openstreetmap.org/node/1907768673

not in ATP: https://www.alltheplaces.xyz/map/#19.29/50.0816335/19.8819302

(I can try adding a spider)

Cj-Malone commented 8 months ago

This one is a bit of a weird one, well, a pain. The data is only in JavaScript. We have chompjs, which can parse it, but it's not ideal: in this case it turns Latitude : 50.08325 into "Latitude ": 50.08325 (with a trailing space in the key). We can either strip that whitespace before parsing or tolerate it; stripping may cause us issues if there are :s in the addresses, so it may be better for us to tolerate it... Your call.
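
For illustration, a minimal reproduction of that behaviour with the chompjs release pinned at the time, plus the whitespace-stripping workaround used in the draft below (the address string is made up to show the risk):

import re

import chompjs

raw = "{Latitude : 50.08325}"

# With the chompjs release pinned at the time, the space before the colon
# ends up inside the key name:
print(chompjs.parse_js_object(raw))  # {'Latitude ': 50.08325}

# Workaround: drop whitespace immediately before colons before parsing.
print(chompjs.parse_js_object(re.sub(r"\s:", ":", raw)))  # {'Latitude': 50.08325}

# The catch: the same substitution also rewrites colons inside string values,
# e.g. a made-up address value containing " : " loses its spacing, which is
# why tolerating the trailing space might be the safer option.
print(re.sub(r"\s:", ":", '{Address : "Dworzec : peron 1"}'))  # {Address: "Dworzec: peron 1"}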

Anyway, here is where I got to. Hope you can finish it off.

import re
from typing import Any

import chompjs
from scrapy import Spider
from scrapy.http import Response

from locations.items import Feature

class TestSpider(Spider):
    name = "test"
    item_attributes = {"brand": "KKM", "brand_wikidata": "Q57515549"}
    start_urls = ["https://kkm.krakow.pl/pl/punkty-sprzedazy-biletow/"]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        # Each sales point is pushed onto a JS array in the page source: items.push({ ... })
        for location_raw in re.findall(r"items\.push\(\s+({.+?})\)", response.text, re.DOTALL):
            # Strip whitespace before ":" so chompjs doesn't keep it in the key names
            location = chompjs.parse_js_object(re.sub(r"\s:", ":", location_raw))
            item = Feature()
            item["ref"] = location["Id"]
            item["lat"] = location["Latitude"]
            item["lon"] = location["Longitude"]
            print(location)  # TODO: address? TypeId/TypeName
            yield item
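
For reference, one possible way to finish the draft off, sketched here with some assumptions: the spider name is hypothetical, the address/TypeId/TypeName handling from the TODO is left out, and the category tagging assumes the project's apply_category helper in locations.categories accepts a plain tag dict (the vending=public_transport_tickets tag comes from the issue title):

import re
from typing import Any

import chompjs
from scrapy import Spider
from scrapy.http import Response

from locations.categories import apply_category  # assumed helper location
from locations.items import Feature

class KKMKrakowSpider(Spider):  # hypothetical final name
    name = "kkm_krakow"
    item_attributes = {"brand": "KKM", "brand_wikidata": "Q57515549"}
    start_urls = ["https://kkm.krakow.pl/pl/punkty-sprzedazy-biletow/"]

    def parse(self, response: Response, **kwargs: Any) -> Any:
        for location_raw in re.findall(r"items\.push\(\s+({.+?})\)", response.text, re.DOTALL):
            location = chompjs.parse_js_object(re.sub(r"\s:", ":", location_raw))
            item = Feature()
            item["ref"] = location["Id"]
            item["lat"] = location["Latitude"]
            item["lon"] = location["Longitude"]
            # Tag the POI as a public transport ticket sales point, per the issue title.
            apply_category({"vending": "public_transport_tickets"}, item)
            yield item
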
Nykakin commented 7 months ago

Hello, chompjs author here.

Fixed this issue (https://github.com/Nykakin/chompjs/pull/57); the fix is released in version 1.2.3.

>>> import chompjs
>>> 
>>> chompjs.parse_js_object("{Latitude :  50.08325}")
{'Latitude': 50.08325}

Please don't hesitate to open issues if you encounter any problems with the library.

matkoniecz commented 7 months ago

we are using the old version for now, see https://github.com/alltheplaces/alltheplaces/blob/master/Pipfile.lock#L403

I guess it waits for dependabot to notice the new release (https://github.com/alltheplaces/alltheplaces/commits/master/Pipfile.lock)? Or maybe it should be upgraded manually...

https://github.com/alltheplaces/alltheplaces/pulls/app%2Fdependabot has no open pull requests

Cj-Malone commented 7 months ago

Feel free to update chompjs in the same PR.

$ pipenv update chompjs

matkoniecz commented 7 months ago

hmmm, I tried this, and it fails with:

pipenv.patched.pip._vendor.packaging.specifiers.InvalidSpecifier: Invalid specifier: '22.10.0'

likely referring to Twisted

https://gist.github.com/matkoniecz/cbb5f231bf6c27166b2fb026cc59beea has logs

According to https://stackoverflow.com/questions/52855310/pipenv-packaging-specifiers-invalidspecifier-invalid-specifier#52855311 this happens when a specifier does not match the expected format.

No idea what should be done if the package actually has such a version.
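
For what it's worth, the error itself just means that a bare version string is not a valid requirement specifier (a specifier needs a comparison operator such as ==). A tiny illustration with the standalone packaging library, which is the same code pipenv vendors; why pipenv ends up feeding it a bare '22.10.0' here is the part the logs don't explain:

from packaging.specifiers import InvalidSpecifier, SpecifierSet

SpecifierSet("==22.10.0")  # fine: a version with a comparison operator

try:
    SpecifierSet("22.10.0")  # a bare version string is not a specifier
except InvalidSpecifier as exc:
    print(exc)  # Invalid specifier: '22.10.0'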

iandees commented 7 months ago

The upgrade to chompjs 1.2.3 happened in https://github.com/alltheplaces/alltheplaces/pull/7341.

Twisted is pinned because of https://github.com/alltheplaces/alltheplaces/pull/6430. I'll remove that pin because there's a new version of Scrapy now.

matkoniecz commented 7 months ago

#7610 has a first attempt that is sort of working (I'm still figuring out how to selectively disable NSI matching; I found docs mentioning that it can be done, but not how)

Cj-Malone commented 7 months ago

If it's only some of the POIs, set a value; we typically use item["nsi_id"] = "N/A". If it's the entire spider, disable the pipeline:

    custom_settings = {
        "ITEM_PIPELINES": ITEM_PIPELINES | {"locations.pipelines.apply_nsi_categories.ApplyNSICategoriesPipeline": None}
    }
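
For context, a rough sketch of how the two options sit in a spider. The locations.settings import for ITEM_PIPELINES and the example spider itself are assumptions for illustration, not code from the repo:

from scrapy import Spider

from locations.items import Feature
from locations.settings import ITEM_PIPELINES  # assumed home of the default pipeline map

class ExampleSpider(Spider):  # hypothetical spider for illustration
    name = "example"

    # Option 1: disable NSI matching for the whole spider by dropping the pipeline.
    custom_settings = {
        "ITEM_PIPELINES": ITEM_PIPELINES
        | {"locations.pipelines.apply_nsi_categories.ApplyNSICategoriesPipeline": None}
    }

    def parse(self, response):
        item = Feature(ref="example", lat=50.08325, lon=19.88193)
        # Option 2 (instead of the custom_settings above): opt out per POI.
        item["nsi_id"] = "N/A"
        yield item
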
matkoniecz commented 7 months ago

Would it make sense to mention it at https://github.com/alltheplaces/alltheplaces/blob/master/docs/WIKIDATA.md?

I found a tantalizing hint there ("In the cases where it fails for some reason then the automation can be disabled for the spider") but then got stuck.

Cj-Malone commented 7 months ago

I'm gonna write some docs for all pipelines.

matkoniecz commented 6 months ago

https://github.com/alltheplaces/alltheplaces/pull/7610 should have implemented that