OSMLatvija / Osmalyzer

Parsing OSM data in Latvia against various data sources
https://osmlatvija.github.io/Osmalyzer/
GNU General Public License v3.0

Rīgas Satiksme ticket vending machines #29

Closed HellMapGoesCoding closed 1 month ago

HellMapGoesCoding commented 6 months ago

https://www.rigassatiksme.lv/lv/biletes/bilesu-tirdzniecibas-vietas/bilesu-automati/

markalex2209 commented 1 month ago

The map used there is a user-created Google map. Full (non-embedded) version: https://www.google.com/maps/d/viewer?mid=1fHZLaJ1t5cPs9PbaUotV_-IlwVs . I am not sure whether the link is preserved when the map is updated. At the time of posting, the map was last updated on May 7.

That same view allows downloading KML/KMZ. I think it's possible to parse or generate the link for such a download. To evaluate how feasible this solution is:

  1. Will Google allow downloading the KML/KMZ file without a captcha?
  2. Can we reasonably parse the KML/KMZ file format?
  3. Is the link preserved, or does it need to be found from the page every time?

HellMapGoesCoding commented 1 month ago

I am already doing this in another place: downloading KML from a Google Maps URL in GlikaOzoliAnalysisData.

  1. Will Google allow downloading the KML/KMZ file without a captcha?

I have not encountered a captcha... yet.

  2. Can we reasonably parse the KML/KMZ file format?

KML output can be forced in the URL (with forcekml=1), and I am using the SharpKml package, which can parse it.
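To illustrate what parsing such a file involves, here is a minimal sketch in Python using only the standard library. It is not the SharpKml-based code the project uses. The function name `parse_placemarks` and the sample document (including the machine name and coordinates) are made up for illustration. It relies on the KML convention that coordinates are written as longitude,latitude[,altitude].

```python
import xml.etree.ElementTree as ET

# Standard KML 2.2 namespace, needed for every element lookup.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Made-up sample document for illustration only (not real RS data).
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example machine</name>
      <Point><coordinates>24.1052,56.9496,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Placemark without a point</name>
    </Placemark>
  </Document>
</kml>"""


def parse_placemarks(kml_text):
    """Return a list of (name, lat, lon) for every Placemark that has a Point."""
    root = ET.fromstring(kml_text)
    results = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS).strip()
        coords = placemark.findtext(
            "kml:Point/kml:coordinates", default="", namespaces=KML_NS
        ).strip()
        if not coords:
            continue  # skip lines, polygons and placemarks without geometry
        # KML stores longitude first, then latitude, then optional altitude.
        lon, lat = coords.split(",")[:2]
        results.append((name, float(lat), float(lon)))
    return results


if __name__ == "__main__":
    for name, lat, lon in parse_placemarks(SAMPLE_KML):
        print(name, lat, lon)
```

A KMZ file would first have to be unzipped; forcing KML output in the URL avoids that step.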

  3. Is the link preserved, or does it need to be found from the page every time?

The URL in the code appears to be https://www.google.com/maps/d/embed?mid=z04Qg9kXVnqk.kwqapebDyDDY, but clicking it takes you to 1fHZLaJ1t5cPs9PbaUotV_-IlwVs. I assume RS updated the map, used the old URL, and Google now "redirects".

For Glika stuff, I parse the website for the URL each time. I generally prefer to do it this way because I have no idea if they will make a new map or update existing. I think it should be done this way here too, since the above IDs already mismatch...
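The "parse the website for the URL each time" approach can be sketched roughly as follows, in Python rather than the project's own language. The regular expression and the helper names `extract_map_id` and `build_kml_url` are assumptions for illustration; the real page markup may differ. The KML URL shape (`/maps/d/kml?mid=...&forcekml=1`) is the one that appears later in this thread.

```python
import re
from urllib.parse import urlencode

# Assumed pattern: an embed or viewer link to a Google My Maps map somewhere
# in the page HTML. The actual markup of the RS page is not verified here.
MAP_ID_PATTERN = re.compile(
    r"google\.com/maps/d/(?:u/\d+/)?(?:embed|viewer)\?mid=([A-Za-z0-9_.\-]+)"
)


def extract_map_id(html):
    """Return the map ID found in the HTML, or None if there is no match."""
    match = MAP_ID_PATTERN.search(html)
    return match.group(1) if match else None


def build_kml_url(map_id):
    """Build the forced-KML download URL; refuse to build one without an ID."""
    if not map_id:
        raise ValueError("map ID is empty, refusing to build a KML URL")
    query = urlencode({"mid": map_id, "forcekml": "1"})
    return "https://www.google.com/maps/d/kml?" + query


if __name__ == "__main__":
    page = '<iframe src="https://www.google.com/maps/d/embed?mid=1fHZLaJ1t5cPs9PbaUotV_-IlwVs"></iframe>'
    print(build_kml_url(extract_map_id(page)))
```

Returning None on a failed match and rejecting an empty ID keeps a parsing failure from silently turning into a request for a URL with no map ID in it.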

markalex2209 commented 1 month ago

I am already doing this in another place: downloading KML from a Google Maps URL in GlikaOzoliAnalysisData.

Yes, already found and implemented in a similar manner. See #38.

For Glika stuff, I parse the website for the URL each time. I generally prefer to do it this way because I have no idea if they will make a new map or update existing. I think it should be done this way here too, since the above IDs already mismatch...

Done the same, for consistency and to avoid breakage if the map is ever replaced and the link changes.

HellMapGoesCoding commented 1 month ago

Failed to get data on GitHub (worked locally).

Inner exception message: An error occurred while sending the request.
Inner exception message: The response ended prematurely.

Well, this is as vague as it gets...

markalex2209 commented 1 month ago

Sudden network issue? Before changing anything, can you run it once more, just to check?

HellMapGoesCoding commented 1 month ago

Same result, unfortunately. I am not sure whether it's the RS site or Google that fails; I would need to add some debug output. The other Google call works, so it's possible that it's actually RS failing.

HellMapGoesCoding commented 1 month ago

It was indeed RS:

Exception: Failed to read RS page

So I switched to browsing. But then it decided that

Exception: Failed to read Google Maps kml page
Inner exception: Response status code does not indicate success: 404 (Not Found).

So I checked what URL it's using and apparently

Exception: Failed to read Google Maps kml page (https://www.google.com/maps/d/kml?mid=&forcekml=1)

So the ID is missing for some reason. Since the code was assuming the regex match succeeds, I made it actually check, and sure enough:

Exception: Couldn't parse RS site html for the Google Maps KML ID

So I dumped the received HTML to see what it actually got.

Exception: Couldn't parse RS site html for the Google Maps KML ID (saved html dump in output 'RS-vending-html-dump.html')

And apparently it's https://osmlatvija.github.io/Osmalyzer/RS-vending-html-dump.html :

<html><head></head><body></body></html>

So I went ahead and also actually recorded the headers from the most recent browsing call

Exception: Couldn't parse RS site html for the Google Maps KML ID (saved html dump in output 'RS-vending-html-dump.html' and headers in 'RS-vending-header-dump.html')

But there are literally none: https://osmlatvija.github.io/Osmalyzer/RS-vending-header-dump.html

So I assume that the HTML is likely just a "browser" placeholder and that it never "connected" anywhere, or at least never got to the point where the server would start sending headers. I suspect this is related to the SSL problem from the very first message, so I set the browsing to ignore cert/SSL errors. That didn't change anything. It is possible GitHub is not going to let me make such "broken" connections for security reasons.

So for now, that's that.

Will probably end up just hard-coding the ID.

markalex2209 commented 1 month ago

SSL problem was in another feature: one of the parcel lockers (IIUC, Venipak).

I don't think this is SSL related. Any issue with SSL/TLS would result in a failure to establish any connection, and you would not get even an empty HTML page (unless that empty page was not actually received, and was simply returned by the browser engine).

I tested whether the site is accessible from outside Latvia, and it is not (US VPN):

curl https://www.rigassatiksme.lv/lv/biletes/bilesu-tirdzniecibas-vietas/bilesu-automati/
curl : The underlying connection was closed: The connection was closed unexpectedly.
At line:1 char:1
+ curl https://www.rigassatiksme.lv/lv/biletes/bilesu-tirdzniecibas-vie ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

HellMapGoesCoding commented 1 month ago

Hmm, it looks like it's blocked outside Europe. It works with a VPN from Sweden, Denmark, Greece and the UK, but not from the US, Brazil, Australia, Egypt or Japan. So it's not blocked everywhere, but widely enough to fail from a US-based GitHub runner.

unless that empty page was not actually received, and was simply returned by browser engine

I believe that is exactly what it is: just an empty placeholder page that the browser creates so that it can "render" nothing while still producing valid HTML output. I imagine there are good reasons to do it. Since no headers were received, I am almost certain this isn't any sort of response from the RS site.

HellMapGoesCoding commented 4 weeks ago

Anyhow, I made it fall back to a hard-coded ID. It will still attempt to fetch the site, but failing that, it just uses the known ID.
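The fallback described above could look roughly like this. This is a sketch in Python, not the code that was actually committed. The name `resolve_map_id`, the injected `fetch_page` and `extract_id` callables, and the choice of the ID quoted earlier in this thread as the fallback value are all assumptions for illustration.

```python
# Assumption: the "known ID" is the one quoted earlier in this thread.
FALLBACK_MAP_ID = "1fHZLaJ1t5cPs9PbaUotV_-IlwVs"


def resolve_map_id(fetch_page, extract_id, fallback_id=FALLBACK_MAP_ID):
    """Try to read the map ID from the live page; fall back to a known ID.

    fetch_page: hypothetical callable returning the page HTML as a string.
    extract_id: hypothetical callable returning the ID or None for that HTML.
    Returns a (map_id, source) tuple where source is "site" or "fallback".
    """
    try:
        html = fetch_page()
    except Exception:
        # Geo-blocked or otherwise unreachable, e.g. from a CI runner abroad.
        return fallback_id, "fallback"

    map_id = extract_id(html)
    if not map_id:
        # Page was "received" but is empty or no longer contains the map link.
        return fallback_id, "fallback"
    return map_id, "site"


if __name__ == "__main__":
    def blocked():
        raise ConnectionError("The response ended prematurely.")

    print(resolve_map_id(blocked, lambda html: None))
```

Returning the source alongside the ID makes it possible to note in the report that the fallback was used, so a stale hard-coded ID does not go unnoticed.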