Ukraine-Relief-Efforts / scraper

Scrape border crossing data from countries bordering Ukraine
MIT License

Write scraper logic for Slovakian border guard #5

Closed EchoRuby68 closed 1 year ago

Aziroshin commented 2 years ago

I've started looking into writing a scraper for the Slovakian border crossings. The first thing I'm doing now is creating a KML file for them.

Jachdich commented 2 years ago

@Aziroshin What's the status on this? Also where are you getting the data to scrape from?

Aziroshin commented 2 years ago

> @Aziroshin What's the status on this? Also where are you getting the data to scrape from?

I'm still gathering and verifying data, but I have an open project in Google Earth with the 4 border checkpoints that aren't freight rail only (Chop (rail), Velyki Selmentsi (road), Malyy Bereznyy (road) and Uzhhorod (road - temporarily including pedestrians)), so I could at least provide a KML file. I'm about as confident that I got the right placemarks as I'm going to get, I think.

I've documented my research in issue #10.

I haven't gotten around to the code bit yet, and I'm getting close to end-of-weekend. Chances are I'll be hard pressed to eke out reasonable time slots during the week, so if someone's ready to pounce on this, this would be a good moment, I guess.

I've been thinking a bit about how I'd go about it, and, given the data I have found so far, I'd parse https://www.minv.sk/?hranicne-priechody-1, since the Slovakian authorities are probably the closest-to-metal source we can get on this, and use the checkpoints there as "keys" to scrape complementary data from other sources and associate them with these checkpoints.
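To make the "checkpoints as keys" idea a bit more concrete, here is a minimal sketch, assuming each source has already been scraped into a list of dicts. The field names (`name`, `phone`), the placeholder values, and the `normalise` / `merge_sources` helpers are all hypothetical and not taken from the repo. The only point is that the primary source decides which checkpoints exist, and the other sources merely add fields to them.

```python
import unicodedata


def normalise(name: str) -> str:
    """Turn a checkpoint name into a comparison key (hypothetical helper).

    Strips diacritics, lowercases, and collapses whitespace, so that
    differently accented or capitalised spellings map to the same key.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(unaccented.lower().split())


def merge_sources(primary, *secondary_sources):
    """Use the primary source's checkpoints as keys and fold in extra fields.

    Records from secondary sources that match no primary checkpoint are
    ignored, and fields that are already set are never overwritten.
    """
    merged = {normalise(rec["name"]): dict(rec) for rec in primary}
    for source in secondary_sources:
        for rec in source:
            key = normalise(rec["name"])
            if key not in merged:
                continue  # the primary source decides which checkpoints exist
            for field, value in rec.items():
                merged[key].setdefault(field, value)
    return merged


# Made-up records standing in for scraped data; "PLACEHOLDER" is not a real value.
primary = [{"name": "Uzhhorod"}, {"name": "Malyi Bereznyi"}]
other = [
    {"name": "malyi  bereznyi", "phone": "PLACEHOLDER"},
    {"name": "Chop", "phone": "PLACEHOLDER"},
]
merged = merge_sources(primary, other)
```

Note that this kind of normalisation would not reconcile different transliterations (e.g. "Malyy Bereznyy" vs. "Malyi Bereznyi"), so a small hand-maintained alias table would probably still be needed on top of it.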

At least for starters, it might also be okay to hard-code certain values, or to have a sort of "standard" for additional properties, like Slovakian and Ukrainian phone numbers, or manually gathered trivia that might be valuable to know. After all, I doubt the addresses of the checkpoints are going to change very often, for example.

Then again, perhaps it'd be better to hard-code values like these on the frontend, but it might also be nice to have them in a structured manner already (albeit the scraper might be the wrong place for that... maybe).
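For what it's worth, a structured version of those hard-coded values could look roughly like the sketch below. Everything in it is an assumption for illustration: the `StaticCheckpointInfo` class, its field names, and the `with_static_info` helper are hypothetical, and the phone fields are deliberately left empty since I haven't verified any numbers. The checkpoint types are the ones from my comment above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class StaticCheckpointInfo:
    """Manually maintained properties that are unlikely to change often."""
    name: str
    kind: str  # "road" or "rail"
    phone_sk: Optional[str] = None  # not filled in: no verified number yet
    phone_ua: Optional[str] = None  # not filled in: no verified number yet
    notes: str = ""


STATIC_INFO = {
    info.name: info
    for info in [
        StaticCheckpointInfo("Chop", "rail"),
        StaticCheckpointInfo("Velyki Selmentsi", "road"),
        StaticCheckpointInfo("Malyy Bereznyy", "road"),
        StaticCheckpointInfo("Uzhhorod", "road", notes="temporarily including pedestrians"),
    ]
}


def with_static_info(scraped: dict) -> dict:
    """Add hard-coded fields to a scraped record; scraped values take priority."""
    info = STATIC_INFO.get(scraped["name"])
    if info is None:
        return dict(scraped)
    extra = {k: v for k, v in asdict(info).items() if v not in (None, "")}
    return {**extra, **scraped}
```

Keeping the static values in one table like this would mean that, if some of them later turn out to be scrapable, only the table and the merge step need to change.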

Of course, there are also the Ukrainian counterparts:

A scraper could be written for these, but it'd have to be considered that the website might get compromised, or even go down at some point (let's hope not).

Aziroshin commented 2 years ago

I just tried to export the KML data from Google Earth Pro, but once imported into Google My Maps, the placemark data contains a malformed link. Here's the link to the map anyway: https://www.google.com/maps/d/u/0/viewer?mid=10XSF6CAHYmRA7htR2j-7YxGNmHwLuf1m&ll=49.32994140486006%2C22.460267424595084&z=7 (mind you, it only contains 4 checkpoints; it excludes Pavlov and the freight one in Uzhhorod, as I didn't find all that much about either of them).

In any case, the Google Maps links in #10, or the coordinates on https://dpsu.gov.ua/ua/NA-KORDONI-ZI-SLOVACKOYU-RESPUBLIKOYU-2018/ might be of more immediate interest to the scraper right now anyway.

Aziroshin commented 2 years ago

I've been poking around a bit (making a bit of a mess), but spent most of the time running into issues with my KML files. The current state of things is on this fork: https://github.com/Aziroshin/scraper. If anyone else takes this issue, feel free to fork, copy-paste or hit me up for a pull request, ask questions or request clarifying comments - whatever seems like the best way forward. My time during the week will be quite limited, so don't wait for me.

Jachdich commented 2 years ago

Wow, thanks very much for all that information. I might take a look at coding some of that stuff at some point, after I'm done with the translator. Personally I think hard-coding values in the scraper is better than in the frontend, so if/when we change them to non-hard-coded values all the changes are in one place. But I might be overlooking something.

Aziroshin commented 2 years ago

@Jachdich I've detailed it in #10, but it's important enough to mention here as well: I've seen a mention (one that I'd overlooked before) that Chop is closed to "individual traffic". I'm not sure how to interpret that, since Chop is a railway checkpoint (as far as entry into Slovakia is concerned, at least). I also found three more sites with lists of Slovakian border checkpoints, but none of them mention Chop. The only checkpoints listed there are the ones connected to Uzhhorod, Mali Selmentse and Malyi Bereznyi.

I think we should be careful in how we handle Chop when presenting it to people seeking information for now.

robcampbell79 commented 2 years ago

@Aziroshin I don't know if this will be helpful, but here is a site that lists checkpoints between Ukraine and Slovakia: https://www.vashtransfer.com/en/2018/03/11/border-crossings-from-ukraine-to-hungary-slovakia-poland-en/ It only has checkpoint names, and they are in English.

Jachdich commented 2 years ago

@Aziroshin Are you still actively working on this? If not, I need something to do, so I might pick it up.

Aziroshin commented 2 years ago

@Jachdich You can pick it up. I haven't gotten around to it during the week, so the mess in https://github.com/Aziroshin/scraper/commits/master is all there is. I used the Hungary scraper as a guideline and then, after stalling for a while, managed to get a latitude and longitude out of my KML files. The usefulness of that is questionable anyway, though, because one could just mock the coordinates directly if taking that route. ^^"
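For reference, getting coordinates out of a KML file can be done with the standard library alone. This is a rough sketch and not the code on my fork: the `placemark_coordinates` helper and the sample placemark (including its coordinates) are made up, and it assumes the file uses the standard KML 2.2 namespace and `Point` placemarks. The one thing to watch out for is that KML stores coordinates as `longitude,latitude[,altitude]`, in that order.

```python
import xml.etree.ElementTree as ET

# Assumption: the export uses the standard KML 2.2 namespace.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Made-up placemark for illustration; these are not real checkpoint coordinates.
SAMPLE_KML = """
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example checkpoint</name>
      <Point><coordinates>22.5,48.5,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>
"""


def placemark_coordinates(kml_text: str) -> dict:
    """Map each Point placemark's name to a (latitude, longitude) tuple."""
    root = ET.fromstring(kml_text.strip())
    result = {}
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS).strip()
        coords = placemark.findtext(".//kml:Point/kml:coordinates", namespaces=KML_NS)
        if not name or not coords:
            continue  # skip folders, paths, and unnamed placemarks
        # KML order is longitude first, then latitude, then optional altitude.
        lon, lat = coords.strip().split(",")[:2]
        result[name] = (float(lat), float(lon))
    return result


coordinates = placemark_coordinates(SAMPLE_KML)
```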

There's also a bit of code that downloads a website to a file and allows for poking around in it without making unnecessary requests - however, I'm not well acquainted with BeautifulSoup and whether this is, perhaps, quite redundant.
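The idea behind that download helper is roughly the following. This is a simplified sketch rather than the code on the fork; the `fetch_cached` name, the `cache` directory, and the optional `fetch` parameter (there so the function can be tried without network access) are all assumptions made for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_cached(url, cache_dir="cache", fetch=None):
    """Return the raw page body for `url`, downloading only on a cache miss.

    `fetch` is an optional callable taking a URL and returning bytes;
    when it is omitted, the page is downloaded with urllib.
    """
    # Hash the URL so that any URL maps to a safe file name.
    file_name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_path = Path(cache_dir) / file_name
    if cache_path.exists():
        return cache_path.read_bytes()

    if fetch is None:
        with urllib.request.urlopen(url, timeout=30) as response:
            body = response.read()
    else:
        body = fetch(url)

    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(body)
    return body
```

As far as I understand it, BeautifulSoup only parses markup that is handed to it and doesn't fetch or cache pages itself, so the returned bytes could simply be passed on to it; to force a fresh download, delete the cached file.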

Jachdich commented 2 years ago

@Aziroshin Cool, thanks very much! I'll see if I can do it the simple way lol

WillDHB commented 2 years ago

@Jachdich Did you finish this? I see there's a Slovakia scraper that you wrote.

Jachdich commented 2 years ago

@WillDHB Depends on your definition of finished - it works, but it could be improved

WillDHB commented 2 years ago

Improved in what way(s)?

Jachdich commented 2 years ago

Firstly, some of the hard-coded info could be scraped, but I'm not sure if it's worth the trouble. Secondly, there don't seem to be any addresses, only names (currently the address field is just set to the name field). I'm not sure how to go about fixing that.

WillDHB commented 2 years ago

Okay. Would you mind creating separate issues for those problems, with a little background on whatever trouble you ran into so that we can all track what's going on a bit better?

Jachdich commented 2 years ago

Sure, I think that would be a good idea

Aziroshin commented 2 years ago

> Firstly, some of the hard-coded info could be scraped, but I'm not sure if it's worth the trouble. Secondly, there don't seem to be any addresses, only names (currently the address field is just set to the name field). I'm not sure how to go about fixing that.

I went over the Ukrainian addresses of the most interesting checkpoints a little more in a comment on issue #10, prying apart the address strings from the Ukrainian customs' website. I'm fairly confident in the results, except where stated otherwise. They seem to check out against Google Maps and a few other Google searches.