Ozon3Org / Ozon3

An open-source Python package to easily obtain real-time, historical, or forecasted air quality data for anywhere in the world. Reliable, accurate and simple.
GNU General Public License v3.0

Test out historical-data feature to ensure that it's production ready #58

Closed · Milind220 closed 2 years ago

Milind220 commented 2 years ago

@Sam-damn

I tried this out today, and it worked great for the one city I tried it with. I'm going to try it in a few other places tomorrow.

Samxx97 commented 2 years ago

@Milind220 That is great! Did you notice any performance issues in execution speed like I did? I should also point out that this feature (or rather, the branch) isn't ready to be merged yet: I still have to make minor adjustments to the docstrings of the methods and of the module itself so that it's consistent with the other module.

One more thing: I noticed that the search bar we fill in before scraping the data doesn't offer options for all countries. For example, I tried to input my capital city, Damascus, and there were no results. I didn't check what happens in the code in that case, so we should definitely handle those edge cases. There's also the case where the city name does bring up a list of suggestions, but none of them exactly matches the user's query. I think a valid approach there is to inform the user of this and provide them with the list of alternatives they could use instead of their original query.

Milind220 commented 2 years ago

@Sam-damn Yeahhhh I did. It took about 5 minutes to fetch the data for me, but we can't do much about that. I don't think it's a problem though: it would take about as long for a person to get the data manually, and this way it's done automatically.

If the search returns options but none match exactly:

As you said, we should provide the user with the options and let them choose which one to use.

If the search returns no options, such as with Damascus:

To start with, we could just say that there's no data available for that location. The user could then try alternatives themselves and see what works.

Later, we can improve this in a new version:

1. Get the coordinates of the location they want data for (using the Google Maps API, for example).
2. Use these coordinates in the get_coordinate_air method to get the name of the nearest station that provides air quality data (for Damascus this showed me: Upper Galilee - Tel Hai, north, Israel; I know this isn't very close, but I'm not sure what else we can do).
3. Use the name of this station in the historical data method to get the data.
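A rough sketch of that fallback flow, purely as an illustration: the geocode helper is hypothetical, the exception type and the column name on get_coordinate_air's result are assumptions, and the import/constructor follow the usual package usage.

```python
# Illustrative sketch only: `geocode` is a hypothetical helper, and the
# assumed result shape of get_coordinate_air needs checking.
import ozone as ooz

o = ooz.Ozone("YOUR_PRIVATE_TOKEN")

def geocode(city: str) -> tuple[float, float]:
    # Placeholder: in practice, call a geocoding service
    # (e.g. the Google Maps API) and return (latitude, longitude).
    raise NotImplementedError

def get_historical_with_fallback(city: str):
    try:
        return o.get_historical_data(city=city)
    except Exception:
        # No data under this exact name: fall back to the nearest station.
        lat, lon = geocode(city)
        nearest = o.get_coordinate_air(lat, lon)
        station_name = nearest["city"].iloc[0]  # assumed column name
        return o.get_historical_data(city=station_name)
```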

Samxx97 commented 2 years ago

@Milind220 That seems like a solid plan! Nice thinking 😁

Milind220 commented 2 years ago

@Sam-damn I recently got a new computer, and after setting it up, this somehow no longer works for me. Some strange errors are thrown every time. I'll update you on this tomorrow, after I've seen whether I can fix it myself (in case it's just a machine-configuration issue).

Samxx97 commented 2 years ago

@Milind220 It probably has to do with the browser itself; this is the annoying thing about Selenium. Can you show me what error it threw?

Milind220 commented 2 years ago

This is the final error message (it won't even let me import ozone):

SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line

Full error trace:

[screenshot: full error trace]
Milind220 commented 2 years ago

@Sam-damn Do you think it's perhaps that webdriver-manager isn't downloading the right web driver? It mentions Mozilla Firefox, but I thought we'd be running it on Chrome?

Samxx97 commented 2 years ago

@Milind220 I'm using the Mozilla web driver in the implementation, not Chrome, because it's simpler, lightweight, and usually preferred.

Samxx97 commented 2 years ago

After searching the web for this issue, I found this here. Apparently, according to the question, this error appears because Firefox itself isn't installed on your system. But isn't a web driver the only thing Selenium needs to run? And isn't that what webdriver-manager provides? We'll have to look more into this.

Samxx97 commented 2 years ago

@Milind220 Can you check your home folder? Inside it there should be a .wdm directory, which has a drivers folder inside it. The Firefox web driver should be downloaded there by webdriver-manager, and it's then passed into Selenium via the return value of the install() method, as you can see on the line where the global webdriver-manager module instance is instantiated.

Samxx97 commented 2 years ago

Also, the reason it wouldn't let you import is that a module's global instances are defined upon import, and that's where the Selenium instance is instantiated; the line of code that instantiates it is exactly where the error is raised.
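To make the mechanism concrete, a minimal sketch (Selenium 3-era API, as used at the time; this illustrates the pattern, not the module's actual code):

```python
# Sketch of the import-time failure mode described above.
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager

# Module-level global: this line runs as soon as the module is imported.
# webdriver-manager downloads geckodriver into ~/.wdm/drivers/ and
# install() returns the binary's path, which is handed to Selenium.
# If the Firefox browser itself is missing, SessionNotCreatedException
# is raised right here, so even `import ozone` fails.
DRIVER = webdriver.Firefox(executable_path=GeckoDriverManager().install())
```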

Milind220 commented 2 years ago

After searching the web for this issue, I found this here. Apparently, according to the question, this error appears because Firefox itself isn't installed on your system. But isn't a web driver the only thing Selenium needs to run? And isn't that what webdriver-manager provides? We'll have to look more into this.

Yeah, this is exactly what I thought webdriver-manager would help with. I'll give it some thought too!

Milind220 commented 2 years ago

@Milind220 Can you check your home folder? Inside it there should be a .wdm directory, which has a drivers folder inside it. The Firefox web driver should be downloaded there by webdriver-manager, and it's then passed into Selenium via the return value of the install() method, as you can see on the line where the global webdriver-manager module instance is instantiated.

The .wdm directory is there, and inside it there's a drivers folder containing chromedriver and geckodriver, but no Firefox driver.

[screenshot: contents of the .wdm/drivers folder]
Samxx97 commented 2 years ago

@Milind220 geckodriver is the Firefox one; the folder should have the geckodriver binary inside it.

Milind220 commented 2 years ago

@Sam-damn Oh okayyy that makes sense.

Milind220 commented 2 years ago

@Sam-damn So I got it to work with a few adjustments.

It then fetched the historical data for London with no problem. I think this is because my computer has Google Chrome on it but not Mozilla Firefox: webdriver-manager must still require the actual browser to be present for the downloaded binaries to work.

The easiest fix I can think of is to have the user pass the browser they wish to use as an argument to the get_historical_data method. Then, based on what they enter, we can initialise their browser of choice in create_selenium_webdriver.
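A hedged sketch of that idea (create_selenium_webdriver is the function named in the branch, but this signature and body are illustrative only, using the Selenium 3-style executable_path argument):

```python
# Sketch: dispatch on the user's browser choice. Illustrative only.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager

def create_selenium_webdriver(browser: str = "chrome"):
    """Return a Selenium driver for the browser the user asked for."""
    if browser == "chrome":
        return webdriver.Chrome(executable_path=ChromeDriverManager().install())
    if browser == "firefox":
        return webdriver.Firefox(executable_path=GeckoDriverManager().install())
    raise ValueError(f"Unsupported browser: {browser!r}")
```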

Milind220 commented 2 years ago

Here's what the edited code looked like for me. This is all it needed to work.

[screenshot: the edited code]

(Also, the fact that your original code worked on my old computer makes sense: that machine had Mozilla Firefox installed on it.)

Samxx97 commented 2 years ago

@Milind220 OHH, this is actually quite surprising to me; I didn't think it would work with Chrome that easily without changes to the code, hahaha. Great! In that case, I love your idea of incorporating the user's choice into this. How about we extend it even further: first try the user's driver choice, and if it raises an exception, catch it and try the next driver until one works. I think this way is more robust, since the user might not know or might make a mistake, and they wouldn't like to see that ugly exception xD
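A hedged sketch of that fallback chain (all names here are hypothetical, again with the Selenium 3-style API):

```python
# Sketch: try the user's preferred browser first, then the rest,
# returning the first driver that starts successfully.
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager

DRIVER_FACTORIES = {
    "chrome": lambda: webdriver.Chrome(executable_path=ChromeDriverManager().install()),
    "firefox": lambda: webdriver.Firefox(executable_path=GeckoDriverManager().install()),
}

def create_driver_with_fallback(preferred: str = "chrome"):
    # The user's choice first, then every other known browser.
    order = [preferred] + [b for b in DRIVER_FACTORIES if b != preferred]
    for browser in order:
        factory = DRIVER_FACTORIES.get(browser)
        if factory is None:
            continue  # unrecognised name (e.g. a typo): try the others
        try:
            return factory()
        except WebDriverException:
            continue  # browser not installed or driver failed to start
    raise RuntimeError("No supported browser could be started.")
```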

Samxx97 commented 2 years ago

Also, @Milind220, one thing worries me: if the page changes its HTML structure in the future (which is unlikely), the code will break. Should I leave those worries for a future issue, as an enhancement toward making the code completely robust (I'm not sure that's even possible)? With Selenium you mainly depend on HTML elements to find the bits of data you want to scrape.

Milind220 commented 2 years ago

Also, @Milind220, one thing worries me: if the page changes its HTML structure in the future (which is unlikely), the code will break. Should I leave those worries for a future issue, as an enhancement toward making the code completely robust (I'm not sure that's even possible)? With Selenium you mainly depend on HTML elements to find the bits of data you want to scrape.

That's a very valid concern, and it might happen in the future. Let's leave this for now; there's not much we can do about it anyway.

Milind220 commented 2 years ago

@Sam-damn Great idea about trying the user's choice and then the other options if their choice fails. Let's do it like that.

lahdjirayhan commented 2 years ago

Can I also develop this branch (hist-data, I assume)? Do I need to make PRs? I'm eager to try it out and contribute to this feature too, if that's still possible.

Milind220 commented 2 years ago

@lahdjirayhan Yeah, that'd be great! This feature is going to go a long way toward making Ozone more useful.

All the details you need are in the conversation above

Milind220 commented 2 years ago

@Milind220 OHH, this is actually quite surprising to me; I didn't think it would work with Chrome that easily without changes to the code, hahaha. Great! In that case, I love your idea of incorporating the user's choice into this. How about we extend it even further: first try the user's driver choice, and if it raises an exception, catch it and try the next driver until one works. I think this way is more robust, since the user might not know or might make a mistake, and they wouldn't like to see that ugly exception xD

We're aiming to implement this. The rest of it works for now!

Milind220 commented 2 years ago

@lahdjirayhan @Sam-damn Any updates on this? If you need me, I can help out with this too. Just let me know what part you're each working on so that we don't accidentally end up doing the same sections of code.

Samxx97 commented 2 years ago

@Milind220 Can you try opening this URL in a regular browser? I'm not sure if it's my internet connection, but it won't open at all. This is the same URL that we scrape data from.

Samxx97 commented 2 years ago

Connection timed out.

Could the website itself be busy, or is it a problem on my side of the network?

lahdjirayhan commented 2 years ago

@Milind220 I'm actually trying to find a way to scrape the data, preferably without resorting to Selenium. My plans are:

  1. Try to use the same way the target site's frontend fetches and shows the data. The pro of this approach is that no browser/driver is needed, just requests. The con is that I first need to try really, really hard to find which JS script creates the table, and then see whether I can replicate that JS script in Python.
  2. If that fails (or proves to be too hard, or entirely out of my knowledge), I will use the requests-html package (see the sketch after this list). It uses pyppeteer under the hood, which automatically ensures there's a local copy of Chromium on each device it's installed on (by first downloading and installing Chromium into its own custom path).
  3. If that fails too (or also proves too hard), only then will I follow the development path outlined in this conversation.
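A minimal sketch of what plan 2 could look like with requests-html (untested; the city URL is just an example, and whether render() alone produces the historical-data table is an open question):

```python
# Sketch of plan 2: requests-html renders the page's JavaScript in a
# local headless Chromium that pyppeteer downloads on first use.
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://aqicn.org/city/london/")  # example city page
r.html.render()                # first call downloads a local Chromium
tables = r.html.find("table")  # look for the rendered table, if any
print(len(tables), "table(s) found")
```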

I'm currently on method 1, trying to find which JS script is responsible for parsing the backend data and displaying it as a table. It feels like finding a needle in a metric ton of haystack, since I'm not at all well-versed in JS.

Note: if either of you @Milind220 @Sam-damn has been down this path and can warn me that it's futile, that's welcome lol. I'm just uncomfortable resorting to Selenium as my first choice of experimentation, given that I know of two other methods that could potentially work. Besides, you two have already worked on Selenium, so I personally thought: why not try something else? Sorry for not notifying you all about this earlier.

lahdjirayhan commented 2 years ago

@Sam-damn Works fine on my end. Your network is likely at fault.

Samxx97 commented 2 years ago

@lahdjirayhan

Using Selenium wasn't our first choice at all. We actually went down the path of trying to automate this whole process using just requests; we had a full discussion on this, and the full process we went through is detailed here.

  1. That alone won't be enough. The client side receives the encrypted (or possibly encoded) table as a response to a request made by the browser itself; from what I saw (I inspected the network tab for any relevant requests made by my browser upon inputting the desired city and pressing the button), it appears that only the client knows how to decrypt the table info sent back from the server and eventually render it as a table element. As for locating the JS that handles the generation of the table (it really is like finding a needle in a haystack) and reverse-engineering it to replicate its behavior in Python: I did try that, but the code was too obfuscated and complicated for me to understand. If you want, I can point you toward the JS that does this so you can try, if you're interested. If you can achieve it, that would be great, since we wouldn't have to rely on web driver automation, but from what I saw, I don't think it's possible.

  2. I've actually done a bit of research on this library, and from what I've read, it can handle any task that doesn't require interacting with the page to dynamically render JavaScript elements. That isn't our case: the table we scrape isn't statically generated; it's generated dynamically after some interactions with the page (inputting the city name and then pressing a button). Besides, I don't see the point of using another library that also relies on web drivers, since Selenium already does its job very well and is usually the go-to library when interaction with a webpage is needed. We would just be refactoring the code to use this library instead of Selenium while following the same logic.

lahdjirayhan commented 2 years ago

@Sam-damn I appreciate the explanation and the link to that old thread. I understand that Selenium was likely a last resort for the purposes of this package.

  1. Thank you for the offer! For the moment, I think I'll try it myself and see if I manage. If I don't, I will contact you for pointers from your past journey. I think I'm very close (but again, aren't we all always feeling that way when we're hopeful, haha).
  2. Along my attempts at achieving a pure-requests flow, I've managed to sidestep the need to interact with the search form (a sketch of this search step is at the end of this comment) by:

    • Sending the search request (GET) myself via this URL: https://search.waqi.info/nsearch/full/new%20york
    • It then returns an array of search results that, among other things, include a city ID. For the correct New York, the ID is 3307.
    • I then go to this URL, https://aqicn.org/city/@3307, and it redirects to the city's air quality page.
    • Within that page, there's a table of historical air quality data, just like what we see at https://aqicn.org/data-platform/register

    Given that I can get the direct link to the city's air quality page and would only need to render its contents to get the historical air quality table, requests-html does not sound very unreasonable to me, unless I'm missing something very obvious. Plus, requests-html automatically handles the need for "ensuring that there's a browser+driver to use"; that's why I was thinking about it in the first place. Note that I haven't actually used this library yet (I'm still on method 1, after all) and can't tell whether it can fully render the city's air quality page without errors or other hassle.

    That all being said: yes, Selenium is the go-to for scraping, and I'm not challenging that. I understand that it's not wise to switch to requests-html just because it has Chromium "bundled" with it. Thanks for your perspective!
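A hedged sketch of the search step above; the endpoint URL is the one quoted in this thread, but the JSON key names used here are guesses that need verifying against a live response:

```python
# Sketch of the search step. Endpoint is from this thread; the key
# names ("results", "x") are assumptions and must be checked.
import requests

resp = requests.get("https://search.waqi.info/nsearch/full/new%20york")
resp.raise_for_status()
payload = resp.json()

# Assumed shape: a list of search results, each carrying a numeric
# city ID that maps to a page like https://aqicn.org/city/@3307
for result in payload.get("results", []):
    city_id = result.get("x")  # assumed key holding the city ID
    print(city_id, "->", f"https://aqicn.org/city/@{city_id}")
```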

Samxx97 commented 2 years ago

@lahdjirayhan I didn't know there was another page offering the information in this table; this is great! In that case, yes, I agree with you: this approach would definitely be better :) Good job

Samxx97 commented 2 years ago

@lahdjirayhan When I open the URL with the ID specific to Hong Kong, the page doesn't seem to have historical data for some reason: https://aqicn.org/city/@3308

lahdjirayhan commented 2 years ago

Yes, that means you'll need to use other search results. The first one doesn't always work. I guess there should be a way to filter out those occasions, but I can't figure it out yet. @Sam-damn

For New York above, I have to use the second search result because the first one has no historical data for some reason. For London, I have to use the second search result because the first one points to some other place called London (not London, United Kingdom). For Moscow, the first search entry works.

etc, etc.

Samxx97 commented 2 years ago

Indeed, but for Hong Kong the second location in the list is "Xiangzhou Hangkong Road, Xiangyang", which isn't Hong Kong but rather a sub-part of it or something. The weird thing is that this doesn't happen on the original page, https://aqicn.org/data-platform/register: when you type in Hong Kong and press on the suggestion, the table data appears normally. That is very weird.

Samxx97 commented 2 years ago

@lahdjirayhan If you type this URL into the search bar, https://api.waqi.info/api/attsse/3308/yd.json (the number in the URL is the city's ID), you'll be prompted to download a JSON file which itself contains the historical data table. If you open it, you'll see the entries for each date and each parameter; however, the data for each parameter seems encrypted or encoded. If you can figure out what encoding scheme it is, or somehow decrypt it, that will solve the whole issue and be more robust, since we won't have to rely on the page with the rendered table.

PS: I think the yd in the file name refers to "yearly data".
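For reference, fetching that payload for a look is straightforward with requests (the per-parameter values still come back encoded, so this only prints the raw text):

```python
# Fetch the yd.json payload described above for inspection.
# ID 3308 is Hong Kong's, per this thread; values are still encoded.
import requests

resp = requests.get("https://api.waqi.info/api/attsse/3308/yd.json")
resp.raise_for_status()
print(resp.text[:500])  # peek at the raw (encoded) historical data
```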

lahdjirayhan commented 2 years ago

@Sam-damn I've been aware of that for some time now. I'm exactly in the process of finding out how to convert that long encoded string into a nice row-wise table (just like what the site's client/frontend does). I think I'm getting somewhere.

Milind220 commented 2 years ago

@lahdjirayhan @Sam-damn Woah guys, this is awesome! I just went through all the comments since I last commented, and wow, you guys have been busy!

If you type this URL into the search bar, https://api.waqi.info/api/attsse/3308/yd.json (the number in the URL is the city's ID), you'll be prompted to download a JSON file which itself contains the historical data table. If you open it, you'll see the entries for each date and each parameter; however, the data for each parameter seems encrypted or encoded. If you can figure out what encoding scheme it is, or somehow decrypt it, that will solve the whole issue and be more robust, since we won't have to rely on the page with the rendered table.

PS: I think the yd in the file name refers to "yearly data".

This approach sounds difficult, but if we can figure it out, it's definitely the best. I think if it doesn't work, then we can temporarily go with Selenium (just to get the feature out there) and tidy it up later with a more robust solution.

PS: Sorry for responding to this thread so late; I had an important deadline today and was really busy with that.

lahdjirayhan commented 2 years ago

@Milind220 @Sam-damn I've been able to successfully decode the server-sent data. I've tested it on some cities with several (well-behaved) inputs, and it works as expected. However, I've yet to try bad inputs to catch errors.

As of this writing I haven't yet added the functionality into Ozone proper, and that's what I'm going to do for now.

Expect a PR soon for my work on this feature if everything goes well.

Samxx97 commented 2 years ago

@lahdjirayhan Amazing! Good job, man. Looking forward to the pull request. I'm curious: what sort of encoding scheme did it turn out to be?

lahdjirayhan commented 2 years ago

@Sam-damn I'm not sure. Custom, maybe? It turned out to be far too hard to de-obfuscate, understand, and rewrite the JS functions in Python (just too many obscure obfuscation techniques, and I'm not well-versed in JS), so I don't really know what encoding scheme was used.

Instead, I isolated what I think is the relevant set of functions within the JS script on the website. Then I used this awesome package I stumbled upon just recently to make the JS script runnable within Python: https://github.com/PiotrDabkowski/Js2Py. It is lightweight and fast.

And then I just ... run the decoding script, treating it as a black box.
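To show the shape of that black-box approach (decode_entry is a stand-in name, and the snippet body is a placeholder for the functions actually lifted from the site's script):

```python
# Black-box sketch: run isolated (obfuscated) JS via Js2Py.
# `decode_entry` is a stand-in; the real body would be the code
# lifted verbatim from the site's script.
import js2py

JS_SNIPPET = """
function decode_entry(encoded) {
    // ... the isolated decoding logic from the site goes here ...
    return encoded;  // placeholder body
}
"""

ctx = js2py.EvalJs()
ctx.execute(JS_SNIPPET)
print(ctx.decode_entry("some-encoded-string"))
```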


Honestly though, it feels like a cheap hack compared to my initial goal (pure reverse-engineered Python code), but hey, at least it works so far.

Samxx97 commented 2 years ago

@lahdjirayhan Believe me, not even JavaScript experts could understand the obfuscated functions defined in the on-click events. Treating it as a black box like you did is probably for the best. Can't wait to see how you applied it 😊

Milind220 commented 2 years ago

@lahdjirayhan Woah! Well done, man! I'll test your code soon; I'm excited to merge this in if everything works.

Milind220 commented 2 years ago

@Sam-damn Thank you for working on this with great dedication. Even though your solution wasn't merged, it certainly helped us understand the historical data page better. I'm closing this issue for now, as @lahdjirayhan's PR has been merged into dev.