Shaheerairaj / car-value-estimator-team


Develop web scraping script #2

Closed Shaheerairaj closed 5 months ago

ohinson01 commented 5 months ago

Hi @Shaheerairaj. Just wanted to know: which website will we be scraping? I'll need the website to know what to target when creating the script. Thanks.

Shaheerairaj commented 5 months ago

We'll be using this URL: https://uae.dubizzle.com/motors/used-cars/

All used cars will be under this URL. Our job is to figure out how we can filter the search results for the car brands that we want and iterate through the pages.

I'd like you to make a branch web_scrapper_olivia which will contain your solution to the problem.

My idea is that everyone comes up with their own solution to the problem, and then we can either use the best solution or merge different solutions together. But I would like everyone to try their hand at the problem and see where they get.

ohinson01 commented 5 months ago

Ok. Sounds good.

Shaheerairaj commented 5 months ago

Heads up: don't go into the detail pages. Most of the information we need is right there on the main page. If we were to open the link of each ad, we could get a lot more useful information, but that significantly increases the complexity.

We'll stick to the info available on the listing page.

ohinson01 commented 5 months ago

Question: Do we only want to scrape data for the description of each car or do we also want to include the image of the car?

I marked what I mean in red in the image below.

[image attachment]
Shaheerairaj commented 5 months ago

Well, we're solving a regression problem. The goal is to be able to predict price so we need features which will be useful to that end goal.

I'll leave it for you to determine what you think is necessary for that outcome.

ohinson01 commented 5 months ago

Ok. Gotcha. I will see what I come up with.

ohinson01 commented 5 months ago

@Shaheerairaj Created a PR #26.

It is not finished yet. I am a bit stuck on iterating through the pages to retrieve the necessary information, and I was wondering if you had any ideas. I have looked on Stack Overflow and every resource I could find to get to the bottom of the error I am having, but haven't had any luck so far. I will keep looking, but wanted to share what I have done so far in case you can provide some input.

ohinson01 commented 5 months ago

If you cannot view it through this PR, I have also posted an attachment of it in the SDS chat.

ohinson01 commented 5 months ago

2024-MAY-14

Just as a note, I figured out what went wrong with my code and why I kept receiving an error. I put my explanation in the SDS chat, but will also add it here for anyone else to reference:

Ok, I figured out what was happening. After rethinking my strategy, it turns out I had to move the listings variable into the for loop so that it recaptures the elements, since they are destroyed once the script flips through the pages.

As a note, I kept my original button `click()` code because I got an error when I just used `next_page_button.click()`. The reason is that, because of the pop-ups that appear in the window, it cannot locate that button anymore. The original code prevents that from happening.

[image attachment]
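The fix described above can be sketched in a runnable way. `FakeDriver` is a stand-in for the Selenium WebDriver so the loop structure is clear without a browser; with Selenium, `find_listings` would be `driver.find_elements(...)` and `click_next` would be the next-page button code from the screenshot. The selectors and names here are illustrative, not the actual script.

```python
class FakeDriver:
    """Stand-in for a Selenium WebDriver: serves fake pages of two listings each."""
    def __init__(self):
        self.page = 1

    def find_listings(self):
        return [f"page{self.page}-ad{i}" for i in (1, 2)]

    def click_next(self):
        self.page += 1


def scrape_pages(driver, num_pages):
    rows = []
    for _ in range(num_pages):
        # Re-fetch the listings INSIDE the loop: element references captured
        # before the loop are destroyed (go stale) as soon as the script
        # flips to the next page.
        listings = driver.find_listings()
        rows.extend(listings)
        driver.click_next()
    return rows


print(scrape_pages(FakeDriver(), 2))
# ['page1-ad1', 'page1-ad2', 'page2-ad1', 'page2-ad2']
```

The key point is only where `listings` is assigned; everything else is scaffolding.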
ohinson01 commented 5 months ago

2024-MAY-14

@Shaheerairaj PR #27

This isn't finished yet since I still need to iterate through all pages and save to CSV file, but this is what I have so far to get it to work.

Shaheerairaj commented 5 months ago

In your experience, do you only merge once it's all completed?

From my perspective, it's better to push code frequently and keep iterating. I was assuming this is how it works in traditional tech teams as well.

ohinson01 commented 5 months ago

From my experience working on a tech team, we only merged once everything had been validated and tested fully (i.e., all test cases were covered and previous functionality still worked as expected). The reason is that in a production environment where everyone has access to the code, you want your changes to be valid and complete so everyone can pull them into their own environment to be tested and validated before pushing to production. We had a process for this, and it also applied to merging changes into the main branch of the repo: once we pushed our changes to the repository, the person reviewing our pull requests would merge them into the one working branch.

In our case, since we are not in a production environment, it is perfectly fine to push code frequently and keep iterating. For instance, I keep sending PRs to track progress and to avoid losing anything in case I lose the file on my side. It depends on the circumstances in which you are pushing your code, but in this case you can push frequently and keep iterating without breaking anything.

That is just from my experience, but maybe @glasseyes has more insight into this.

Does that make sense?

ohinson01 commented 5 months ago

@Shaheerairaj This just occurred to me, but did you want to filter for a specific car brand to add to the CSV file? I was assuming we wanted to add all car brands from the pages we scrape, but then I saw your comment in PR #29.

Shaheerairaj commented 5 months ago

We only want specific car brands like those I mentioned in the kick-off meeting. I just realized I didn't put up a reference to the exact car brands we discussed, I'll add them to the README.

In my PR, I wanted to point out that it is important not to scrape all car brands at once, since there is a risk of your IP being blocked by the host webserver. I'm not sure what the limit is, but it's better to scrape in chunks rather than all at once.

Also, I want to reduce the chance of hitting their CAPTCHA security checks.
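The "scrape in chunks" idea above could be sketched with a small throttling helper: process one small chunk of brands per run and pause a random interval between pages so requests don't arrive at a machine-like fixed rate. The brand chunk and delay bounds here are assumptions for illustration, not agreed values.

```python
import random
import time

# One run's worth of brands, not the whole list (illustrative chunk).
BRAND_CHUNK = ["toyota", "nissan", "honda"]


def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep for a random interval between min_s and max_s seconds.

    Randomizing the delay avoids a perfectly regular request rate,
    which is one signal anti-bot systems look for.
    """
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

In the scraping loop, `polite_pause()` would be called once per page (and a longer pause between brands).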

ohinson01 commented 5 months ago

Ok. Gotcha. I will look out for the car brands you would like us to scrape. To be honest, I couldn't remember from the meeting whether or not we were doing that.

Shaheerairaj commented 5 months ago

I've added the car brands in my-notes

ohinson01 commented 5 months ago

@Shaheerairaj For PR #29, I figured out how to automate filtering by car brand. Below is my code snippet and how I did it.

[image attachment]

XPATH: //*[@id="lpv-list"]/div[2]/div[2]/div/div/div[1]/div[2]/div/div[2]/div/div/ul/li[2]/span

[image attachment]

I didn't want to comment in the PR itself since it was already closed. Wasn't sure if you still got notifications for closed PRs if I commented in it.
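The code screenshots above aren't preserved here, so as a toy illustration of what the quoted XPath's final `li[2]/span` step selects (the label of the second list item), here is a made-up fragment using the stdlib `ElementTree`; the real Dubizzle markup is far more deeply nested.

```python
import xml.etree.ElementTree as ET

# Made-up brand-filter fragment, NOT the real page markup.
fragment = "<ul><li><span>All brands</span></li><li><span>Toyota</span></li></ul>"
root = ET.fromstring(fragment)

# Analogous to the li[2]/span tail of the XPath in the comment above:
# select the <span> inside the second <li>.
brand = root.find("li[2]/span").text
print(brand)  # Toyota
```

In Selenium the same idea would be `driver.find_element(By.XPATH, ...)` with the full absolute path, though long index-based XPaths like this tend to break whenever the page layout changes.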

Shaheerairaj commented 5 months ago

That's a good solution. I'm interested in seeing how this works for the list of car brands. Update me when you've tried it.

For my solution, since I'm using the following base_url: "https://uae.dubizzle.com/motors/used-cars/toyota/", I've chosen to take a more manual approach and change the car brand in the base_url.

I don't want to run a loop over the car brands as well, since I've constantly experienced a weird issue where the script would just get stuck for hours and then either resume or shut down. Usually it fixes itself if I re-run the script.

Speaking of which, have you experienced something similar? And have you hit the CAPTCHA security checks at all?
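The manual approach described above, swapping the brand segment of the base_url per run, could be sketched like this; the helper name is hypothetical.

```python
# Brand-specific listing URL, following the base_url pattern quoted above.
BASE_URL_TEMPLATE = "https://uae.dubizzle.com/motors/used-cars/{brand}/"


def brand_url(brand):
    """Build the listing URL for one car brand (the site expects lowercase)."""
    return BASE_URL_TEMPLATE.format(brand=brand.lower())


print(brand_url("Toyota"))
# https://uae.dubizzle.com/motors/used-cars/toyota/
```

Each run of the script would then be pointed at `brand_url(...)` for the next brand on the list instead of editing the URL string by hand.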

ohinson01 commented 5 months ago

For the list of car brands, I am still brainstorming how to tackle that. I am thinking of maybe opening multiple browsers to fetch all that information instead of using a single driver to capture the data, but I am still unsure if this approach would be ideal. I will let you know if I find anything out about it.

As for the issue you were experiencing, I haven't had any so far. My loop runs smoothly and doesn't seem to hit any CAPTCHA security checks. I do notice that it returns blank output for page 2 and then continues gathering data; I am not sure why that is the case. Of course, my browser has an ad blocker and an extension to prevent malware, and I am not sure if that might affect how the site behaves on my side.

Shaheerairaj commented 5 months ago

I have an ad blocker as well but the issue where the code just freezes was very annoying and impossible to diagnose.

You don't have to come up with an ideal solution. Remember we mentioned in our vision for this project that we want to prioritize speed of delivery over a perfect solution. Whatever solution works to get the data that we need.

ohinson01 commented 5 months ago

Ok. Got it. I will create a loop then and see how that goes.

It is strange that you see that issue and I am not seeing it. I am thinking more like a user and performing actions that I would normally do on the website. Maybe it senses something in your code that doesn't seem quite human, but I would need to look at it to confirm my suspicions.

ohinson01 commented 5 months ago

@Shaheerairaj PR #30

ohinson01 commented 5 months ago

@Shaheerairaj I believe I have created a working script now that loops through all car brands and gathers the data dynamically. I tested this yesterday and it worked as expected. I will continue testing today to ensure everything is still functioning. I also need to make minor adjustments to my code.

ohinson01 commented 5 months ago

Also, I added all car brands to the CSV file instead of splitting them up. Is that the expected behavior we would like to have?

Shaheerairaj commented 5 months ago

That's perfectly fine. The only requirement is to have as much information about the ads as possible. Keeping all data in one file does help further down the line when conducting analysis since you only need to work with one file instead of multiple.

ohinson01 commented 5 months ago

Ok. I do have a question on the notes.md file. Could you explain what you mean by these items? I didn't include them yet since I am unsure if I should or not. I do have the ID, but I did a random uuid since I couldn't find it when looking on the webpage. So far, I only have the price, brand, model, year, mileage, and location.

Ad ID, Ad link, Ad Title, Ad tag

Shaheerairaj commented 5 months ago

Good eye.

Ad ID: This is the unique ID the website uses. You can find it in the URL of the individual ad, which is in the href attribute of the ad listing. It is the item that comes right after the three hyphens '---'.

Ad link: This is the link in the href attribute I mentioned above.

Ad Title: The title of the ad, which is the largest text in the ad's description, shown right below the car brand and car name and above the model year, mileage, and regional specs.

Ad Tag: I noticed some ads have a tag at the top right of their item box and I thought it might be useful to collect that as well, e.g. premium, car of the day, featured.
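Pulling the Ad ID and Ad link from a listing's href, following the '---' convention described above, might look like the sketch below. The example href is invented to match that description, not copied from the site.

```python
def parse_ad(href):
    """Return the ad link and the ID that follows the three hyphens.

    Per the convention above, the ad ID is the segment right after
    '---' at the end of the href.
    """
    ad_id = href.rsplit("---", 1)[-1].rstrip("/")
    return {"ad_link": href, "ad_id": ad_id}


# Invented example URL shaped like the description above.
ad = parse_ad("https://uae.dubizzle.com/motors/used-cars/toyota/corolla---12345678/")
print(ad["ad_id"])  # 12345678
```

`rsplit("---", 1)` splits only on the last occurrence, so hyphens earlier in the slug don't break the parse.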

Shaheerairaj commented 5 months ago

I noticed that you deleted your file, why is that?

Also, I didn't know that you were able to delete files without approval. Or did I approve it without realizing?

ohinson01 commented 5 months ago

I deleted the file for a couple reasons.

  1. Just wanted to improve my code a bit
  2. Also, after our discussion on retrieving the URL from an environment variable, I wanted to delete the file and re-upload it with the improved code all at once.

I deleted it in the repo itself without performing a pull request. I do not know why I was able to do that, but it allowed me to.

As for why I could delete my own file in the Repo, I think you might need to check the permissions.

https://docs.github.com/en/account-and-profile/setting-up-and-managing-your-personal-account-on-github/managing-user-account-settings/permission-levels-for-a-project-board-owned-by-a-personal-account

ohinson01 commented 5 months ago

Apologies for not saying anything about it. I will create a PR to add it back shortly.

ohinson01 commented 5 months ago

@Shaheerairaj The PR #30 has been updated to include my complete file for web-scraping. I might do another PR in a bit to improve the code, but for now, here is the working code I have.

Shaheerairaj commented 5 months ago

Sure. Can you do one thing for me please? I'm running a bit behind schedule on my end. My script works but I'm taking a bit of time to gather all the data since I'm doing it bit by bit to prevent being blocked.

In my list of car brands, I'm going from top to bottom and am on Hyundai right now. Could you run the script on your end running from bottom to top please? This should speed up the process and we should have the necessary data to go on to the next phase of the project.

The only change you need to make is to set the car brand in the base URL to the one you are searching for, in lowercase, e.g. lexus, audi, bmw, mercedes-benz.

ohinson01 commented 5 months ago

Sure. I can do that. I just need to search for kia, ford, chevrolet, volkswagen, lexus, audi, bmw, and mercedes-benz?

ohinson01 commented 5 months ago

Also, I am now running into that CAPTCHA on my end when it wasn't happening before. Not sure why it decided to appear now, but I will try to get through as much as I can.

Shaheerairaj commented 5 months ago

I've managed to catch that error. I'm sharing on the group chat. Do have a look.

Shaheerairaj commented 5 months ago

Btw, for the car brands, just work your way up from the bottom. I'm currently on Kia, so we'll meet somewhere in the middle depending on the issues each of us faces.

ohinson01 commented 5 months ago

I am now running for Lexus and will work my way up

Shaheerairaj commented 5 months ago

I've gotten as far as Ford. I'm going to stop the script for today just to leave a bit of a gap.

Do see where you get. If you manage to get to Chevrolet, that would complete the data collection phase.

Shaheerairaj commented 5 months ago

Updated the script to remove the base URL and store it in an environment variable instead.
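That change could be as small as the sketch below: read the base URL from an environment variable instead of hard-coding it. The variable name `SCRAPER_BASE_URL` and the fallback default are assumptions, not necessarily what the updated script uses.

```python
import os


def get_base_url():
    """Prefer the environment variable; fall back to the public listing URL."""
    return os.environ.get(
        "SCRAPER_BASE_URL",
        "https://uae.dubizzle.com/motors/used-cars/",
    )
```

Each of us could then set a different brand URL per run, e.g. `export SCRAPER_BASE_URL=https://uae.dubizzle.com/motors/used-cars/lexus/`, without touching the code.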