cal852 / COMP3111-Project

Web scraper for COMP3111
Do What The F*ck You Want To Public License

Scraping Accuracy Issue #1

Open cal852 opened 6 years ago

cal852 commented 6 years ago

I know it's not our code, but I would like you guys to take a look at the latest commit I've made for task 1 and see whether it is worth informing the TA about, since the scraping of prices may not be accurate.

enochwong3111 commented 6 years ago

I found that if we change the XPath to your new one, it may count some elements twice, since there can be two price spans for one item. But I don't think we need to tell the TA, since we won't/can't use the same website for this project; we will need to change the path ourselves to find the elements on a different website.
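The double-counting above can be sketched with a toy fragment. The markup below is hypothetical (a Craigslist-like result list where each row repeats the price span twice, once in the title and once in the meta line); the point is that a page-wide price query returns two spans per listing, while querying at most one price per row does not:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: each result row carries the price span twice,
# so a page-wide query for result-price spans counts every listing twice.
HTML = """
<ul id="sortable-results">
  <li class="result-row">
    <a class="result-title"><span class="result-price">$120</span>Bike</a>
    <p class="result-info"><span class="result-meta">
      <span class="result-price">$120</span>
    </span></p>
  </li>
  <li class="result-row">
    <a class="result-title"><span class="result-price">$45</span>Lamp</a>
    <p class="result-info"><span class="result-meta">
      <span class="result-price">$45</span>
    </span></p>
  </li>
</ul>
"""

root = ET.fromstring(HTML)

# Naive page-wide query: 4 price spans for only 2 listings.
naive = [s.text for s in root.findall(".//span[@class='result-price']")]
print(len(naive))  # 4

# Safer: take at most one price span per <li> row.
per_row = [li.find(".//span[@class='result-price']").text
           for li in root.findall("li")]
print(per_row)  # ['$120', '$45']
```

Scoping the price lookup to each `<li>` row (however the real portal structures its rows) is what avoids the over-count.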

cal852 commented 6 years ago

I thought we have to scrape both Craigslist and another website of our choice? So we add another website, but results from Craigslist take priority over our choice.

According to this text I found, "The returned result of the function WebScraper.scrape contains the merged data from two portals. [5]"

enochwong3111 commented 6 years ago

Maybe we need to ask the TA whether we should change the portal or add a portal, since the task also states that we are not required to handle multi-page data, but we do need to use a portal other than Craigslist?

2. Be able to scrape data from a single webpage of a local/international selling/reselling portals (e.g. carousell, dcfever, preloveled, taobao or any similar webpage. Please noted that we only accept websites written in English or Chinese) Note, there is no requirement to handle multiple pages data.

enochwong3111 commented 6 years ago

If the Craigslist portal is needed, then change the path to `//*[@id="sortable-results"]/ul/li/p/span[2]/span[@class="result-price"]`, which gives the right elements for the price. But I found that some items have no price tag, as shown in the image below, so there is still a problem. [image] Another method is to use `//*[@id="sortable-results"]/ul/li/p/span[2]/span[1]` and check whether the content contains '$' or not.
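The second method above (take the first span per row and validate it) can be sketched like this. The fragment is hypothetical, imitating a row that has a price span and a row that has none (e.g. only a housing/size span), which is the missing-price-tag case described above:

```python
import xml.etree.ElementTree as ET

# Hypothetical rows: the second listing has no price span at all.
HTML = """
<ul id="sortable-results">
  <li><p><span class="result-meta"><span class="result-price">$250</span></span></p></li>
  <li><p><span class="result-meta"><span class="housing">2br</span></span></p></li>
</ul>
"""

root = ET.fromstring(HTML)

prices = []
for li in root.findall("li"):
    # Roughly the span[1] approach: grab the first nested span per row,
    # then validate its text instead of trusting the class attribute.
    first = li.find("./p/span/span")
    if first is not None and first.text and first.text.startswith("$"):
        prices.append(first.text)
    else:
        prices.append(None)  # listing has no price tag

print(prices)  # ['$250', None]
```

Validating the '$' prefix keeps the row count aligned with the listings (a `None` placeholder for price-less items) instead of silently skipping them, which is what misaligns prices with titles when the strict `result-price` path misses a row.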