BitTigerInst / BitTiger-CS504-FAQ

CS504 后端工程师直通车 FAQ

Jsoup Crawler #43

Open xiayank opened 7 years ago

xiayank commented 7 years ago

I have a question about using the jsoup API to select a target element. Here is the HTML (screenshot omitted). I want to get the href attribute value of the <a> tag, which is under <div class="bxc-grid__image bxc-grid__image--light">. I tried using

Elements elements = doc.select("div[class=bxc-grid__image   bxc-grid__image--light]");

to locate the div, and it works. Following the API's E > F (an F that is a direct child of E), the selector should be li[class=sub-categories__list__item] > a. However, that throws an exception.

Does anyone know how to locate the <a> tag?

Thanks in advance! Jsoup select API URL OF ORIGINAL PAGE
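For what it's worth, a class selector combined with a descendant selector usually locates this kind of nested link more robustly than the attribute-equality form div[class=...]. A minimal sketch, using the class names from the question (SelectorDemo and firstHref are hypothetical names, not from the thread):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Return the href of the first <a> under the target div, or null if absent.
    public static String firstHref(String html) {
        Document doc = Jsoup.parse(html);
        // .class selectors tolerate multi-class attributes, unlike
        // attribute-equality matches such as div[class=a b].
        Element link = doc.selectFirst("div.bxc-grid__image.bxc-grid__image--light a[href]");
        return link == null ? null : link.attr("href");
    }

    public static void main(String[] args) {
        String html = "<div class=\"bxc-grid__image bxc-grid__image--light\">"
                    + "<a href=\"/some/page\">link</a></div>";
        System.out.println(firstHref(html)); // prints /some/page
    }
}
```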

Here is the exception log:

Exception in thread "main" java.lang.IllegalArgumentException: String must not be empty
    at org.jsoup.helper.Validate.notEmpty(Validate.java:92)
    at org.jsoup.nodes.Attribute.setKey(Attribute.java:51)
    at org.jsoup.parser.ParseSettings.normalizeAttributes(ParseSettings.java:54)
    at org.jsoup.parser.HtmlTreeBuilder.insert(HtmlTreeBuilder.java:185)
    at org.jsoup.parser.HtmlTreeBuilderState$7.process(HtmlTreeBuilderState.java:553)
    at org.jsoup.parser.HtmlTreeBuilder.process(HtmlTreeBuilder.java:113)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:50)
    at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:43)
    at org.jsoup.parser.HtmlTreeBuilder.parse(HtmlTreeBuilder.java:56)
    at org.jsoup.parser.Parser.parseInput(Parser.java:32)
    at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:135)
    at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:747)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:250)
    at test.main(test.java:26)
jygan commented 7 years ago

are you using "copy selector" in chrome?

#nav-subnav > a:nth-child(7)

xiayank commented 7 years ago

When using jsoup, does anyone else run into this: even for the same page and the same CSS selector, the crawled Elements sometimes work fine and return what I want, but other times come back empty, or throw IllegalArgumentException: String must not be empty? jsoup feels unstable to me; it fails a lot of the time.

bihjuchiu commented 7 years ago

Same here. I thought it was Amazon blocking the crawler...

xiayank commented 7 years ago

If so, shouldn't there be a 503 error?

xiayank commented 7 years ago

@bihjuchiu In class, John had the exception like org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://amazon.com.

bihjuchiu commented 7 years ago

Good point, maybe it's a Jsoup problem...

jygan commented 7 years ago

@xiayank can you post the URL and selector you are using, and tell me which item you want to crawl? I will take a look.

xiayank commented 7 years ago

@jygan

URL list

https://www.amazon.com/workout-clothes/b/ref=nav_shopall_sa_sp_athclg?ie=UTF8&node=11444071011
https://www.amazon.com/Exercise-Equipment-Gym-Equipment/b/ref=nav_shopall_sa_sp_exfit?ie=UTF8&node=3407731
https://www.amazon.com/Hunting-Fishing-Gear-Equipment/b/ref=nav_shopall_hntfsh?ie=UTF8&node=706813011
https://www.amazon.com/soccer-store-soccer-shop/b/ref=nav_shopall_sa_sp_team?ie=UTF8&node=706809011
https://www.amazon.com/Fan-Shop-Sports-Outdoors/b/ref=nav_shopall_sa_sp_fan?ie=UTF8&node=3386071
https://www.amazon.com/Golf/b/ref=nav_shopall_sa_sp_golf?ie=UTF8&node=3410851
https://www.amazon.com/man-cave/b/ref=nav_shopall_sa_sp_gamerm?ie=UTF8&node=706808011
https://www.amazon.com/Sports-Collectibles/b/ref=nav_shopall_sa_sp_sptcllct?ie=UTF8&node=3250697011
https://www.amazon.com/Sports-Fitness/b/ref=nav_shopall_sa_sp_allsport?ie=UTF8&node=10971181011
https://www.amazon.com/b/ref=nav_shopall_lpd_gno_sports?ie=UTF8&node=12034909011
https://www.amazon.com/camping-hiking/b/ref=nav_shopall_sa_out_camphike?ie=UTF8&node=3400371
https://www.amazon.com/Cycling-Wheel-Sports-Outdoors/b/ref=nav_shopall_sa_out_cyc?ie=UTF8&node=3403201
https://www.amazon.com/Outdoor-Recreation-Clothing/b/ref=nav_shopall_sa_out_outcloth?ie=UTF8&node=11443874011
https://www.amazon.com/skateboarding-scooters-skates/b/ref=nav_shopall_sa_out_scooskate?ie=UTF8&node=11051398011
https://www.amazon.com/water-sports/b/ref=nav_shopall_sa_out_water?ie=UTF8&node=11051399011
https://www.amazon.com/winter-sports/b/ref=nav_shopall_sa_out_wintersport?ie=UTF8&node=2204518011
https://www.amazon.com/climbing/b/ref=nav_shopall_sa_out_climb?ie=UTF8&node=3402401
https://www.amazon.com/outdoor-accessories/b/ref=nav_shopall_sa_out_accout?ie=UTF8&node=11051400011
https://www.amazon.com/outdoor-recreation/b/ref=nav_shopall_sa_out_alloutrec?ie=UTF8&node=706814011

Selector:

Elements elements = doc.select("span[class=nav-a-content]");
System.out.println(elements.size());
// elements.size() is sometimes zero, sometimes not.

for(int i = 2; i <= elements.size(); i++){
       String css = "#nav-subnav > a:nth-child(" + Integer.toString(i) +")";
       Element element = doc.select(css).first();
}

Item

The links in the menu (screenshot omitted).

Thanks!

jygan commented 7 years ago

just hard code up to 10 categories and catch the exception:

for(int i = 2; i <= 10; i++){
       String css = "#nav-subnav > a:nth-child(" + Integer.toString(i) +")";
       Element element = doc.select(css).first();
}
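Wrapping each slot in try/catch keeps one bad selector from aborting the rest of the crawl, as suggested above. A minimal sketch of that idea (MenuCrawler and menuLinks are hypothetical names, not from the thread):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class MenuCrawler {
    // Collect menu-link hrefs for child slots 2..10, skipping missing slots.
    public static List<String> menuLinks(Document doc) {
        List<String> hrefs = new ArrayList<>();
        for (int i = 2; i <= 10; i++) {
            try {
                Element e = doc.selectFirst("#nav-subnav > a:nth-child(" + i + ")");
                if (e != null) {
                    hrefs.add(e.attr("href"));
                }
            } catch (RuntimeException ex) {
                // Log and continue so one bad slot does not abort the loop.
                System.err.println("child " + i + ": " + ex.getMessage());
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<div id=\"nav-subnav\"><a href=\"/a\">1</a><a href=\"/b\">2</a></div>");
        System.out.println(menuLinks(doc)); // prints [/b]
    }
}
```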

xiayank commented 7 years ago

The thing is that the next time we crawl the page, it may or may not work. So some products get crawled once but never again. Does that affect our project? We need to compare prices across the different times we crawl.

jygan commented 7 years ago

we need to crawl the product again even if it's already been crawled. Does your code sometimes fail at this line? Element element = doc.select(css).first();

xiayank commented 7 years ago

My code has two problems:

1. Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(1000000).get(); sometimes throws IllegalArgumentException: String must not be empty.

2. Sometimes elements.size() is zero, sometimes not.

My solution for the link-exploring crawler is to initialize all the URLs into a queue and re-enqueue any URL that fails, looping until the queue is empty. At the end, I get all the links.

But when I design the product-detail crawler, it is normal for elements.size() to be 0, since not all sub-category pages have the same CSS selector; some do not have a product list on them. So I cannot use the queue approach there.
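The queue-based retry described above can be sketched roughly as follows. RetryQueue, crawlAll, and the fetch function are hypothetical names; fetch is pluggable (in real use it would wrap the Jsoup.connect call), so a flaky page simply re-enters the queue up to a retry limit:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class RetryQueue {
    // Drain the URL queue, re-enqueueing failures up to maxAttempts per URL.
    public static List<String> crawlAll(List<String> urls,
                                        Function<String, String> fetch,
                                        int maxAttempts) {
        Deque<String> queue = new ArrayDeque<>(urls);
        Map<String, Integer> attempts = new HashMap<>();
        List<String> results = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            try {
                results.add(fetch.apply(url));
            } catch (RuntimeException ex) {
                int n = attempts.merge(url, 1, Integer::sum);
                if (n < maxAttempts) {
                    queue.add(url); // retry later rather than losing the URL
                }
            }
        }
        return results;
    }
}
```

A cap like maxAttempts is worth adding so a permanently broken URL cannot loop forever.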

jygan commented 7 years ago

1. Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(1000000).get(); sometimes throws IllegalArgumentException: String must not be empty.

For this error, can you check whether url is empty?

2. Sometimes elements.size() will be zero, sometimes not. This might be related to the max body size the crawler can load; try

Document doc = Jsoup.connect(url)
                .headers(headers)
                .userAgent(USER_AGENT)
                .maxBodySize(0)
                .get();
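Combining both suggestions: validate the URL before connecting (the "String must not be empty" exception often traces back to a blank input), and lift the default body-size cap, which can silently truncate large pages and make selectors come back empty. A minimal sketch (SafeFetcher and fetch are hypothetical names, not from the thread):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.Map;

public class SafeFetcher {
    // Guard against a blank URL before handing it to Jsoup, and set
    // maxBodySize(0) so large pages are not truncated mid-document.
    public static Document fetch(String url, Map<String, String> headers,
                                 String userAgent) throws IOException {
        if (url == null || url.trim().isEmpty()) {
            throw new IllegalArgumentException("url must not be empty: " + url);
        }
        return Jsoup.connect(url)
                .headers(headers)
                .userAgent(userAgent)
                .maxBodySize(0)   // 0 = unlimited body size
                .timeout(30_000)
                .get();
    }
}
```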