xiayank opened this issue 7 years ago
Are you using "Copy selector" in Chrome?
When using jsoup, do you also find that even for the same page and the same CSS selector, the extracted Element sometimes works fine and returns what you want, but other times comes back empty, and sometimes throws IllegalArgumentException: String must not be empty?
jsoup feels unstable to me; it fails a lot of the time.
Same here. I thought it was Amazon blocking the crawler...
If so, shouldn't there be a 503 error?
@bihjuchiu In class, John got an exception like org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=http://amazon.com.
Good point, maybe it's a jsoup problem...
@xiayank Can you post the URL and selector you are using, and tell me which item you want to crawl? I will take a look.
@jygan
https://www.amazon.com/workout-clothes/b/ref=nav_shopall_sa_sp_athclg?ie=UTF8&node=11444071011
https://www.amazon.com/Exercise-Equipment-Gym-Equipment/b/ref=nav_shopall_sa_sp_exfit?ie=UTF8&node=3407731
https://www.amazon.com/Hunting-Fishing-Gear-Equipment/b/ref=nav_shopall_hntfsh?ie=UTF8&node=706813011
https://www.amazon.com/soccer-store-soccer-shop/b/ref=nav_shopall_sa_sp_team?ie=UTF8&node=706809011
https://www.amazon.com/Fan-Shop-Sports-Outdoors/b/ref=nav_shopall_sa_sp_fan?ie=UTF8&node=3386071
https://www.amazon.com/Golf/b/ref=nav_shopall_sa_sp_golf?ie=UTF8&node=3410851
https://www.amazon.com/man-cave/b/ref=nav_shopall_sa_sp_gamerm?ie=UTF8&node=706808011
https://www.amazon.com/Sports-Collectibles/b/ref=nav_shopall_sa_sp_sptcllct?ie=UTF8&node=3250697011
https://www.amazon.com/Sports-Fitness/b/ref=nav_shopall_sa_sp_allsport?ie=UTF8&node=10971181011
https://www.amazon.com/b/ref=nav_shopall_lpd_gno_sports?ie=UTF8&node=12034909011
https://www.amazon.com/camping-hiking/b/ref=nav_shopall_sa_out_camphike?ie=UTF8&node=3400371
https://www.amazon.com/Cycling-Wheel-Sports-Outdoors/b/ref=nav_shopall_sa_out_cyc?ie=UTF8&node=3403201
https://www.amazon.com/Outdoor-Recreation-Clothing/b/ref=nav_shopall_sa_out_outcloth?ie=UTF8&node=11443874011
https://www.amazon.com/skateboarding-scooters-skates/b/ref=nav_shopall_sa_out_scooskate?ie=UTF8&node=11051398011
https://www.amazon.com/water-sports/b/ref=nav_shopall_sa_out_water?ie=UTF8&node=11051399011
https://www.amazon.com/winter-sports/b/ref=nav_shopall_sa_out_wintersport?ie=UTF8&node=2204518011
https://www.amazon.com/climbing/b/ref=nav_shopall_sa_out_climb?ie=UTF8&node=3402401
https://www.amazon.com/outdoor-accessories/b/ref=nav_shopall_sa_out_accout?ie=UTF8&node=11051400011
https://www.amazon.com/outdoor-recreation/b/ref=nav_shopall_sa_out_alloutrec?ie=UTF8&node=706814011
Elements elements = doc.select("span[class=nav-a-content]");
System.out.println(elements.size()); // sometimes zero, sometimes not
for (int i = 2; i <= elements.size(); i++) {
    String css = "#nav-subnav > a:nth-child(" + i + ")";
    Element element = doc.select(css).first();
}
I want to crawl the links in the menu.
Thanks!
Just hard-code up to 10 categories and catch the exception:
for (int i = 2; i <= 10; i++) {
    try {
        String css = "#nav-subnav > a:nth-child(" + i + ")";
        Element element = doc.select(css).first();
    } catch (Exception e) {
        // skip this category and keep going
    }
}
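Or, a sketch that avoids both the hard-coded bound and the per-index re-select by iterating the matched anchors directly (assuming the same doc as above):
// Iterate the menu anchors directly; no nth-child arithmetic needed,
// and an empty result set just means the loop body never runs.
for (Element a : doc.select("#nav-subnav > a")) {
    String href = a.attr("abs:href"); // absolute URL; "" if missing
    if (!href.isEmpty()) {
        System.out.println(a.text() + " -> " + href);
    }
}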
The thing is, the next time we crawl the page it may work or it may not. So some products get crawled once but never again. Does that affect our project? We need to compare prices across the different times we crawl.
We need to crawl each product again even if it has been crawled already. Does your code sometimes fail at this line? Element element = doc.select(css).first();
My code has two problems:
1. Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(1000000).get(); sometimes throws IllegalArgumentException: String must not be empty.
2. Sometimes elements.size() is zero, sometimes not.
My solution for the link-exploring crawler is to initialize all the URLs into a queue, and to put any URL that fails back into the queue; the loop exits only when the queue is empty, so in the end I get all the links.
But when I design the product-detail crawler, it is normal that elements.size() == 0, since not every sub-category page matches the same CSS selector (some pages have no product list on them at all). So I cannot use the queue approach for it.
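For reference, the link-exploring crawler's queue looks roughly like this (a sketch; seedUrls and fetchLinks(url) are illustrative names, not my real code):
// Sketch of the retry queue described above (illustrative names).
// fetchLinks(url) stands in for the jsoup fetch + select for one page
// and throws when the fetch or parse fails.
// Needs java.util.{Queue, ArrayDeque, Set, HashSet}.
Queue<String> queue = new ArrayDeque<>(seedUrls);
Set<String> links = new HashSet<>();
while (!queue.isEmpty()) {
    String url = queue.poll();
    try {
        links.addAll(fetchLinks(url));
    } catch (Exception e) {
        queue.add(url); // failed: put it back and retry later
    }
}
// Note: a URL that always fails would loop forever; in practice you
// would cap the retries per URL.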
1. Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).timeout(1000000).get(); sometimes throws IllegalArgumentException: String must not be empty
For this error, can you check whether url is empty before connecting?
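For example, a minimal guard (a sketch; url and the headers map come from your own crawler):
// Per the suggestion above: jsoup rejects empty strings with
// IllegalArgumentException, so guard before connecting.
if (url == null || url.trim().isEmpty()) {
    continue; // or log the bad entry and move on
}
Document doc = Jsoup.connect(url).headers(headers).userAgent(USER_AGENT).get();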
2. Sometimes elements.size() will be zero, sometimes not.
This might be related to the maximum body size the crawler will load. Try:
Document doc = Jsoup.connect(url)
        .headers(headers)
        .userAgent(USER_AGENT)
        .maxBodySize(0) // 0 = no limit on response body size
        .get();
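For context: jsoup truncates the response body at a default maximum size (1MB in the jsoup versions current at the time), so on a large page the elements late in the HTML never get parsed and the selector comes back empty; maxBodySize(0) removes that cap.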
I have a question about using the jsoup API to select a target element. Here is the HTML. I want to get the href attribute value of an <a> tag, which is under the <div class="bxc-grid__image bxc-grid__image--light">. I tried a selector to locate the div, and it works. Then I followed the API rule E > F (an F that is a direct child of E), so the selector would be li[class=sub-categories__list__item]>a. However, there is an exception. Does anyone know how to locate the <a> tag? Thanks in advance! (References: the jsoup select API docs and the URL of the original page.)
Here is the exception log:
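Since the HTML snippet and the exception log did not come through here, the structure is a guess, but one sketch for grabbing that href scoped under the quoted div:
// Select the first <a> carrying an href underneath the div with both
// bxc-grid__image classes (class names quoted from the question;
// the nesting is assumed).
Element link = doc.select("div.bxc-grid__image.bxc-grid__image--light a[href]").first();
if (link != null) {
    System.out.println(link.attr("abs:href")); // resolve to an absolute URL
}
Also note that li[class=sub-categories__list__item] only matches when the class attribute is exactly that one string, while li.sub-categories__list__item matches the class among others; that difference may be part of the problem.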