FriendsOfPHP / Goutte

Goutte, a simple PHP Web Scraper
MIT License

Error parsing get url request with year in url #470

Open djpisbionic opened 2 years ago

djpisbionic commented 2 years ago

It seems that when the numbers 2020 or 24 are in the URL passed to request, no data is returned. I believe the URL encoding is breaking the URL.

After googling for hours, I have tried calling
$crawler->getQuery()->setEncodingType(false);
$crawler->setEncodingType(false);

with no luck.

An example of a non working URL is: https://www.ebay.com/sch/i.html?_from=R40&_nkw=2019-20+248+prizm+zion+williamson+ruby+wave&_in_kw=1&_ex_kw=&_sacat=0&LH_Sold=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=15&_sargn=-1%26saslc%3D1&_salic=1&_sop=12&_dmd=1&_ipg=60&LH_Complete=1&_fosrp=1&LH_Sold=1

and an example of a working URL is: https://www.ebay.com/sch/i.html?_from=R40&_nkw=2019-20+248+prizm+zion+williamson&_in_kw=1&_ex_kw=&_sacat=0&LH_Sold=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=15&_sargn=-1%26saslc%3D1&_salic=1&_sop=12&_dmd=1&_ipg=60&LH_Complete=1&_fosrp=1&LH_Sold=1

I cannot for the life of me figure out how to fix this. I have tried URL-encoding, decoding, and a million different versions of the URL, and the only difference between the two URLs is the keyword, the _nkw param.

Thanks in advance
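On the encoding and decoding attempts described above, a minimal sketch using only PHP's built-in urlencode() and urldecode() may help rule encoding in or out. It assumes the keyword is the plain-text form of the _nkw value from the non-working URL; the variable names are illustrative and not part of Goutte.

```php
<?php
// Plain-text keyword, assumed to be the decoded _nkw value from the thread.
$keyword = '2019-20 248 prizm zion williamson ruby wave';

// urlencode() leaves digits and hyphens untouched and turns spaces into "+".
$encodedOnce = urlencode($keyword);

// Encoding an already-encoded value escapes each "+" as "%2B",
// so the server would receive a different keyword.
$encodedTwice = urlencode($encodedOnce);

// Decoding the single-encoded value recovers the original keyword.
$decoded = urldecode($encodedOnce);

echo $encodedOnce, PHP_EOL;
echo $encodedTwice, PHP_EOL;
```

If the single-encoded value matches what is already in the URL, the numbers themselves are not being altered by encoding, and the only thing to watch for is encoding the same string a second time.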

stof commented 2 years ago

Your two URLs differ only in the (URL-encoded) _nkw value, which is either 2019-20+248+prizm+zion+williamson+ruby+wave or 2019-20+248+prizm+zion+williamson. I don't see how this relates to the number 2020.
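The comparison above can be reproduced with a short sketch that parses both query strings and lists the parameters whose decoded values differ. It relies only on PHP's built-in parse_url() and parse_str(); diffQueryParams is a hypothetical helper name, not a Goutte function. Note that parse_str() keeps only the last value of a repeated parameter such as LH_Sold.

```php
<?php
// Hypothetical helper: returns [name => [valueInA, valueInB]] for every
// query parameter whose decoded value differs between the two URLs.
function diffQueryParams(string $a, string $b): array
{
    parse_str((string) parse_url($a, PHP_URL_QUERY), $paramsA);
    parse_str((string) parse_url($b, PHP_URL_QUERY), $paramsB);

    $names = array_unique(array_merge(array_keys($paramsA), array_keys($paramsB)));
    $diff = [];
    foreach ($names as $name) {
        $valueA = $paramsA[$name] ?? null;
        $valueB = $paramsB[$name] ?? null;
        if ($valueA !== $valueB) {
            $diff[$name] = [$valueA, $valueB];
        }
    }
    return $diff;
}

// The two URLs quoted in the original report.
$nonWorking = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2019-20+248+prizm+zion+williamson+ruby+wave&_in_kw=1&_ex_kw=&_sacat=0&LH_Sold=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=15&_sargn=-1%26saslc%3D1&_salic=1&_sop=12&_dmd=1&_ipg=60&LH_Complete=1&_fosrp=1&LH_Sold=1';
$working = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2019-20+248+prizm+zion+williamson&_in_kw=1&_ex_kw=&_sacat=0&LH_Sold=1&_udlo=&_udhi=&_samilow=&_samihi=&_sadis=15&_sargn=-1%26saslc%3D1&_salic=1&_sop=12&_dmd=1&_ipg=60&LH_Complete=1&_fosrp=1&LH_Sold=1';

$diff = diffQueryParams($nonWorking, $working);
print_r($diff);
```

For these two URLs the only differing parameter is _nkw, and the difference is the trailing "ruby wave".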

djpisbionic commented 2 years ago

Sorry, when I take the 2019-20+248 out of the non-working URL, it works. Either way, it's still a string issue, and I cannot figure out why one would work and not the other.
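One way to narrow down which part of the keyword matters is to generate one _nkw value per dropped word and request each in turn. The sketch below only builds the encoded values; keywordVariants is a hypothetical helper, and how each variant is then requested and checked is left to the caller.

```php
<?php
// Hypothetical helper: for each word in the keyword, build the URL-encoded
// _nkw value with that single word removed.
function keywordVariants(string $keyword): array
{
    $tokens = explode(' ', $keyword);
    $variants = [];
    foreach ($tokens as $index => $token) {
        $rest = $tokens;
        unset($rest[$index]);
        $variants[] = [
            'dropped' => $token,
            'nkw' => urlencode(implode(' ', $rest)),
        ];
    }
    return $variants;
}

$variants = keywordVariants('2019-20 248 prizm zion williamson ruby wave');
foreach ($variants as $variant) {
    echo $variant['dropped'], ' => ', $variant['nkw'], PHP_EOL;
}
```

Substituting each nkw value into the otherwise unchanged URL would show whether a single word, or only a combination of words, makes the request come back empty.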

djpisbionic commented 1 year ago

Any help?