DiegoCaraballo / Email-extractor

The main functionality is to extract all the emails from one or several URLs.
https://whitemonkey.io

Unknown URL type and sites that hang scraper #7

Closed luluhoc closed 5 years ago

luluhoc commented 5 years ago

Hello, I'm getting this error that stops the program from extracting emails.

```
unknown url type: 'robert@broofa.com'
Press enter to continue
```
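The error happens because a bare email address (no scheme) gets passed to `urllib` as if it were a URL. A minimal sketch of a guard that could be applied before fetching; `is_fetchable` and the sample `links` list are hypothetical, not names from the script:

```python
from urllib.parse import urlparse

def is_fetchable(link):
    """Return True only for links the scraper can actually open.

    A bare email address like 'robert@broofa.com' has no scheme, and
    mailto:/javascript: links have schemes urllib cannot open; passing
    any of these to urlopen raises "unknown url type".
    """
    return urlparse(link).scheme in ("http", "https")

# Hypothetical links as they might appear in a scraped page:
links = [
    "https://example.com/contact",
    "robert@broofa.com",                      # bare address -> unknown url type
    "mailto:info@example.com",
    "javascript:navigateTo('/mailinglist')",
]
fetchable = [link for link in links if is_fetchable(link)]
```

Filtering this way lets the crawler keep the email-looking strings for extraction while only fetching real HTTP(S) URLs.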

There is also an issue with some sites that hang the scraper. I'm not sure if it can be overcome, but here are some examples of such sites; maybe you can figure it out from them.


```
Searching in https://whyy.streamguys1.com/whyy-mp3

Searching in http://www.investor.reuters.com/business/BusCompanyOverview.aspx?ticker=SCI&symbol=SCI&target=%2fbusiness%2fbuscompany%2fbuscompfake%2fbuscompoverview

Searching in http://www.accuweather.com/en/us/jersey-city-nj/07306/weather-forecast/2735_pc
```

groupon.com

I'm also getting this error:
`[Errno 104] Connection reset by peer`
luluhoc commented 5 years ago

A possible solution for `[Errno 104] Connection reset by peer`:

https://stackoverflow.com/questions/20568216/python-handling-socket-error-errno-104-connection-reset-by-peer
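Following the approach in that Stack Overflow answer, `[Errno 104]` surfaces in Python 3 as `ConnectionResetError`, so it can be caught and retried with a short back-off. A minimal sketch, not the script's actual code; `fetch_with_retry` and its parameters are hypothetical:

```python
import socket
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, retries=3, delay=2, timeout=10):
    """Fetch a URL, retrying when the peer resets the connection.

    ConnectionResetError is the Python 3 face of [Errno 104]; some
    servers drop the first request, so a brief pause and retry often
    succeeds. Returns None when all attempts fail so the caller can
    simply skip the URL instead of crashing.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req, timeout=timeout) as f:
                return f.read()
        except (ConnectionResetError, socket.timeout, urllib.error.URLError):
            if attempt == retries - 1:
                return None  # give up: caller skips this URL
            time.sleep(delay)
```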

DiegoCaraballo commented 5 years ago

@luluhoc Hi, I added a check for broken URLs in options 1, 2, and 3. The operation is now a bit slower, but I will try to improve it when I switch to objects. Regards

luluhoc commented 5 years ago

Thanks, keep up the great work

luluhoc commented 5 years ago

The problem still persists. I'm searching for "funeral home new jersey" with 500 results requested.

I have run the updated Python script twice, and it hangs on the same URL.


```
Searching in /notices/Alejandro-Hernandez
Searching in /notices/Patrick-Montella
Searching in /notices/Frank-Petrecca
Searching in /notices/Anthony-Ferlazzo
Searching in /notices/Alejandro-Hernandez
Searching in /notices/Patrick-Montella
Searching in /notices/Frank-Petrecca
Searching in /notices/Anthony-Ferlazzo
Searching in /notices/Warren-Vernon
Searching in /notices/Victoria-Rooney
Searching in /notices/Warren-Vernon
Searching in /notices/Victoria-Rooney
Searching in javascript:navigateTo('/mailinglist')
Searching in javascript:navigateTo('/listings')
Searching in /send-flowers
Searching in /mailinglist
Searching in /listings
Searching in /send-flowers
Searching in /our-facilities
Searching in /concierge-services
Searching in http://www.nfda.org/
Searching in http://www.nutleychamber.com/
86 - chamber@nutleychamber.com
87 - info@tempotherapy.com
Searching in https://web.njsfda.org/public/professional-home/about-njsfda/related-entities/njfds.aspx
Searching in https://web.njsfda.org/public/preplanning/preplanning-a-funeral/check-trust-balances-and-choices-tax-statements.aspx
Searching in https://www.facebook.com/pages/Biondi-Funeral-Home/154470051254851
Searching in http://www.accuweather.com/en/us/nutley-nj/07110/weather-forecast/2709_pc
```
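The log suggests two separate problems: the same relative paths (`/notices/...`) are crawled repeatedly, and `javascript:` pseudo-links are treated as pages to search. A minimal sketch of one way to handle both, assuming nothing about the script's internals; `normalize_links` and `visited` are hypothetical names:

```python
from urllib.parse import urljoin, urlparse

def normalize_links(base_url, hrefs, visited):
    """Resolve relative hrefs against the page URL and drop repeats.

    urljoin turns '/notices/...' into an absolute URL; a shared
    `visited` set then prevents the crawler from looping over the
    same pages, and the scheme check skips javascript:/mailto: links.
    """
    out = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if urlparse(url).scheme not in ("http", "https"):
            continue  # skips javascript:navigateTo(...), mailto:, etc.
        if url in visited:
            continue  # already crawled: avoids the repeat visits above
        visited.add(url)
        out.append(url)
    return out
```

With a set like this shared across the crawl, each notice page would be searched once instead of being re-queued every time a link to it appears.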
DiegoCaraballo commented 5 years ago

Hello @luluhoc , I'm gonna check it.

luluhoc commented 5 years ago

Thanks

DiegoCaraballo commented 5 years ago

Hello @luluhoc, on lines 837 and 876 add `timeout=10` and test: `f = urllib.request.urlopen(req, timeout=10)`

I'll upload the changes along with other fixes later. Regards
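The timeout bounds how long a hung server can stall the crawl, but on its own it turns the hang into a raised exception, so it needs a matching `except` to skip the URL. A minimal sketch of how those call sites might look; `fetch` is a hypothetical wrapper, not the script's actual function:

```python
import socket
import urllib.error
import urllib.request

def fetch(req, timeout=10):
    """Open a request with a hard time limit and skip it on failure.

    Without the timeout, urlopen can block indefinitely on sites like
    the streaming and weather URLs above; with it, a slow server raises
    socket.timeout (or URLError wrapping it) and the crawl moves on.
    """
    try:
        with urllib.request.urlopen(req, timeout=timeout) as f:
            return f.read()
    except (socket.timeout, urllib.error.URLError) as e:
        print(f"Skipping slow/broken URL: {e}")
        return None
```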

luluhoc commented 5 years ago

OK, I'll change it and try it out.

luluhoc commented 5 years ago

It doesn't work