SarthakJShetty / pyResearchInsights

End-to-end NLP tool to analyze research publications. Published in Ecology & Evolution 2021.
MIT License

scraper_main -> TypeError: object of type 'NoneType' has no len() #9

Closed: carbbin closed this issue 2 months ago

carbbin commented 5 months ago

Hi!

I am working with Python 3.11.7.

I created a virtual environment, and after hitting the error I also installed older versions of beautifulsoup4==4.12.2 and bs4==0.0.1 to see whether the package version was the cause.

The script creates the LOGS folder for me, and I already have the NLTK_DATA folder.

What could be causing this error?

The error:

##################
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[INFO]12:20:32 Built LOG folder for session
[INFO]12:20:32 https://link.springer.com/search/page/ start_url has been received
[INFO]12:20:32 https://link.springer.com/search/page/0?facet-content-type="Article"&query=Western+Ghats+Conservation&facet-language="En" has been obtained
Traceback (most recent call last):
  File "c:\Users\vscode\pyresearch\data.py", line 16, in <module>
    scraper_main(keywords_to_search, abstracts_log_name, status_logger_name)
  File "c:\Users\vscode\pyresearch\.venv\Lib\site-packages\pyResearchInsights\Scraper.py", line 396, in scraper_main
    urls_to_scrape = url_generator(start_url, query_string, status_logger_name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\vscode\pyresearch\.venv\Lib\site-packages\pyResearchInsights\Scraper.py", line 65, in url_generator
    test_soup = bs(url_reader(total_url, status_logger_name), 'html.parser')
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\vscode\pyresearch\.venv\Lib\site-packages\bs4\__init__.py", line 315, in __init__
    elif len(markup) <= 256 and (
##################

The code:

##################
from pyResearchInsights.common_functions import pre_processing
from pyResearchInsights.Scraper import scraper_main

'''Abstracts containing these keywords will be queried from Springer'''
keywords_to_search = "Western Ghats Conservation"

'''Calling the pre_processing functions here so that abstracts_log_name and status_logger_name are available across the code'''
abstracts_log_name, status_logger_name = pre_processing(keywords_to_search)

'''Runs the scraper here to scrape the details from the scientific repository'''
scraper_main(keywords_to_search, abstracts_log_name, status_logger_name)
##################
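As a quick sanity check, independent of pyResearchInsights, one can fetch the search URL from the log directly and inspect what Springer actually returns. A minimal sketch using requests; the URL is copied verbatim from the [INFO] line above:

##################
import requests

# The search URL that the scraper reported obtaining, per the log above
url = ('https://link.springer.com/search/page/0'
       '?facet-content-type="Article"'
       '&query=Western+Ghats+Conservation'
       '&facet-language="En"')

response = requests.get(url, timeout=30)
print(response.status_code)  # anything other than 200 means the page was not served normally
print(response.text[:200])   # peek at whatever the server actually sent back
##################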

SarthakJShetty commented 5 months ago

Hi @carbbin, can you post the full traceback? It looks like you've cut off the error right at the start of the actual error description.

Thanks!

sustianovich commented 5 months ago

Hi @SarthakJShetty,

Yes sorry for that. Here it is:

File "c:\Users.venv\Lib\site-packages\bs4__init.py", line 315, in init__ elif len(markup) <= 256 and ( ^^^^^^^^^^^ TypeError: object of type 'NoneType' has no len()

SarthakJShetty commented 5 months ago

Thank you for reporting this error. Indeed, it looks like there have been site-wide changes at Springer which are preventing the page retrieval. I fear that it may not be possible to even retrieve the HTML anymore. I will try some other ways to retrieve the HTML and get back to you on this. This was also reported in #11.
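In the meantime, a None check around the failing call in url_generator would at least turn the crash into a readable error. A rough sketch of the idea, assuming url_reader is importable from Scraper; only the bs(url_reader(...)) pattern is taken from the actual Scraper.py, the wrapper and its error message are illustrative:

##################
from bs4 import BeautifulSoup as bs
from pyResearchInsights.Scraper import url_reader  # helper seen in the traceback

def safe_soup(total_url, status_logger_name):
    '''Guarded version of the call that fails at Scraper.py line 65.'''
    markup = url_reader(total_url, status_logger_name)
    if markup is None:
        # The request came back empty (blocked, redirected, or the site changed),
        # so fail loudly instead of handing None to BeautifulSoup
        raise RuntimeError("Could not retrieve any HTML from " + total_url)
    return bs(markup, 'html.parser')
##################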

sustianovich commented 5 months ago

Ok, thank you very much @SarthakJShetty!

SebastianLeimbacher commented 3 months ago

Hey @SarthakJShetty, did you manage to retrieve the HTML differently, and if so, are you planning to implement it? I'm looking for a tool like yours at the moment...

SarthakJShetty commented 3 months ago

Hi @SebastianLeimbacher! Thank you for trying out pyResearchInsights. I'll take a look today and try to get back to you. Apologies for the delay, the situation looks to be a bit trickier than I anticipated at first :sweat:

devans18 commented 3 months ago

Hi @SarthakJShetty Do you have any update?

SarthakJShetty commented 2 months ago

Hi @devans18, @SebastianLeimbacher, and @sustianovich,

Sorry for the delay, but I've finally figured this out. I will build and post a new package in a few hours and get back on this issue with an update.

SarthakJShetty commented 2 months ago

Thank you for being patient. The v1.60 release should solve this issue. Feel free to reopen this issue if you still run into the error.
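For anyone landing here later: upgrading the package should pull in the fix (standard pip usage, assuming an install from PyPI):

##################
pip install --upgrade pyResearchInsights
##################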