Mahdisadjadi / arxivscraper

A python module to scrape arxiv.org for a date range and category
MIT License

AttributeError: 'NoneType' object has no attribute 'text' #5

Closed digitalimagep closed 4 years ago

digitalimagep commented 5 years ago

site-packages/arxivscraper/arxivscraper.py", line 57, in
    first_names = [author.find(ARXIV + 'forenames').text.lower() for author in authors_xml]
AttributeError: 'NoneType' object has no attribute 'text'

Mahdisadjadi commented 5 years ago

@digitalimagep : Can you send me your code to reproduce this? Thanks.

kevingo commented 5 years ago

I also encountered this issue.

import arxivscraper
scraper = arxivscraper.Scraper(category='physics:cond-mat', date_from='2017-05-27', date_until='2017-06-07')
output = scraper.scrape()

output:

http://export.arxiv.org/oai2?verb=ListRecords&from=2017-05-27&until=2017-06-07&metadataPrefix=arXiv&set=physics:cond-mat
fetching up to  1000 records...
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-47bd483a35f6> in <module>
----> 1 output = scraper.scrape()

~/anaconda/envs/word2vec/lib/python3.6/site-packages/arxivscraper/arxivscraper.py in scrape(self)
    168             for record in records:
    169                 meta = record.find(OAI + 'metadata').find(ARXIV + 'arXiv')
--> 170                 record = Record(meta).output()
    171                 if self.append_all:
    172                     ds.append(record)

~/anaconda/envs/word2vec/lib/python3.6/site-packages/arxivscraper/arxivscraper.py in __init__(self, xml_record)
     42         self.updated = self._get_text(ARXIV, 'updated')
     43         self.doi = self._get_text(ARXIV, 'doi')
---> 44         self.authors = self._get_authors()
     45         self.affiliation = self._get_affiliation()
     46 

~/anaconda/envs/word2vec/lib/python3.6/site-packages/arxivscraper/arxivscraper.py in _get_authors(self)
     55         authors_xml = self.xml.findall(ARXIV + 'authors/' + ARXIV + 'author')
     56         last_names = [author.find(ARXIV + 'keyname').text.lower() for author in authors_xml]
---> 57         first_names = [author.find(ARXIV + 'forenames').text.lower() for author in authors_xml]
     58         full_names = [a+' '+b for a,b in zip(first_names, last_names)]
     59         return full_names

~/anaconda/envs/word2vec/lib/python3.6/site-packages/arxivscraper/arxivscraper.py in <listcomp>(.0)
     55         authors_xml = self.xml.findall(ARXIV + 'authors/' + ARXIV + 'author')
     56         last_names = [author.find(ARXIV + 'keyname').text.lower() for author in authors_xml]
---> 57         first_names = [author.find(ARXIV + 'forenames').text.lower() for author in authors_xml]
     58         full_names = [a+' '+b for a,b in zip(first_names, last_names)]
     59         return full_names

AttributeError: 'NoneType' object has no attribute 'text'

hemanthmayaluru commented 5 years ago

I also got the same error:

AttributeError: 'NoneType' object has no attribute 'text'

Any solution for this?

michaelsok commented 5 years ago

Well, I managed to bypass the error by implementing this temporary fix:

def _get_authors(self):
    authors_xml = self.xml.findall(ARXIV + 'authors/' + ARXIV + 'author')
    last_names, first_names = list(), list()
    for author in authors_xml:
        # find() returns None when the element is missing, so .text raises
        # AttributeError; fall back to an empty string in that case.
        try:
            last_names.append(author.find(ARXIV + 'keyname').text.lower())
        except AttributeError:
            last_names.append("")

        try:
            first_names.append(author.find(ARXIV + 'forenames').text.lower())
        except AttributeError:
            first_names.append("")

    # Same final step as the original method (lines 58-59 in the traceback).
    full_names = [a+' '+b for a,b in zip(first_names, last_names)]
    return full_names

It seems that at some point we stumble upon an author without a first name. I simply handle that case by appending an empty string. A more robust solution should treat the missing-forename case explicitly, but for research purposes this seems to be a good temporary fix.
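To illustrate the failure mode described above, here is a minimal sketch that reproduces it without hitting arxiv.org. The sample XML is hypothetical, and the value of ARXIV is an assumption (the thread only shows the name, not its value). It only demonstrates that find() returns None for a missing child element, which is what makes .text fail.

```python
import xml.etree.ElementTree as ET

# Assumed namespace prefix; the thread shows the name ARXIV but not its value.
ARXIV = '{http://arxiv.org/OAI/arXiv/}'

# Hypothetical record: the second author has a keyname but no forenames element.
SAMPLE = """
<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
  <authors>
    <author><keyname>Doe</keyname><forenames>Jane</forenames></author>
    <author><keyname>Collaboration</keyname></author>
  </authors>
</arXiv>
"""

record = ET.fromstring(SAMPLE.strip())
authors_xml = record.findall(ARXIV + 'authors/' + ARXIV + 'author')

# find() returns None for a missing child, so calling .text on it would raise
# AttributeError: 'NoneType' object has no attribute 'text'.
missing = [author.find(ARXIV + 'forenames') is None for author in authors_xml]
print(missing)
```

Any author entry flagged True here would trigger the traceback reported in this issue if passed through the original list comprehension.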

treemantan commented 5 years ago

I encountered this as well. Possibly you only have to change the statement on line 57:

try:
    first_names = [author.find(ARXIV + 'forenames').text.lower() for author in authors_xml]
except AttributeError:
    first_names = []

or maybe just put an if condition inside the comprehension, i.e. change the original from

first_names = [author.find(ARXIV + 'forenames').text.lower() for author in authors_xml]

to

first_names = [author.find(ARXIV + 'forenames').text.lower() for author in authors_xml if author.find(ARXIV + 'forenames') is not None]

michaelsok commented 5 years ago

Won't solution 1 remove all first names if a single author triggers the error? And won't solution 2 create first_names and last_names lists of different lengths?

But I do think creating an if-else condition is the way to go.
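A small sketch of the length concern, using hypothetical author data in place of the parsed XML: filtering out missing forenames shortens one list, so zip() silently pairs the wrong names, whereas an if-else expression keeps one entry per author.

```python
# Hypothetical author data as (forenames, keyname); None marks a missing forename.
authors = [('jane', 'doe'), (None, 'collaboration'), ('john', 'smith')]

last_names = [last for _, last in authors]

# Filtering drops the missing entry, so the two lists no longer line up
# and zip() stops at the shorter one.
filtered_first = [first for first, _ in authors if first is not None]
misaligned = [a + ' ' + b for a, b in zip(filtered_first, last_names)]

# An if-else expression keeps exactly one entry per author.
padded_first = [first if first is not None else '' for first, _ in authors]
aligned = [(a + ' ' + b).strip() for a, b in zip(padded_first, last_names)]

print(misaligned)  # ['jane doe', 'john collaboration']
print(aligned)     # ['jane doe', 'collaboration', 'john smith']
```

Note that the filtered version raises no error; it just returns wrong names, which is arguably worse than the original exception.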

Cerebrock commented 4 years ago

+1

radema commented 4 years ago

I slightly changed the suggestion by @treemantan and fixed the empty-name case in my locally installed version. Here's the code I've used:

first_names = [author.find(ARXIV + 'forenames').text.lower() if author.find(ARXIV + 'forenames') is not None else 'n/a' for author in authors_xml]

I prefer to have 'n/a' as a string and replace it later if I need to.
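The one-liner above calls find() twice per author. One possible variant, sketched here, moves the lookup into a small helper. The helper name child_text is hypothetical (it is not part of arxivscraper), the value of ARXIV is an assumption, and the sample element is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace prefix; the thread shows the name ARXIV but not its value.
ARXIV = '{http://arxiv.org/OAI/arXiv/}'


def child_text(element, tag, default='n/a'):
    """Hypothetical helper: lowercased text of a child element, or a default.

    Covers both a missing child (find() returns None) and an empty child
    (its .text is None).
    """
    child = element.find(ARXIV + tag)
    if child is None or child.text is None:
        return default
    return child.text.lower()


# Made-up author entry with a keyname but no forenames element.
author = ET.fromstring(
    '<author xmlns="http://arxiv.org/OAI/arXiv/">'
    '<keyname>Collaboration</keyname>'
    '</author>'
)

full_name = child_text(author, 'forenames') + ' ' + child_text(author, 'keyname')
print(full_name)  # n/a collaboration
```

The same helper could be applied to keyname as well, so both lists always have one entry per author.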

Mahdisadjadi commented 4 years ago

PR #9 should have resolved this. Closing the issue.