CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License

Olivetti Group - html parsing running errors #3

Closed. IAmGrootel closed this issue 6 years ago

IAmGrootel commented 6 years ago

@eddotman @zjensen262 We pulled a branch from master and tried running the ECS parser using:

from LimeSoup.ECSSoup import ECSSoup 
data = ECSSoup.parse(ECS_htmls[0])

where ECS_htmls is a list of HTML strings.
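
For context, ECS_htmls was built roughly along these lines (a sketch with hypothetical paths, not our exact loading code):

import glob
import io

# Hypothetical directory of saved ECS article pages; adjust to your setup.
ECS_htmls = []
for path in sorted(glob.glob('ecs_papers/*.html')):
    with io.open(path, encoding='utf-8') as f:
        ECS_htmls.append(f.read())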

But we get an error:

NameError                                 Traceback (most recent call last)
<ipython-input-6-12ee0748abcb> in <module>()
      1 from LimeSoup.ECSSoup import ECSSoup
----> 2 data = ECSSoup.parse(ECS_htmls[0])

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     50         if not self._next:
     51             raise ValueError("Please provide at least one parsing rule ingredient to the soup")
---> 52         return self._next.parse(html_str)
     53 
     54 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     60 
     61     def parse(self, html_str):
---> 62         results = self._parse(html_str)
     63         if self._next:
     64             results = self._next.parse(results)

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\ECSSoup.pyc in _parse(parser_obj)
    164         # Collect information from the paper using ParserPaper
    165         # Create tag from selection function in ParserPaper
--> 166         parser.deal_with_sections()
    167         obj['Sections'] = parser.data_sections
    168         return {'obj': obj, 'html_txt': parser.raw_html}

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_paper.pyc in deal_with_sections(self)
     53         """
     54         parameters = {'name': re.compile('^section_h[0-6]'), 'recursive': False}
---> 55         parse_section = self.create_parser_section(self.soup, parameters, parser_type=self.parser_type)
     56         self.data_sections = parse_section.data
     57         self.headings_sections = parse_section.heading

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_paper.pyc in create_parser_section(soup, parameters, parser_type)
     73         :return:
     74         """
---> 75         return ParserSections(soup, parameters, parser_type=parser_type)
     76 
     77     @staticmethod

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_section.py in __init__(self, soup, parameters, debugging, parser_type)
     37             #self.save_soup_to_file('some_thing_wrong_chieldren.html')
     38             warnings.warn(' Some think is wrong in children!=1')
---> 39             exit()
     40         self.soup1 = self.soup1[0]
     41         self.parameters = parameters

NameError: global name 'exit' is not defined
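
The NameError itself seems to come from the bare exit() call: exit is only added to builtins by Python's site module, so it is not reliably available inside library code. One possible fix would be to raise an exception instead; a minimal sketch (require_single_section is a hypothetical helper mirroring the check in ParserSections.__init__, whose exact condition we are only inferring from the traceback):

import warnings

def require_single_section(sections):
    # Hypothetical stand-in for the failing check: warn and raise instead of
    # calling the bare exit(), which triggers the NameError above.
    if len(sections) != 1:
        warnings.warn('Something is wrong: expected exactly one matching section')
        raise ValueError('Expected exactly one top-level section, got %d' % len(sections))
    return sections[0]

Inside ParserSections.__init__ this would replace the warn/exit() pair with something like self.soup1 = require_single_section(self.soup1).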

This was tried on the following DOIs: 10.1149/1.3492151, 10.1149/1.3492174, and 10.1149/1.3492188. The HTML files could be opened in Chrome and appeared to render properly there.

With RSC we were able to run data = RSCSoup.parse(RSC_htmls[0]), but we get an issue with empty entries in data: DOI, Journal, and Keywords are all empty. We tried this on the DOIs 10.1039/B210215C, 10.1039/B210393C, and 10.1039/C000028K.

So for the ECS parser we were wondering if this error has a fix. And for the RSC parser we wanted to check whether the missing entries are expected behavior or whether we should be attempting to extract that information from the HTML ourselves.

tiagobotari commented 6 years ago

Hi Alex,

ECS: The ECS papers are very old and do not have full-HTML content. Our parser was built to handle the Full Text link, from which we obtain the HTML content. In the case of ECS, we do not store the full page; we store only the relevant part of the webpage, i.e. <div class="article fulltext-view">.
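
Something like the following would isolate the part our parser expects; a minimal sketch, assuming BeautifulSoup is installed (ECS_htmls is your list from above, and the selector follows the div quoted here):

from bs4 import BeautifulSoup
from LimeSoup.ECSSoup import ECSSoup

# Full page HTML as downloaded from ECS.
raw_page = ECS_htmls[0]
soup = BeautifulSoup(raw_page, 'html.parser')

# Keep only the block the ECS parser expects.
fulltext = soup.select_one('div.article.fulltext-view')

if fulltext is not None:
    data = ECSSoup.parse(str(fulltext))
else:
    # Very old ECS papers may not expose a full-text HTML view at all.
    data = None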

RSC: Regarding Journal and Keywords, we store this information together with the DOI in a separate database, so do not worry about it. However, you should include the DOI during the parsing step. As for the content of the page, for RSC we pass only the <div id="wrapper"> of the full-article HTML to the parser.
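
A similar sketch for RSC (the import path and output layout are assumptions; only the <div id="wrapper"> selector and the separate DOI bookkeeping follow the description above):

from bs4 import BeautifulSoup
from LimeSoup.RSCSoup import RSCSoup

raw_page = RSC_htmls[0]    # full article HTML from the report above
doi = '10.1039/B210215C'   # DOI tracked outside the parser

soup = BeautifulSoup(raw_page, 'html.parser')
wrapper = soup.find('div', id='wrapper')

data = RSCSoup.parse(str(wrapper)) if wrapper is not None else None
if data is not None:
    # Attach the DOI yourself; the parser does not fill in DOI/Journal/Keywords.
    data['DOI'] = doi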

If you need anything else, let me know. Best,

eddotman commented 6 years ago

Thanks for the clarification!