CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License

['obj']['Sections'] contains None #27

Closed by hhaoyan 5 years ago

hhaoyan commented 5 years ago

Sometimes the parser returns a data['obj']['Sections'] list with a None in it. IMO None should not be in the list, and it should be removed in upcoming versions.

For example:

html_str = """<div id="wrapper"><div class="left_head"><a class="simple" href="http://pubs.rsc.org"><img class="rsc-logo" border="0" src="http://pubs.rsc.org/content/NewImages/royal-society-of-chemistry-logo.png" alt="Royal Society of Chemistry"></a><br><span class="btnContainer"><a class="btn btn--tiny btn--primary" target="_blank" title="Link to PDF version" href="http://pubs.rsc.org/en/content/articlepdf/2012/CC/C1CC90183D">View PDF Version</a></span><span class="btnContainer"><a class="btn btn--tiny btn--nobg" title="Link to previous article (id:C1CC90192C)" href="http://pubs.rsc.org/en/content/articlehtml/2012/CC/C1CC90192C" target="_BLANK">Previous Article</a></span><span class="btnContainer"><a class="btn btn--tiny btn--nobg" title="Link to next article (id:C1CC90182F)" href="http://pubs.rsc.org/en/content/articlehtml/2012/CC/C1CC90182F" target="_BLANK">Next Article</a></span></div><div class="right_head"> </div><div class="article_info"> DOI: <a target="_blank" title="Link to landing page via DOI" href="https://doi.org/10.1039/C1CC90183D">10.1039/C1CC90183D</a>
(Editorial)
<span class="italic"><a title="Link to journal home page" href="https://doi.org/10.1039/1364-548X/1996">Chem. Commun.</a></span>, 2012, <strong>48</strong>, 18-18</div><h1 id="sect127"><span class="title_heading">A message from the new <span class="italic">ChemComm</span> chair</span></h1><p class="header_text">
      <span class="bold">

            Richard R. 
            Schrock

      </span>
    </p><div id="art-admin"><table><tbody><tr><td class="biogPlate"><img alt="" src="http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2012/CC/c1cc90183d/c1cc90183d-p1.gif"><b></b><p><b>Richard R. Schrock</b></p></td><td><i></i><p>Richard R. Schrock received his PhD in inorganic chemistry from Harvard in 1971. After spending one year as an NSF postdoctoral fellow at the University of Cambridge and three years at the Central Research and Development Department of E. I. DuPont de Nemours and Co., he moved to M.I.T. in 1975 where he became full professor in 1980 and the Frederick G. Keyes Professor of Chemistry in 1989. His interests include the inorganic and organometallic chemistry of early transition metals and catalytic processes involving them. In 2005 he shared the Nobel Prize in chemistry with Robert Grubbs and Yves Chauvin for the “development of the metathesis method in organic synthesis.”</p></td></tr></tbody></table><hr>

      <span>I accepted the position of <span class="italic">ChemComm</span> Editorial Board Chair with honour and pride in the summer of 2011. Steeped in history, <span class="italic">ChemComm</span> continues to be one of the leading journals for important and urgent research across all chemical disciplines. It was largely because of the journal's standing in the chemical community that I agreed to take the role and lead the Editorial Board for the next four years. In this brief message, I would like to layout my vision for <span class="italic">ChemComm</span> from 2012.</span>
      <p class="otherpara">First, I want to thank Professor Peter Kündig, University of Geneva, who retires from the Chairman's role at the end of 2011. In his four years as Chair, <span class="italic">ChemComm</span> has seen its impact factor rise year on year while the number of articles published has increased by 50%; this is a truly remarkable achievement. I hope to be able to look back on similarly impressive results in four years time. Thank you Peter for your leadership, vision and energy.</p>
      <p class="otherpara">Looking to the future, 2012 will be a landmark year for <span class="italic">ChemComm</span>. Starting in January the journal will publish 100 issues per year. <span class="italic">ChemComm</span> will be the first chemistry journal to achieve such a remarkable feat. The journal will be hitting your desks twice a week, with each issue packed with a mixture of high quality communications and reviews. This doubling in frequency is a consequence of the significant growth of the journal, with annual submissions now close to 8000. The most rapid growth is in the number of submissions from Asia, in particular China, where <span class="italic">ChemComm</span> is both well known and popular. We hope to maintain these links with Asia while ensuring we continue to build strong support from other key countries that are leading the way in chemical research.</p>
      <p class="otherpara">Most importantly, we will continue to focus on further improving the quality of the journal through vigorous and fair peer review. Marshalled by our Associate Editors, who are all world-renowned scientists, and the dedicated professional Editors based in Cambridge, UK, we will strive to deliver the very best customer service at a speed that sets <span class="italic">ChemComm</span> apart from its competitors.</p>
      <p class="otherpara">In summary, I am very much looking forward to working with the Editorial Board and steering the journal through this exciting period of its life. On behalf of the Editorial Board, I would like to thank all our referees and authors who continue to contribute to the journal’s success.</p>
      <p class="otherpara">Richard R. Schrock</p>
      <p class="otherpara">F. G. Keyes Professor of Chemistry</p>
      <p class="otherpara">Editorial Board Chair, <span class="italic">ChemComm</span></p>

  <table><tbody><tr><td><hr></td></tr><tr><td><b>This journal is © The Royal Society of Chemistry 2012</b></td></tr></tbody></table></div></div>"""

from LimeSoup.RSCSoup import RSCSoup
RSCSoup.parse(html_str)
# Gives:
# {'obj': {'DOI': '', 'Title': ['A message from the new ChemComm chair'], 'Keywords': [], 'Journal': [], 'Sections': [None]}, 'html_txt': '<section_h1>\n</section_h1>'}
hhaoyan commented 5 years ago

Fixed in 8e2edb66f57080c1a9c9c573ce5e48969c34f030.
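
For anyone still on a version that predates that commit, a caller-side workaround could look roughly like the sketch below. It is not the fix from the commit: drop_none_sections is a hypothetical helper name, and the only assumption is that the parser result has the dict shape shown in the report above.

```python
def drop_none_sections(data):
    """Return a shallow copy of a parser result with None entries
    removed from data['obj']['Sections'].

    Hypothetical helper, not part of LimeSoup.
    """
    obj = dict(data.get('obj', {}))
    # Treat a missing or None 'Sections' value as an empty list.
    obj['Sections'] = [s for s in (obj.get('Sections') or []) if s is not None]
    cleaned = dict(data)
    cleaned['obj'] = obj
    return cleaned


# Parser output copied from the report above.
result = {
    'obj': {
        'DOI': '',
        'Title': ['A message from the new ChemComm chair'],
        'Keywords': [],
        'Journal': [],
        'Sections': [None],
    },
    'html_txt': '<section_h1>\n</section_h1>',
}

cleaned = drop_none_sections(result)
print(cleaned['obj']['Sections'])
```

The helper copies the outer dict and the 'obj' dict instead of mutating them, so the original parser output stays available for debugging.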