seasidesparrow closed this issue 7 months ago
The following seems to work in isolation (using BeautifulSoup alone, not ElsevierParser):
from bs4 import BeautifulSoup


def get_groups(soup):
    group_list = []
    # Detach the first author-group found from the tree.
    ag = soup.find('ce:author-group').extract()
    # Each recursive call extracts one nested group from ag,
    # so the loop ends once ag has no nested groups left.
    while ag.find('ce:author-group'):
        group_list.append(get_groups(ag))
    group_list.append(ag)
    return group_list


def main():
    with open("cms_omg.xml", "rb") as fc:
        data = fc.read()
    soup = BeautifulSoup(data, "lxml-xml")
    auth_blocks = get_groups(soup)
    for a in auth_blocks:
        print(a)
        print("\n\n")


if __name__ == '__main__':
    main()
The solution above has two issues. First, the content of the outermost enclosing author-group tag is appended to the end of the group_list object, so the pieces come out of order. Second, while the content of the top enclosing author-group is of type str, everything else is of type list with length 1.
One possible solution:
def get_groups(soup):
    group_list = []
    ag = soup.find('ce:author-group').extract()
    while ag.find('ce:author-group'):
        group_list.append(get_groups(ag))
    # Put the enclosing group first, followed by the nested results.
    g2 = [ag]
    g2.extend(group_list)
    group_list = []
    for g in g2:
        # Nested results arrive as lists; keep the first item of each.
        if isinstance(g, list):
            group_list.append(g[0])
        else:
            group_list.append(g)
    return group_list
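A shorter variant of the same idea, offered as a sketch rather than as the fix that was merged: extending the result list (instead of appending to it) keeps every entry a Tag and keeps the groups in document order, and it also keeps groups nested more than two levels deep. The sample markup is made up for illustration, and html.parser is used only so the demo does not depend on lxml; the code earlier in this thread uses "lxml-xml".

```python
from bs4 import BeautifulSoup


def get_groups(tag):
    """Return the first author-group under `tag`, then its nested groups, flattened."""
    found = tag.find('ce:author-group')
    if found is None:
        return []
    outer = found.extract()
    nested = []
    # Each recursive call extracts one nested group from `outer`,
    # so this loop ends once `outer` has no nested groups left.
    while outer.find('ce:author-group') is not None:
        nested.extend(get_groups(outer))
    return [outer] + nested


# Made-up sample that mirrors the nesting described in this issue.
SAMPLE = (
    '<ce:author-group>CMS Collab'
    '<ce:author-group><ce:author>A1</ce:author>'
    '<sa:affiliation>Institution 1</sa:affiliation></ce:author-group>'
    '<ce:author-group><ce:author>A2</ce:author>'
    '<sa:affiliation>Institution 2</sa:affiliation></ce:author-group>'
    '</ce:author-group>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
groups = get_groups(soup)
for g in groups:
    print(g)
```

Because nested groups are extracted before the outer group is returned, the first entry holds only the collaboration-level content, and each later entry holds one institution's authors and affiliations.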
Fixed by #103
Describe the bug
A recent PhLB paper from the CMS collaboration has the author-affiliation metadata for each institution's author-group embedded within a master author-group for the collaboration itself:
<ce:author-group>...CMS Collab...<ce:author-group><ce:author>authors at Institution 1</ce:author><sa:affiliation>Institution 1</sa:affiliation></ce:author-group>....etc...</ce:author-group>
To Reproduce
Parse the test case file els_phlb_compound_affil.xml with release v0.9.17 of ADSIngest Parser. Each author will be assigned all affiliations present, rather than only those within the nested author-group they belong to.
Additional context
A recursive find-extract in BeautifulSoup that extracts author-group tags while soup.find('author-group') is not None may be able to do this, but each extracted author-group must itself be parsed recursively to check whether it too contains author-groups.
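To make the expected behaviour concrete, here is a minimal sketch of the per-group pairing. It uses the standard-library ElementTree rather than BeautifulSoup or the ADSIngest parser, and it walks the tree instead of extracting from it. The namespace URIs, the ce:collaboration element, the sample content, and the function names are all placeholders invented for this illustration. Real ce:author elements hold name sub-elements rather than bare text.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used only so the sample below is well-formed.
CE = "urn:example:ce"
SA = "urn:example:sa"

SAMPLE = f"""<root xmlns:ce="{CE}" xmlns:sa="{SA}">
<ce:author-group>
  <ce:collaboration>CMS Collab</ce:collaboration>
  <ce:author-group>
    <ce:author>Author A</ce:author>
    <ce:author>Author B</ce:author>
    <sa:affiliation>Institution 1</sa:affiliation>
  </ce:author-group>
  <ce:author-group>
    <ce:author>Author C</ce:author>
    <sa:affiliation>Institution 2</sa:affiliation>
  </ce:author-group>
</ce:author-group>
</root>"""


def pair_authors(group, pairs):
    """Pair each author with the affiliations of its own group only."""
    # findall() with a bare tag name matches direct children only, so
    # affiliations inside nested groups are not picked up here.
    affils = [a.text.strip() for a in group.findall(f"{{{SA}}}affiliation")]
    for author in group.findall(f"{{{CE}}}author"):
        pairs.append((author.text.strip(), affils))
    # Recurse so that every nested author-group is handled the same way.
    for child in group.findall(f"{{{CE}}}author-group"):
        pair_authors(child, pairs)
    return pairs


def authors_with_affiliations(xml_text):
    root = ET.fromstring(xml_text)
    pairs = []
    for group in root.findall(f"{{{CE}}}author-group"):
        pair_authors(group, pairs)
    return pairs


for name, affils in authors_with_affiliations(SAMPLE):
    print(name, affils)
```

Restricting each lookup to a group's direct children is what prevents an author from inheriting affiliations that belong to sibling groups.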