adsabs / ADSIngestParser

Curation parser library
MIT License

Elsevier parser is mishandling author groups embedded in author groups #102

Closed. seasidesparrow closed this issue 3 months ago

seasidesparrow commented 4 months ago

Describe the bug

A recent PhLB paper from the CMS Collaboration nests one author-group per institution, each with its own author and affiliation metadata, inside a master author-group for the collaboration itself:

<ce:author-group>...CMS Collab...<ce:author-group><ce:author>authors at Institution 1</ce:author><sa:affiliation>Institution 1</sa:affiliation></ce:author-group>....etc...</ce:author-group>

To Reproduce

Parse the test case file els_phlb_compound_affil.xml with release v0.9.17 of ADSIngestParser. Each author is assigned all affiliations present, rather than only those within the nested author group the author belongs to.

Additional context

A recursive find-and-extract in BeautifulSoup, which keeps extracting author-group tags for as long as soup.find('author-group') is not None, may be able to handle this. Each extracted author-group must itself be parsed recursively, in case it too contains author-groups.
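
To make the expected result concrete, here is a minimal sketch of the per-group pairing described above. It is not the ElsevierParser code. It uses the standard-library xml.etree.ElementTree instead of BeautifulSoup so that it runs without dependencies. The sample XML, the author names, and the pair_authors helper are all hypothetical, and the ce: and sa: prefixes are dropped to avoid declaring namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the nested structure in the report.
# Real Elsevier files use ce:/sa: namespaces and carry far more markup.
SAMPLE = """
<author-group>
  <collaboration>CMS Collaboration</collaboration>
  <author-group>
    <author>A. One</author>
    <author>B. Two</author>
    <affiliation>Institution 1</affiliation>
  </author-group>
  <author-group>
    <author>C. Three</author>
    <affiliation>Institution 2</affiliation>
  </author-group>
</author-group>
"""


def pair_authors(group, pairs=None):
    """Pair each author with the affiliations of its own author-group only."""
    if pairs is None:
        pairs = []
    # findall() with a bare tag name matches direct children only, so
    # affiliations belonging to sibling or nested groups are not picked up.
    affils = [a.text for a in group.findall("affiliation")]
    for author in group.findall("author"):
        pairs.append((author.text, affils))
    # Recurse into nested author-groups, however deep they go.
    for child in group.findall("author-group"):
        pair_authors(child, pairs)
    return pairs


pairs = pair_authors(ET.fromstring(SAMPLE))
for name, affils in pairs:
    print(name, affils)
```

The sketch relies on looking only at the direct children of each group. A descendant-wide search from the master group would collect every affiliation in the file, which matches the symptom reported here.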

seasidesparrow commented 4 months ago

The following seems to work in isolation (using BeautifulSoup alone, not ElsevierParser):

from bs4 import BeautifulSoup

def get_groups(soup):
    # Detach the first author-group, then recurse for as long as it
    # still contains nested author-groups.
    group_list = []
    ag = soup.find('ce:author-group').extract()
    while ag.find('ce:author-group'):
        group_list.append(get_groups(ag))
    group_list.append(ag)
    return group_list

def main():
    # Read as bytes and let the XML parser handle the encoding.
    with open("cms_omg.xml", "rb") as fc:
        data = fc.read()

    soup = BeautifulSoup(data, "lxml-xml")

    auth_blocks = get_groups(soup)
    for a in auth_blocks:
        print(a)
        print("\n\n")

if __name__ == '__main__':
    main()

seasidesparrow commented 4 months ago

The solution above has two issues. First, what remains of the outermost enclosing author-group tag is appended to the end of group_list, so the pieces come out of order. Second, the outermost author group is a single element (a bs4 Tag, not a list), whereas every other entry is a list of length 1.

One possible solution:

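def get_groups(soup):
    # Detach the first author-group so its nested groups can be pulled out.
    ag = soup.find('ce:author-group').extract()
    nested = []
    while ag.find('ce:author-group'):
        nested.append(get_groups(ag))
    # Put the enclosing group first, then unwrap each nested result,
    # which arrives as a list.
    # Note: g[0] keeps only the first entry of each nested result, so
    # groups nested more than one level deep are dropped.
    group_list = []
    for g in [ag] + nested:
        if isinstance(g, list):
            group_list.append(g[0])
        else:
            group_list.append(g)
    return group_list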
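
A possible variation, sketched here as an assumption and not as the fix that was merged, is to extend the result list instead of appending to it. That keeps the enclosing group first and returns a flat list at any nesting depth. The flatten_groups name, the sample XML, its id attributes, and the placeholder namespace URI are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two levels of nesting under one outer group.
# "urn:example:ce" is a placeholder, not the real Elsevier namespace.
SAMPLE = """<root xmlns:ce="urn:example:ce">
<ce:author-group id="outer">
  <ce:author-group id="inst1">
    <ce:author-group id="dept1"></ce:author-group>
  </ce:author-group>
  <ce:author-group id="inst2"></ce:author-group>
</ce:author-group>
</root>"""


def flatten_groups(group):
    """Return group followed by all nested author-groups, depth-first."""
    groups = [group]
    while True:
        nested = group.find("ce:author-group")
        if nested is None:
            break
        # extract() detaches the nested group from its parent, so the
        # next find() moves on to the following nested group.
        nested.extract()
        # extend() keeps the list flat; append() would nest lists.
        groups.extend(flatten_groups(nested))
    return groups


soup = BeautifulSoup(SAMPLE, "lxml-xml")
top = soup.find("ce:author-group").extract()
ids = [g["id"] for g in flatten_groups(top)]
print(ids)
```

Because each group is detached before it is recursed into, every returned tag contains only its own authors and affiliations, with no nested groups left inside it.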

seasidesparrow commented 3 months ago

Fixed by #103