CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License

[Springer] Paper title and journal name #38

zhugeyicixin closed this issue 5 years ago

zhugeyicixin commented 5 years ago
  1. It is natural to expect the paper title/journal name to be a string rather than a list, as we already discussed in the PR comments.

  2. Some odd pages contain several paper titles, for example: 10.1007/BF01161620 and 10.1007/s10230-014-0302-8

  3. The parser needs to be fixed for some journals, where it also picks up journal names that look like they come from the reference list, for example:

10.1007/s10562-004-3745-x: parsed Journal is ['Catalysis Letters', 'J. Catal.', 'J. Am. Chem. Soc.', 'J. Phys. Chem.', 'Catal. Lett.', 'Angew. Chem. Int. Edn.', 'J. Ind. Rng. Chem.', 'J. Catal.']

10.1007/s11244-005-2883-8: parsed Journal is ['Topics in Catalysis', 'Stud. Surf. Sci. Catal.', 'Appl. Catal. A: General', 'Stud. Surf. Sci. Catal.', 'Top Catal.', 'J. Phys. Chem.', 'Top. Catal.', 'Top. Catal.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Brennstoff-Chem.', 'Angew. Chem.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Catalysis Today', 'Fuel Process Technol.', 'Appl. Cat. A: General', 'CIT']

So I think maybe we should:

  1. Change the type of Journal and Title from list to str
  2. Maybe drop the HTML files that contain several titles, if they are useless?
  3. Fix the parser for Journal if we want to keep this field. Since the journal name is already known during scraping, we could also skip parsing Journal altogether.

What do you think? @IAmGrootel @hhaoyan

hhaoyan commented 5 years ago

Yes, title and journal name should each be either str or None. This is now being fixed and will land in a coming version.
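
For illustration only, here is a minimal sketch of what "str or None" could look like when applied to the lists the parser currently returns. The helper name `to_str_or_none` is hypothetical and not part of LimeSoup, and taking the first non-empty entry is an assumption based on the two examples above, where the article's own journal appears first in the list.

```python
from typing import List, Optional, Union


def to_str_or_none(value: Union[str, List[str], None]) -> Optional[str]:
    """Collapse a parsed field to a single string, or None if nothing usable.

    Hypothetical helper, not LimeSoup's API. Assumes the first non-empty
    entry is the one that belongs to the article itself.
    """
    if value is None:
        return None
    # Accept both the old list form and an already-normalized string.
    candidates = [value] if isinstance(value, str) else list(value)
    for candidate in candidates:
        text = candidate.strip()
        if text:
            return text
    return None


# Usage with a shortened version of the list reported for 10.1007/s10562-004-3745-x
journal = to_str_or_none(['Catalysis Letters', 'J. Catal.', 'J. Am. Chem. Soc.'])
print(journal)
```

This only hides the extra entries; it does not address why the parser collects them in the first place, so it would be a stopgap rather than a parser fix.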

hhaoyan commented 5 years ago

Just for tracking:

hhaoyan commented 5 years ago

solved