[Springer] Paper title and journal name

zhugeyicixin commented 5 years ago

It is natural to expect the paper title and journal name to be strings rather than lists, as we discussed in the PR comments.
Some unusual pages contain several paper titles, for example: 10.1007/BF01161620 and 10.1007/s10230-014-0302-8.
The parser needs to be fixed for some journals, for example:

10.1007/s10562-004-3745-x: parsed Journal is ['Catalysis Letters', 'J. Catal.', 'J. Am. Chem. Soc.', 'J. Phys. Chem.', 'Catal. Lett.', 'Angew. Chem. Int. Edn.', 'J. Ind. Rng. Chem.', 'J. Catal.']

10.1007/s11244-005-2883-8: parsed Journal is ['Topics in Catalysis', 'Stud. Surf. Sci. Catal.', 'Appl. Catal. A: General', 'Stud. Surf. Sci. Catal.', 'Top Catal.', 'J. Phys. Chem.', 'Top. Catal.', 'Top. Catal.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Brennstoff-Chem.', 'Angew. Chem.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Catalysis Today', 'Fuel Process Technol.', 'Appl. Cat. A: General', 'CIT']

So I think maybe we should:

Change the type of Journal and Title from list to str
Maybe get rid of HTML files containing several titles if they are useless?
Fix the parser for Journal if we want to keep this field. Since the journal name is already known during scraping, we could also skip parsing Journal.
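
To make the first proposal concrete, here is a minimal sketch of collapsing a list-valued field into a single string. The helper name `first_or_none` is hypothetical and not part of LimeSoup. It assumes the article's own journal is the first entry in the parsed list, which holds for the two examples above but may not hold in general.

```python
from typing import Optional, Sequence, Union


def first_or_none(value: Union[str, Sequence[str], None]) -> Optional[str]:
    """Collapse a parsed field to one string, or None if nothing usable was found.

    Hypothetical helper: assumes the first non-empty entry is the one we want.
    """
    if value is None:
        return None
    # Accept a bare string as well as a list of strings.
    candidates = [value] if isinstance(value, str) else list(value)
    for item in candidates:
        text = item.strip()
        if text:
            return text
    return None


# Shortened version of the list parsed for 10.1007/s10562-004-3745-x.
parsed_journal = ['Catalysis Letters', 'J. Catal.', 'J. Am. Chem. Soc.']
journal = first_or_none(parsed_journal)
```

This only hides the extra entries. It does not fix the parser, so the third proposal still applies if the Journal field is kept.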

What do you think? @IAmGrootel @hhaoyan

hhaoyan commented 5 years ago

Yes, title and journal name should each be either str or None. This is now being fixed and will land in the coming versions.
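
A check along these lines could be run against each publisher's parser output to confirm the str-or-None contract. This is only a sketch: the function name `fields_with_wrong_type` is made up, and it assumes the parser returns a dict with `Title` and `Journal` keys.

```python
def fields_with_wrong_type(parsed: dict, fields=("Title", "Journal")) -> list:
    """Return the names of fields whose value is neither str nor None.

    Hypothetical check; the key names are assumptions about the parser output.
    """
    bad = []
    for name in fields:
        value = parsed.get(name)  # a missing key is treated like None
        if value is not None and not isinstance(value, str):
            bad.append(name)
    return bad


# A list-valued Journal is flagged, while a string Title passes.
example = {"Title": "Some title", "Journal": ["Topics in Catalysis", "Top. Catal."]}
problems = fields_with_wrong_type(example)
```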

hhaoyan commented 5 years ago

Just for tracking:

[x] - RSC
[x] - ECS
[x] - Nature
[x] - Springer
[x] - Wiley
[x] - Elsevier
[x] - ACS

hhaoyan commented 5 years ago

solved

CederGroupHub / LimeSoup

[Springer] Paper title and journal name #38