WolfgangFahl / ConferenceCorpus

ScientificEventCorpus
Apache License 2.0

dblp xml parser skips some proceedings titles #5

Open WolfgangFahl opened 3 years ago

WolfgangFahl commented 3 years ago
select  key,conf,year,booktitle, title
from proceedings 
where title is null 
order by year

This query returns 188 entries with empty titles, for example:

| key | conf | year | booktitle | title |
|---|---|---|---|---|
| conf/ir/1980 | ir | 1980 | Textverarbeitung und Informatik | |
| conf/eurographics/1986 | eurographics | 1986 | Eurographics | |
| conf/imycs/1986 | imycs | 1987 | IMYCS | |
| conf/tapsoft/1989-2 | tapsoft | 1989 | TAPSOFT, Vol.2 | |
| conf/ifip5-3/1992 | ifip5-3 | 1992 | Manufacturing in the Era of Concurrent Engineering | |
| conf/tools/12-1993 | tools | 1993 | TOOLS (12/9) | |
| conf/echt/94 | echt | 1994 | | |
| conf/criwg/1995 | criwg | 1995 | CRIWG | |
| conf/gis/95 | gis | 1995 | | |
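
The check above can be reproduced against a small in-memory database. This is a minimal sketch, assuming a `proceedings` table with just the five columns from the query; the schema and the third sample row are hypothetical and are not taken from the project's actual database.

```python
import sqlite3

# Hypothetical minimal schema; the real ConferenceCorpus table may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE proceedings "
    "(key TEXT, conf TEXT, year INTEGER, booktitle TEXT, title TEXT)"
)
rows = [
    # Two rows copied from the listing above (title missing).
    ("conf/ir/1980", "ir", 1980, "Textverarbeitung und Informatik", None),
    ("conf/eurographics/1986", "eurographics", 1986, "Eurographics", None),
    # Hypothetical row whose title was parsed correctly.
    ("conf/example/2000", "example", 2000, "Example", "An example title"),
]
conn.executemany("INSERT INTO proceedings VALUES (?, ?, ?, ?, ?)", rows)

missing = conn.execute(
    "SELECT key, conf, year, booktitle, title FROM proceedings "
    "WHERE title IS NULL ORDER BY year"
).fetchall()
for row in missing:
    print(row)
print(len(missing), "entries with empty titles")
```

Running the same `SELECT` with `COUNT(*)` instead of the column list gives the number of affected records directly.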
WolfgangFahl commented 3 years ago
grep 'key="conf/ir/1980"' -A12 dblp.xml
<proceedings mdate="2018-06-23" key="conf/ir/1980">
<editor>Peter R. Wossidlo</editor>
<title>Textverarbeitung und Informatik, Fachtagung der GI, Bayreuth, Deutschland, 28.-30. Mai 1980</title>
<booktitle>Textverarbeitung und Informatik</booktitle>
<series href="db/series/ifb/index.html">Informatik-Fachberichte</series>
<volume>30</volume>
<publisher>Springer</publisher>
<year>1980</year>
<isbn>3-540-10148-9</isbn>
<url>db/conf/ir/text1980.html</url>
<ee>https://doi.org/10.1007/978-3-642-67700-7</ee>
</proceedings>
WolfgangFahl commented 3 years ago
empty title for 616{'mdate': '2019-05-14', 'key': 'conf/pfe/2001', 'editor': 'Frank van der Linden 0001', 'title': None, 'booktitle': 'PFE', 'series': 'Lecture Notes in Computer Science', 'volume': '2290', 'publisher': 'Springer', 'year': '2002', 'isbn': '3-540-43659-6', 'ee': 'https://doi.org/10.1007/3-540-47833-7', 'url': 'db/conf/pfe/pfe2001.html', 'conf': 'pfe'}
empty title for 789{'mdate': '2019-01-26', 'key': 'conf/hpcasia/2019', 'title': None, 'publisher': 'ACM', 'booktitle': 'HPC Asia', 'year': '2019', 'isbn': '978-1-4503-6632-8', 'ee': 'https://dl.acm.org/citation.cfm?id=3293320', 'url': 'db/conf/hpcasia/hpcasia2019.html', 'conf': 'hpcasia'}
<proceedings mdate="2019-05-14" key="conf/pfe/2001">
<editor>Frank van der Linden 0001</editor>
<title>Software Product-Family Engineering, 4th International Workshop, PFE 2001, Bilbao, Spain, October 3-5, 2001, Revised Papers</title>
<booktitle>PFE</booktitle>
<series href="db/series/lncs/index.html">Lecture Notes in Computer Science</series>
<volume>2290</volume>
<publisher>Springer</publisher>
<year>2002</year>
<isbn>3-540-43659-6</isbn>
<ee>https://doi.org/10.1007/3-540-47833-7</ee>
<url>db/conf/pfe/pfe2001.html</url>
</proceedings>
<proceedings mdate="2019-01-26" key="conf/hpcasia/2019">
<title>Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2019, Guangzhou, China, January 14-16, 2019</title>
<publisher>ACM</publisher>
<booktitle>HPC Asia</booktitle>
<year>2019</year>
<isbn>978-1-4503-6632-8</isbn>
<ee>https://dl.acm.org/citation.cfm?id=3293320</ee>
<url>db/conf/hpcasia/hpcasia2019.html</url>
</proceedings>
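
Both records carry a `<title>` in the XML, so the `None` values come from the parsing step rather than from the data. The cause is not established at this point in the thread. One common pitfall with streaming parsers is reading `elem.text` before the element has been read completely, or when the title contains nested markup. The following is a minimal sketch that reads each record on `end` events and joins all text with `itertext()`. It uses the standard library `xml.etree.ElementTree` rather than the project's lxml-based parser, and the function name `iter_proceedings` and the second sample record are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = """<dblp>
<proceedings mdate="2019-01-26" key="conf/hpcasia/2019">
<title>Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2019, Guangzhou, China, January 14-16, 2019</title>
<booktitle>HPC Asia</booktitle>
<year>2019</year>
</proceedings>
<proceedings key="conf/example/2000">
<title><i>Hypothetical</i> title with nested markup</title>
<year>2000</year>
</proceedings>
</dblp>"""


def iter_proceedings(source):
    """Yield one dict per <proceedings> record, built on 'end' events."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "proceedings":
            continue
        record = dict(elem.attrib)
        for child in elem:
            # itertext() also collects text inside nested tags such as <i>.
            # Repeated tags (e.g. several <editor>) overwrite each other here.
            record[child.tag] = "".join(child.itertext())
        yield record
        elem.clear()  # drop the record's children to limit memory use


records = list(iter_proceedings(io.BytesIO(SAMPLE.encode("utf-8"))))
for rec in records:
    print(rec["key"], "->", rec["title"])
```

On an `end` event the whole element, including its text, has been parsed, which is not guaranteed on a `start` event.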
WolfgangFahl commented 3 years ago

See the fine print in https://dblp.org/faq/16154937.html

WolfgangFahl commented 3 years ago
wc -l dblp.xml 
78745438 dblp.xml
WolfgangFahl commented 3 years ago

See also https://bugs.launchpad.net/lxml/+bug/1742121 (`sourceline` and the value 65535)
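
With 78745438 lines in `dblp.xml`, a line number reported by the parser is of little use if it stops at 65535. A sketch of finding a record's line by counting lines directly, roughly what `grep -n` does; the helper name `find_line` and the small temporary file standing in for `dblp.xml` are hypothetical.

```python
import os
import tempfile


def find_line(path, needle):
    """Return the 1-based number of the first line containing needle, or None."""
    # errors="replace" keeps the scan going if the file is not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                return number
    return None


# Tiny stand-in for dblp.xml so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(
        "<dblp>\n"
        '<article key="a/1"/>\n'
        '<proceedings key="conf/ir/1980">\n'
        "</proceedings>\n"
        "</dblp>\n"
    )
    sample_path = tmp.name

line_number = find_line(sample_path, 'key="conf/ir/1980"')
print(line_number)
os.remove(sample_path)
```

The resulting number can then be used as the start of a `sed -n` line range when cutting a snippet out of the full file.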

WolfgangFahl commented 3 years ago
sed -n '19209395,19400000p;19400000q' dblp.xml > snippet.xml

...
lxml.etree.XMLSyntaxError: Entity 'eacute' not defined, line 166, column 27

grep "&eacute;" dblp.xml | wc -l
     931
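
The error is consistent with the snippet having been cut from the middle of the file: it no longer has a DOCTYPE, so named entities such as `&eacute;` are undefined. One way to parse such a fragment is to wrap it in a root element preceded by a DOCTYPE whose internal subset declares the entities it uses. This is a minimal sketch with the standard library `xml.etree.ElementTree`, not the lxml call from the traceback. The helper `wrap_fragment`, the root name `snippet`, the sample record, and the use of `html.entities` for the code points are assumptions; the DTD shipped with dblp remains the authoritative source for its entity definitions.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities predefined by XML itself; these need no declaration.
XML_BUILTIN = {"amp", "lt", "gt", "quot", "apos"}


def wrap_fragment(fragment, root="snippet"):
    """Wrap an XML fragment in a root element plus a DOCTYPE declaring its entities."""
    names = set(re.findall(r"&([A-Za-z][A-Za-z0-9]*);", fragment)) - XML_BUILTIN
    decls = "".join(
        f'<!ENTITY {name} "&#{name2codepoint[name]};">'
        for name in sorted(names)
        if name in name2codepoint  # unknown names stay undeclared
    )
    return f"<!DOCTYPE {root} [{decls}]><{root}>{fragment}</{root}>"


# Hypothetical record; the fragment must begin and end on record boundaries.
fragment = (
    '<proceedings key="conf/example/2000">'
    "<editor>Andr&eacute; Example</editor>"
    "</proceedings>"
)
tree = ET.fromstring(wrap_fragment(fragment))
editor = tree.find("proceedings/editor").text
print(editor)
```

For the full file, keeping `dblp.dtd` next to `dblp.xml` and letting the parser load it avoids the need for any wrapping.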