Closed agentilb closed 2 years ago
I compared one record from 2020, where the country field was extracted correctly, with one from 2021, where it was not. The affiliations appear to be stored with the same structure in both (inside <aff>).
Or could the error come from the line break at the beginning of the field, after <label>1</label>, in the 2021 record?
https://repo.scoap3.org/records/57853 (2020)
In JSON:
"authors": [
{
"affiliations": [
{
"country": "China",
"value": " \n\t\t\t\t\t School of Physics, Beihang University, Beijing 100191, China"
},
{
"country": "China",
"value": " \n\t\t\t\t\t School of Physics, Southeast University, Nanjing 210094, China"
}
],
"full_name": "Chen, Hua-Xing"
}
],
In the XML file:
<contrib-group content-type="all">
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Chen</surname>
<given-names>Hua-Xing</given-names>
</name>
<name content-type="non-latin-no-space" name-style="eastern">
<surname>陈</surname>
<given-names>华星</given-names>
</name>
<xref ref-type="aff" rid="cpc_44_11_114003_af1">1</xref>
<xref ref-type="aff" rid="cpc_44_11_114003_af2">2</xref>
<xref ref-type="aff" rid="cpc_44_11_114003_em1">†</xref>
</contrib>
<aff id="cpc_44_11_114003_af1">
<label>1</label> School of Physics, Beihang University, Beijing 100191, China
</aff>
<aff id="cpc_44_11_114003_af2">
<label>2</label> School of Physics, Southeast University, Nanjing 210094, China
</aff>
<ext-link ext-link-type="email" id="cpc_44_11_114003_em1" xlink:type="simple">hxchen@buaa.edu.cn</ext-link>
</contrib-group>
https://repo.scoap3.org/records/67419 (2021)
In JSON
"authors": [
{
"affiliations": [
{
"country": "HUMAN CHECK"
}
],
"surname": "Chen",
"email": "muyang@hunnu.edu.cn",
"full_name": "Chen, Muyang",
"given_names": "Muyang"
}
],
In XML
<contrib-group>
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Chen</surname>
<given-names>Muyang</given-names>
</name>
<name content-type="non-latin-no-space" name-style="eastern">
<surname>陈</surname>
<given-names>慕阳</given-names>
</name>
<xref ref-type="aff" rid="affiliation01">1</xref>
<email>muyang@hunnu.edu.cn</email>
</contrib>
<aff id="affiliation01">
<label>1</label>
Department of Physics, Hunan Normal University, Changsha 410081, China
</aff>
</contrib-group>
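To test the line-break hypothesis outside the harvester, here is a minimal sketch using only the Python 3 standard library. It is not the conversion script's code: `affiliation_text` is a hypothetical helper, and the two `<aff>` snippets are copied from the records above. It collects the text of an `<aff>` element, skips the `<label>`, and collapses whitespace, so both layouts should give a clean string.

```python
import xml.etree.ElementTree as ET

# <aff> elements copied from the 2020 and 2021 records quoted above.
AFF_2020 = """<aff id="cpc_44_11_114003_af1">
<label>1</label> School of Physics, Beihang University, Beijing 100191, China
</aff>"""

AFF_2021 = """<aff id="affiliation01">
<label>1</label>
Department of Physics, Hunan Normal University, Changsha 410081, China
</aff>"""


def affiliation_text(aff_xml):
    """Hypothetical helper: text of an <aff> without its <label>, whitespace-normalized."""
    aff = ET.fromstring(aff_xml)
    parts = [aff.text or ""]
    for child in aff:
        if child.tag != "label":
            # Keep the text of any non-label child element.
            parts.append("".join(child.itertext()))
        # The affiliation string itself usually sits in the tail after <label>.
        parts.append(child.tail or "")
    # split()/join collapses newlines, tabs and repeated spaces.
    return " ".join("".join(parts).split())


print(affiliation_text(AFF_2020))
print(affiliation_text(AFF_2021))
```

If both calls return a clean affiliation string, the extra line break alone does not prevent extraction with this approach. That would suggest the difference lies in how the harvester selects the text, but it does not prove it.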
I ran the parser (scrapy crawl IOP -a ....) and compared the JSON it produces with the JSON stored in the repo. For example:
2014 Link: https://repo.scoap3.org/records/5265 DOI: 10.1088/1367-2630/16/12/125012 Has clean affiliation visible in repo:
"authors": [
{
"raw_name": "Lattanzi, Massimiliano",
"affiliations": [
{
"country": "Italy",
"value": "Dipartimento di Fisica e Scienza della Terra, Universit\u00e1 di Ferrara and INFN sezione di Ferrara, Polo Scientifico e Tecnologico\u2014Edificio C Via Saragat, 1, I-44122 Ferrara, Italy"
}
],
"surname": "Lattanzi",
"given_names": "Massimiliano",
"full_name": "Lattanzi, Massimiliano"
},
{
"raw_name": "Lineros, Roberto A",
"affiliations": [
{
"country": "Spain",
"value": "Instituto de F\u00edsica Corpuscular\u2014CSIC/Universitad de Valencia, Parc Cient\u00edfic, calle Catedr\u00e1tico Jos\u00e9 Beltr\u00e1n, 2, E-46980 Paterna, Spain"
}
],
"surname": "Lineros",
"given_names": "Roberto A",
"full_name": "Lineros, Roberto A"
},
{
"raw_name": "Taoso, Marco",
"affiliations": [
{
"country": "France",
"value": "Institut de Physique Th\u00e9orique, CNRS, URA 2306 CEA/Saclay, F-91191 Gif-sur-Yvette, France"
}
],
"surname": "Taoso",
"given_names": "Marco",
"full_name": "Taoso, Marco"
}
],
Affiliation (json) output from scrapy crawl :
{
"authors":[
{
"affiliations":[
{
"value":"u"", Dipartimento di Fisica e Scienza della Terra, Universit\\xe1 di Ferrara and INFN sezione di Ferrara, , Polo Scientifico e Tecnologico\u2014Edificio C Via Saragat, 1, I-44122 Ferrara, , Italy,"
}
],
"surname":"u""Lattanzi",
"given_names":"u""Massimiliano",
"full_name":"u""Lattanzi, Massimiliano"
},
{
"affiliations":[
{
"value":"u"", Instituto de F\\xedsica Corpuscular\u2014CSIC/Universitad de Valencia, Parc Cient\\xedfic, , calle Catedr\\xe1tico Jos\\xe9 Beltr\\xe1n, 2, E-46980 Paterna, , Spain,"
}
],
"surname":"u""Lineros",
"given_names":"u""Roberto A",
"full_name":"u""Lineros, Roberto A"
},
{
"affiliations":[
{
"value":"u"", Institut de Physique Th\\xe9orique, CNRS, , URA 2306 CEA/Saclay, F-91191 Gif-sur-Yvette, , France,"
}
],
"surname":"u""Taoso",
"given_names":"u""Marco",
"full_name":"u""Taoso, Marco"
}
],
"titles":[
{
"source":"IOP",
"subtitle":"",
"title":"u""Connecting neutrino physics with dark matter"
}
],
"publication_info":[
{
"journal_volume":"u""16",
"material":"article",
"page_end":"",
"artid":"u""125012",
"journal_title":"u""New Journal of Physics",
"pubinfo_freetext":"",
"page_start":"",
"year":2014,
"journal_issue":"u""12"
}
],
"copyright":[
{
"material":"",
"holder":"",
"statement":"u""\\xa9 2014 IOP Publishing Ltd and Deutsche Physikalische Gesellschaft",
"year":"u""2014"
}
],
"collections":[
{
"primary":"Chinese Physics C"
}
],
"files":[
],
"local_files":[
{
"value":{
"path":"/tmp/IOP/unpacked/oup_test_3tvafJ/oup_test/125012.xml",
"filetype":"xml"
}
},
{
"value":{
"path":"/tmp/IOP/unpacked/oup_test_3tvafJ/oup_test/125012.pdf",
"filetype":"pdf"
}
}
],
"dois":[
{
"value":"u""10.1088/1367-2630/16/12/125012"
}
],
"page_nr":[
19
],
"acquisition_source":{
"date":"2022-03-21T14:11:08.007889",
"source":"IOP",
"method":"IOP",
"submission_number":""
},
"license":[
{
"url":"u""Creative Commons Attribution 3.0 licence",
"license":""
}
],
"record_creation_date":"2022-03-21T14:11:08.007896",
"abstracts":[
{
"source":"IOP",
"value":"u""The origin of neutrino masses and the nature of dark matter are two in most pressing open questions in modern astro-particle physics. We consider here the possibility that these two problems are related, and review some theoretical scenarios which offer common solutions. A simple possibility is that the dark matter particle emerges in minimal realizations of the seesaw mechanism, as in the majoron and sterile neutrino scenarios. We present the theoretical motivation for both models and discuss their phenomenology, confronting the predictions of these scenarios with cosmological and astrophysical observations. Finally, we discuss the possibility that the stability of dark matter originates from a flavor symmetry of the leptonic sector. We review a proposal based on an A$_{4}$ flavor symmetry."
}
],
"imprints":[
{
"date":"2014-12-22",
"publisher":"IOP"
}
]
}
2020 Link: https://repo.scoap3.org/records/57853 DOI: 10.1088/1674-1137/abae4b Has Affiliation visible in repo but with tabs and new lines (\t and \n):
"authors": [
{
"affiliations": [
{
"country": "China",
"value": " \n\t\t\t\t\t School of Physics, Beihang University, Beijing 100191, China"
},
{
"country": "China",
"value": " \n\t\t\t\t\t School of Physics, Southeast University, Nanjing 210094, China"
}
],
"full_name": "Chen, Hua-Xing"
}
],
Affiliation (json) output from scrapy crawl :
"authors":[
{
"affiliations":[
{
"value":""
}
],
"surname":"u""Chen",
"given_names":"u""Hua-Xing",
"full_name":"u""Chen, Hua-Xing"
}
],
"titles":[
{
"source":"IOP",
"subtitle":"",
"title":"u""Decay properties of the (3900) through the Fierz rearrangement "
}
],
"publication_info":[
{
"journal_volume":"u""44",
"material":"article",
"page_end":"",
"artid":"u""114003",
"journal_title":"u""Chinese Physics C",
"pubinfo_freetext":"",
"page_start":"",
"year":2020,
"journal_issue":"u""11"
}
],
"copyright":[
{
"material":"",
"holder":"",
"statement":"u""\\xa9 2020Chinese Physical Society and the Institute of High Energy Physics of the Chinese Academy of Sciences and the Institute of Modern Physics of the Chinese Academy of Sciences and IOP Publishing Ltd",
"year":"u""2019"
}
],
"collections":[
{
"primary":"Chinese Physics C"
}
],
"files":[
],
"local_files":[
{
"value":{
"path":"/tmp/IOP/unpacked/oup_test_I8WXPc/oup_test/abae4b.xml",
"filetype":"xml"
}
},
{
"value":{
"path":"/tmp/IOP/unpacked/oup_test_I8WXPc/oup_test/abae4b.pdf",
"filetype":"pdf"
}
}
],
"dois":[
{
"value":"u""10.1088/1674-1137/abae4b"
}
],
"page_nr":[
18
],
"acquisition_source":{
"date":"2022-03-21T14:10:19.806459",
"source":"IOP",
"method":"IOP",
"submission_number":""
},
"license":[
{
"url":"u""Creative Commons Attribution 3.0 licence",
"license":""
}
],
"record_creation_date":"2022-03-21T14:10:19.806469",
"abstracts":[
{
"source":"IOP",
"value":"u""We systematically construct all the tetraquark currents/operators of J$^{PC}$ = 1$^{+-}$ with the quark configurations $[cq][\\bar c \\bar q]$ , $[cq][\\bar c \\bar q]$ , and $[cq][\\bar c \\bar q]$ ( $[cq][\\bar c \\bar q]$ ), and derive their relations through the Fierz rearrangement of the Dirac and color indices. Using the transformations of $[cq][\\bar c \\bar q]$ and $[cq][\\bar c \\bar q]$ , we study decay properties of the $[cq][\\bar c \\bar q]$ as a compact tetraquark state; while using the transformation of $[cq][\\bar c \\bar q]$ , we study its decay properties as a hadronic molecular state."
}
],
"imprints":[
{
"date":"2020-11-01",
"publisher":"IOP"
}
]
}
2021: Link: https://repo.scoap3.org/records/67419 DOI: 10.1088/1674-1137/ac2a1a Affiliations visible in repo:
"authors": [
{
"affiliations": [
{
"country": "HUMAN CHECK"
}
],
"surname": "Chen",
"email": "muyang@hunnu.edu.cn",
"full_name": "Chen, Muyang",
"given_names": "Muyang"
}
],
Affiliation (json) output from scrapy crawl :
{
"authors":[
{
"affiliations":[
{
"value":""
}
],
"surname":"u""Chen",
"given_names":"u""Muyang",
"full_name":"u""Chen, Muyang",
"email":"u""muyang@hunnu.edu.cn"
}
],
"titles":[
{
"source":"IOP",
"subtitle":"",
"title":"u""Radial excited heavy mesons "
}
],
"publication_info":[
{
"journal_volume":"u""45",
"material":"article",
"page_end":"",
"artid":"u""123104",
"journal_title":"u""Chinese Physics C",
"pubinfo_freetext":"",
"page_start":"",
"year":2021,
"journal_issue":"u""12"
}
],
"copyright":[
{
"material":"",
"holder":"",
"statement":"u""\\xa9 2021Chinese Physical Society and the Institute of High Energy Physics of the Chinese Academy of Sciences and the Institute of Modern Physics of the Chinese Academy of Sciences and IOP Publishing Ltd",
"year":"u""2021"
}
],
"collections":[
{
"primary":"Chinese Physics C"
}
],
"files":[
],
"local_files":[
{
"value":{
"path":"/tmp/IOP/unpacked/oup_test_I8WXPc/oup_test/ac2a1a.xml",
"filetype":"xml"
}
},
{
"value":{
"path":"/tmp/IOP/unpacked/oup_test_I8WXPc/oup_test/ac2a1a.pdf",
"filetype":"pdf"
}
}
],
"dois":[
{
"value":"u""10.1088/1674-1137/ac2a1a"
}
],
"page_nr":[
6
],
"acquisition_source":{
"date":"2022-03-21T14:10:19.837813",
"source":"IOP",
"method":"IOP",
"submission_number":""
},
"license":[
{
"url":"u""Creative Commons Attribution 3.0 licence",
"license":""
}
],
"record_creation_date":"2022-03-21T14:10:19.837821",
"abstracts":[
{
"source":"IOP",
"value":"u""In this study, the first radial excited heavy pseudoscalar and vector mesons ( $\\eta_c(2S)$ , $\\eta_c(2S)$ , $\\eta_c(2S)$ , $\\eta_c(2S)$ , $\\eta_c(2S)$ , and $\\eta_c(2S)$ ) are investigated using the Dyson-Schwinger equation and Bethe-Salpeter equation approach. It is shown that the effective interactions of the radial excited states are harder than those of the ground states. With the interaction well determined by fitting the masses and leptonic decay constants of $\\eta_c(2S)$ and $\\eta_c(2S)$ , the first radial excited heavy mesons could be quantitatively described in the rainbow ladder approximation. The masses and leptonic decay constants of $\\eta_c(2S)$ , $\\eta_c(2S)$ , $\\eta_c(2S)$ , and $\\eta_c(2S)$ are predicted."
}
],
"imprints":[
{
"date":"2021-12-01",
"publisher":"IOP"
}
]
}
Summary: scrapy returns an empty affiliations.value for the articles from 2020 and 2021. However, the 2020 record somehow has affiliations.value parsed and visible in the repo (with a country name instead of HUMAN CHECK), although it is not clean: it starts with a new line and 5 tabs. The article from 2014 has a clean affiliations.value, and in its XML the country is put in
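As an illustration of the clean-up described in the summary, here is a small Python 3 sketch. Both `clean_value` and `guess_country` are hypothetical helpers, and taking the last comma-separated piece as the country is only an assumption for this example, not how the SCOAP3 pipeline is known to derive it.

```python
def clean_value(value):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return " ".join(value.split())


def guess_country(value):
    """Assumption for illustration: the country is the last non-empty comma-separated piece."""
    pieces = [p.strip() for p in clean_value(value).split(",") if p.strip()]
    return pieces[-1] if pieces else None


# Value as stored in the repo for the 2020 record (new line and 5 tabs).
stored_2020 = " \n\t\t\t\t\t School of Physics, Beihang University, Beijing 100191, China"
print(clean_value(stored_2020))
print(guess_country(stored_2020))

# An empty value, as returned by scrapy for 2020 and 2021, gives no country.
print(guess_country(""))
```

With an empty value there is nothing to derive a country from, which is consistent with the HUMAN CHECK placeholder seen in the 2021 record.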
@agentilb I checked the oldest IOP file we have on our server and, sadly, it is from 2021. Can we ask IOP to upload a file from 2020 to their FTP for testing purposes? I would like to harvest it again to make sure we get the same result we have right now: JSON without an affiliation, but a visible affiliation in the repo. For now, I don't understand how the country name got filled in the affiliation field.
@ErnestaP I will ask IOP to provide samples of 2020 articles.
Two articles from 2021 have country fields (partially): https://repo.scoap3.org/search?page=1&size=20&journal=Chinese%20Physics%20C&year=2021--2022&country=usa Maybe it is worth testing the harvest for those two articles as well.
The article https://repo.scoap3.org/records/67454 has really messy country data in its JSON. Even where fields were partly parsed, they are wrong. For example, according to the XML, F. Baryshnikov should be assigned to Russia, but the record puts the USA.
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Baryshnikov</surname>
<given-names>F.</given-names>
</name>
<xref ref-type="aff" rid="affiliation82">82</xref>
</contrib>
<aff id="affiliation82">
<label>82</label>
National University of Science and Technology “MISIS”, Moscow, Russia, associated to
<sup>41</sup>
</aff>
<aff id="affiliation41">
<label>41</label>
Institute of Theoretical and Experimental Physics NRC Kurchatov Institute (ITEP NRC KI), Moscow, Russia
</aff>
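To make the expected mapping explicit, here is a hedged Python 3 sketch that resolves each author's `<xref rid="...">` to the matching `<aff id="...">`. `affiliations_by_author` is a hypothetical helper, and the wrapper element with the `xlink` namespace declaration is added only so that the snippet parses on its own. The nested `<label>` and `<sup>` contents are dropped, so the "associated to 41" marker cannot leak another affiliation into the string.

```python
import xml.etree.ElementTree as ET

# XML from the record above; the wrapper element and namespace declaration are added for parsing.
SAMPLE = """<contrib-group xmlns:xlink="http://www.w3.org/1999/xlink">
<contrib contrib-type="author" xlink:type="simple">
<name name-style="western">
<surname>Baryshnikov</surname>
<given-names>F.</given-names>
</name>
<xref ref-type="aff" rid="affiliation82">82</xref>
</contrib>
<aff id="affiliation82">
<label>82</label>
National University of Science and Technology “MISIS”, Moscow, Russia, associated to
<sup>41</sup>
</aff>
<aff id="affiliation41">
<label>41</label>
Institute of Theoretical and Experimental Physics NRC Kurchatov Institute (ITEP NRC KI), Moscow, Russia
</aff>
</contrib-group>"""


def affiliations_by_author(xml_text):
    """Hypothetical helper: map each author's surname to the affiliations referenced by rid."""
    root = ET.fromstring(xml_text)
    affs = {}
    for aff in root.iter("aff"):
        # Keep only the text directly inside <aff>; skip <label> and <sup> contents.
        parts = [aff.text or ""] + [child.tail or "" for child in aff]
        affs[aff.get("id")] = " ".join("".join(parts).split())
    result = {}
    for contrib in root.iter("contrib"):
        surname = contrib.findtext("name/surname")
        rids = [x.get("rid") for x in contrib.findall("xref") if x.get("ref-type") == "aff"]
        result[surname] = [affs[rid] for rid in rids if rid in affs]
    return result


print(affiliations_by_author(SAMPLE))
```

Under these assumptions, Baryshnikov resolves only to the affiliation with id affiliation82, whose text contains Russia and not the USA.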
I got an answer from IOP:
tested in qa, changes are working :)
DOI: 10.1088/1674-1137/ac5010 PROD: https://repo.scoap3.org/records/69508 QA: https://repo.qa.scoap3.org/records/61744
DOI: 10.1088/1674-1137/ac500e PROD: https://repo.scoap3.org/records/69507 QA: https://repo.qa.scoap3.org/records/61743
@ErnestaP isn't it deployed on PROD already? Can we close the issue?
Yes, it is. For example, you can see correctly parsed countries here: PROD: https://repo.scoap3.org/records/69507 QA: https://repo.qa.scoap3.org/records/61743
I noticed that many IOP articles have "HUMAN CHECK" in the country field (authors.affiliations.country). Example: https://repo.scoap3.org/records/67376
The country should be derived from the affiliation field in the metadata received from the publisher.
For example, I see that in the XML, the affiliations seem to be present: https://repo.scoap3.org/api/files/cf6b5c5e-a682-437c-bdf8-c12aeb8c9c0d/10.1088/1674-1137/abd16d.xml
Is there an issue with our conversion script?