cern-sis / issues-scoap3


Country share 2021/2022 #189

Closed agentilb closed 6 months ago

agentilb commented 10 months ago

Hi,

I would need the same analysis as the one done here: https://github.com/cern-sis/issues-scoap3/issues/72

But on 2021/2022 data.

If possible, I would need the data in the last week of August.

You can take the 2022 GDP data from here: API_NY.GDP.MKTP.CD_DS2_en_csv_v2_5728855.csv

ErnestaP commented 10 months ago

@agentilb here are the country share script results from production. Please take a look and let us know if everything is as expected.

Russia as a country is on the list because we have an affiliation JINR, which is Russian, so it appears in the results.

Also, you will see that some records have an UNKNOWN country, which means that the author doesn't have affiliation info.
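To illustrate the UNKNOWN rule described above, here is a minimal sketch. It assumes a hypothetical record layout in which each author is a dict with an `affiliations` list whose entries carry a `country` key; the production script's actual data model may differ.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"


def author_countries(author):
    """Return the countries of one author, or [UNKNOWN] if none are recorded.

    `author` is a hypothetical dict; the `affiliations` and `country`
    keys are assumptions made for this example only.
    """
    countries = [
        aff["country"]
        for aff in author.get("affiliations", [])
        if aff.get("country")
    ]
    return countries or [UNKNOWN]


def count_countries(authors):
    """Count how often each country (or UNKNOWN) appears across authors."""
    counts = Counter()
    for author in authors:
        counts.update(author_countries(author))
    return counts
```

With this fallback, an author with an empty or missing affiliation list is counted under UNKNOWN instead of being dropped.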

results.csv

agentilb commented 10 months ago

Hi @ErnestaP,

Thanks a lot for this!

What criteria did you use to select the articles? If I check in the repo, I find 15031 records: https://repo.scoap3.org/search?page=1&size=20&q=&year=2021--2022 and in the CSV file, there are only 11925 articles.

Also, it is perfectly normal that we have Russian authors. The new rule applies only to a specific case: Russian authors within a collaboration and identified with the specific string in the affiliation field.

ErnestaP commented 10 months ago

Thank you for checking! I ran it with --from_year 2021 --to_year 2022 on production. I need to look more closely at why the numbers differ.

ErnestaP commented 10 months ago

Hi @agentilb! I found out why not all records were collected: the country names in the GDP file are slightly different from those in the previous year's file. This means that some countries were not found in the list and were skipped. The issue is fixed, and I am uploading the new file. Let me know if the result is as expected :)
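A failure of this kind can be made visible by reporting unmatched names instead of skipping them silently. The following is a sketch only: `split_by_gdp_match` and the dict-based GDP table are hypothetical names for illustration, not taken from the actual script, and the GDP values are placeholders.

```python
def split_by_gdp_match(record_countries, gdp_by_country):
    """Separate countries that have a GDP entry from those that do not."""
    matched, unmatched = [], []
    for country in record_countries:
        if country in gdp_by_country:
            matched.append(country)
        else:
            unmatched.append(country)
    return matched, unmatched


# Placeholder values, not real GDP figures.
gdp_by_country = {"Turkiye": 1.0, "Switzerland": 2.0}

matched, unmatched = split_by_gdp_match(["Turkey", "Switzerland"], gdp_by_country)
if unmatched:
    # Surface the mismatch so that records are not lost without notice.
    print(f"No GDP entry for: {sorted(unmatched)}")
```

Logging or failing on a non-empty `unmatched` list would flag a spelling change in a new GDP file before the record counts diverge.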

results.csv

agentilb commented 10 months ago

Hi @ErnestaP

Thanks a lot! It seems the number of articles is fine now. However, the number of UNKNOWN affiliations seems to be really high.

I have checked a few, and I think something is wrong. See https://repo.scoap3.org/records/72513 (10.1007/JHEP09(2022)048): the author is from Korea, but in the file the country is marked as UNKNOWN.

I also see that the column for USA is empty, so something is not working.

Could you please check again?

Thanks,

Anne

ErnestaP commented 9 months ago

Hi @agentilb, thanks a lot for checking. There was a slight issue with the mapping, since the new GDP file uses different country names from the ones used in the last script run. For example: Turkey -> Turkiye
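One way to handle such renames is a small alias table applied before the GDP lookup. This is a sketch under assumptions: `COUNTRY_ALIASES` and `normalize_country` are hypothetical names, and the table contains only the single example mentioned in this thread.

```python
# Hypothetical alias table: maps a name as it appears in the records to the
# spelling used in the GDP file. Only the example from this thread is listed.
COUNTRY_ALIASES = {"Turkey": "Turkiye"}


def normalize_country(name, aliases=COUNTRY_ALIASES):
    """Return the GDP-file spelling of a country name, if an alias exists."""
    cleaned = name.strip()
    return aliases.get(cleaned, cleaned)
```

Names without an alias pass through unchanged, so the table only needs entries for spellings that actually differ between the two sources.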

Attaching the latest script result: results.csv

Ernesta

agentilb commented 9 months ago

Thank you Ernesta, I have checked the data and they look coherent! Thanks again!

ErnestaP commented 6 months ago

@agentilb can we close this issue?