PBrockmann / PANGAEA_Scraping

web scraping from the OA-ICC PANGAEA
MIT License

Figure on the country of first author #6

Closed Yan-yang35 closed 10 months ago

Yan-yang35 commented 2 years ago

Hi Patrick, I saw the bar plots of the country of first author, thanks a lot! Fred and I had earlier decided to show this on a map. Below are the maps I made in R: (a) papers from which data were archived and (b) papers for which data could not be obtained. It would be great if you could also make them in Python and publish them in the Jupyter notebook. This is not a priority at the moment, since I have already made them in R. There is also an issue with the number of papers from China: I checked the "OAICCdb" file and found only 130 papers from China, but in the bar plot the number is more than 300.

image

PBrockmann commented 2 years ago

I agree that it would be nice and coherent to have all article plots available from a notebook (or several). Here is a first try at this kind of map. This type of representation is not easy to analyze precisely. Some circles overlap, in Europe for example.

image

I haven't yet separated the numbers between before and after 2015. Some entries may also not be located correctly, for example UK (Bermuda).

PBrockmann commented 2 years ago

Here are the numbers found (only those that are >1):

USA                        573
Australia                  317
China                      308
Germany                    278
UK                         207
Portugal                    99
Spain                       92
Italy                       91
Japan                       85
France                      82
Canada                      59
New Zealand                 56
Sweden                      55
Norway                      51
Chile                       49
South Korea                 34
Brazil                      29
Monaco                      25
Israel                      24
Belgium                     23
India                       23
The Netherlands             22
Greece                      15
Finland                     13
Malaysia                    11
China, Hong Kong            10
Denmark                      8
China, Taiwan                8
Mexico                       8
South Africa                 7
Turkey                       7
Argentina                    7
UK (Bermuda)                 6
Poland                       6
Indonesia                    5
Philippines                  5
Estonia                      4
Panama                       4
Egypt                        4
Kingdom of Saudi Arabia      4
Brunei                       3
Switzerland                  3
Austria                      2
Thailand                     2
Ireland                      2
Kuwait                       2
Namibia                      2
Korea                        2
New Caledonia                2

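
A list like the one above could be produced along these lines. This is a minimal sketch, not the notebook's actual code: the column name `country_first_author` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; the real rows would come from the OA-ICC export.
df = pd.DataFrame({
    "country_first_author": ["USA", "USA", "China", "USA", "China", "Monaco"],
})

# Count papers per country of first author, most frequent first.
counts = df["country_first_author"].value_counts()

# Keep only the countries that appear more than once.
frequent = counts[counts > 1]
print(frequent)
```

Because the counts are keyed by the label exactly as it is written in the file, two spellings of the same country end up as two separate rows.
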
Yan-yang35 commented 2 years ago

Hi Patrick, thanks for the map. But we need two maps: one showing the number of papers from which data were archived and the other showing the number of papers for which data could not be obtained. For the first one, please count the numbers from the file "OAICCdb_101121CG", and for the second one, from the file "OAICCnoanswer_101121CG". These are the latest files exported from the bibliodatabase. In these maps, there is no need to separate numbers between before and after 2015. I googled the coordinates of the Islands of Bermuda: the latitude is 32.299507 and the longitude is -64.790337. Can we use these coordinates to locate UK (Bermuda)?

OAICCdb_101121CG.xlsx OAICCnoanswer_101121CG.xlsx

PBrockmann commented 2 years ago

Hi Yan, OK, I will work from those 2 files, which I didn't know about before. Yes, I know how to retrieve a longitude and latitude for "UK (Bermuda)"; it is just that I do this from a program, not manually. And "UK (Bermuda)" is not the standard name for this island, which would rather be "Bermuda". Once again, the idea is to avoid manual fixes...

PBrockmann commented 2 years ago

> And there is an issue of the number of China paper, I checked the "OAICCdb" file, only 130 papers from China, but in the bar plot the number is more than 300.

This difference is due to the fact that I used the file http://oa-icc.ipsl.fr/checking/OAICC_20211012.csv, which groups all the papers (included, and not included for various reasons). If I produce a map for "Included papers" I find 122 for China, and for "Not included papers" 186 for China.
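
The two figures add up to the bar-plot total (122 + 186 = 308). A split of this kind could be sketched as follows, assuming one row per paper with hypothetical `country` and `status` columns; the sample rows are invented.

```python
import pandas as pd

# Hypothetical rows: one per paper, with an inclusion status.
df = pd.DataFrame({
    "country": ["China", "China", "China", "USA", "USA"],
    "status": ["Included", "Not included", "Not included",
               "Included", "Included"],
})

# Cross-tabulate: rows are countries, columns are statuses.
table = pd.crosstab(df["country"], df["status"])

# The per-country total is what a single bar plot over the merged file shows.
table["Total"] = table.sum(axis=1)
print(table)
```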

The script that produces the following maps is available from https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_Country_affiliation_01.ipynb

image

image

Yan-yang35 commented 2 years ago

Following are some issues:

  1. Please update http://oa-icc.ipsl.fr/checking/OAICC_20211012.csv based on the files that were exported from the bibliodatabase on Nov. 10. Please find them in the attachment.
  2. For the map of papers not included, please count only the papers not included for "No answer from authors", to show how many papers' data could not be obtained from the authors.
  3. Please remove the code "b = a[a >= 5]" so that all the countries/regions are shown in the maps.

Export 20211110.zip

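
Points 2 and 3 could be handled together by making both the exclusion reason and the threshold parameters. This is only a sketch: the column names `country` and `reason`, and the reason labels other than "No answer from authors", are hypothetical.

```python
import pandas as pd


def country_counts(df, reason=None, min_papers=1):
    """Count papers per country, optionally restricted to one exclusion reason."""
    if reason is not None:
        df = df[df["reason"] == reason]
    counts = df["country"].value_counts()
    # min_papers=1 keeps every country; 5 would reproduce `b = a[a >= 5]`.
    return counts[counts >= min_papers]


# Invented sample rows for illustration.
df = pd.DataFrame({
    "country": ["Chile", "Chile", "Japan", "Japan", "Peru"],
    "reason": ["No answer from authors", "Data lost",
               "No answer from authors", "No answer from authors",
               "Incomplete"],
})

no_answer = country_counts(df, reason="No answer from authors")
print(no_answer)
```
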
PBrockmann commented 2 years ago

I have updated the notebook, taking into account your issues 2 and 3. I am not sure the maps are clear with all the circles overlaid. Please have a look at https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_Country_affiliation_01.ipynb

image image

Could you please correct your exported files so that they respect the naming and the format previously used:

OAICCdatalost_20211012.utf8
OAICCdb_20211012.utf8
OAICCincomplete_20211012.utf8
OAICCnoanswer_20211012.utf8

The separator is ';'. This is not the case with :

Files must be encoded in utf8. This is not the case. Please try to validate your export with https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/OAICC_ReadExport.ipynb
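
The two requirements (UTF-8 encoding, ';' separator) can be checked before a file is sent. The sketch below is not the content of OAICC_ReadExport.ipynb; the function name `check_export` and the sample header are assumptions.

```python
import csv
import io


def check_export(raw: bytes, expected_delimiter: str = ";") -> list[str]:
    """Return a list of problems found in an exported file (empty if none)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8: {err}"]

    lines = text.splitlines()
    if not lines:
        return ["file is empty"]

    problems = []
    if expected_delimiter not in lines[0]:
        problems.append(f"header does not contain {expected_delimiter!r}")

    # Every non-empty row should have as many fields as the header.
    rows = list(csv.reader(io.StringIO(text), delimiter=expected_delimiter))
    width = len(rows[0])
    for number, row in enumerate(rows[1:], start=2):
        if row and len(row) != width:
            problems.append(
                f"row {number}: expected {width} fields, got {len(row)}"
            )
    return problems


# Hypothetical file contents.
good = "Author;Year;Country\nDoe;2020;France\n".encode("utf-8")
bad_encoding = "Author;Year;Country\nDoé;2020;France\n".encode("latin-1")
bad_separator = "Author,Year,Country\nDoe,2020,France\n".encode("utf-8")

for name, raw in [("good", good), ("bad_encoding", bad_encoding),
                  ("bad_separator", bad_separator)]:
    print(name, check_export(raw))
```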

PBrockmann commented 2 years ago

I also see an issue in the 2nd map where I find

Should be clarified.

Yan-yang35 commented 2 years ago

I am asking Carolina to change "Republic of Korea" and "Korea" into "South Korea" in the biblio. I will send you the latest utf8 when it is done.
The map looks good to me; I will check with Fred about the circle overlap issue. The location of "China-Taiwan" is wrong: the latitude could be 23.802040 and the longitude could be 120.991650.

PBrockmann commented 2 years ago

Please correct this kind of information in your file. It is not convenient to fix this manually.

"China-Taiwan"? I do not see this label. I have "China, Taiwan" and the geopy module is able to locate it correctly.

Yan-yang35 commented 2 years ago

Sorry, what I meant is "China, Taiwan". The following figure shows its right location; could you please check why the geopy module locates it in the wrong place? It needs to be revised, thanks. image

PBrockmann commented 2 years ago

"China, Taiwan" does give the position you indicated, which is indeed incorrect. But "Taiwan" gives a correct position.

I can correct these incorrect names manually, but standard names should have been used to avoid such classic problems, or better, the ISO 3166 alpha-3 codes. Could you correct this on your side by using ISO 3166 alpha-3? Please refer to https://www.iso.org/iso-3166-country-codes.html for the codes. I think you understand that I would like to avoid manual corrections and base the production of plots on standard names from international conventions that are well suited for that.
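
To illustrate why a code is a safer key than a free-text name, here is a minimal sketch that sums counts under alpha-3 codes. The `ALPHA3` table is a small hand-written excerpt for labels seen in this thread, not a complete mapping, and the function name is hypothetical. The sample counts are taken from the list earlier in the thread.

```python
from collections import Counter

# Hand-written excerpt of ISO 3166-1 alpha-3 codes for labels seen here.
ALPHA3 = {
    "South Korea": "KOR",
    "Republic of Korea": "KOR",
    "Korea": "KOR",
    "China, Taiwan": "TWN",
    "UK (Bermuda)": "BMU",
    "China, Hong Kong": "HKG",
}


def counts_by_alpha3(counts_by_label):
    """Sum per-label counts under their alpha-3 code.

    Labels missing from ALPHA3 are kept unchanged so nothing is dropped.
    """
    totals = Counter()
    for label, n in counts_by_label.items():
        totals[ALPHA3.get(label, label)] += n
    return totals


totals = counts_by_alpha3({"South Korea": 34, "Korea": 2, "China, Taiwan": 8})
print(totals)
```

With a code as the key, "South Korea" and "Korea" fall into the same bucket instead of producing two circles.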

Here is a correction https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_Country_affiliation_01.ipynb image
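
A manual correction of this kind could look like the sketch below, which is not necessarily what the notebook does. Labels are translated through a small alias table before being passed to a geocoder. The geocoder is passed in as a plain callable so the example runs offline; `fake_geocode` is a stand-in that returns the coordinates quoted in this thread, and all names here are hypothetical.

```python
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)

# Hypothetical alias table: database label -> name a geocoder resolves well.
GEOCODER_NAME = {
    "China, Taiwan": "Taiwan",
    "UK (Bermuda)": "Bermuda",
}


def locate(
    label: str,
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Look up coordinates for a database label, applying aliases first."""
    return geocode(GEOCODER_NAME.get(label, label))


def fake_geocode(name: str) -> Optional[Coordinates]:
    """Offline stand-in for a real geocoder such as the geopy module."""
    known = {
        "Taiwan": (23.802040, 120.991650),
        "Bermuda": (32.299507, -64.790337),
    }
    return known.get(name)


print(locate("China, Taiwan", fake_geocode))
print(locate("UK (Bermuda)", fake_geocode))
```

The alias table still has to be maintained by hand, which is the drawback raised above.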

Yan-yang35 commented 2 years ago

Thanks Patrick for your suggestion! I am checking with Carolina, who maintains the OA-ICC bibliodatabase, whether it is OK to change the country names into ISO 3166 alpha-3. I will keep you updated when she replies.

Yan-yang35 commented 2 years ago

I have discussed ISO 3166 alpha-3 with Fred and Carolina. We agree this is a good idea, but it cannot be done in a short time, because it would be a lot of work for Carolina to make the changes in the whole bibliodatabase and she will be on vacation from 11 Dec to 18 Jan. We are going to submit the ESSD paper before Christmas, so for the moment we still need your help to make the corrections manually. But we will keep this in mind.

About the overlap issue, Fred suggested:

  1. Delete the map showing papers with no answers; he doesn't want to offend the countries that have not contributed to this database.
  2. In the world map showing papers included, show the total number of European papers as one circle; there is no need to show a circle for each European country.
  3. Make a map of Europe showing the circles of papers included for each European country. In this way, we can enlarge the area of Europe and avoid overlapping circles there.
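
Suggestions 2 and 3 amount to splitting the per-country counts into two data sets. This is a sketch under assumptions: the `EUROPE` set is a hypothetical, incomplete list of labels, and the sample counts are borrowed from the earlier all-papers list purely for illustration.

```python
# Hypothetical, incomplete set of labels treated as European.
EUROPE = {"Germany", "UK", "Portugal", "Spain", "Italy", "France"}


def split_for_maps(counts):
    """Return (world, europe) count dictionaries.

    `world` replaces all European countries by one "Europe" entry;
    `europe` keeps the individual European countries for the zoomed map.
    """
    world = {c: n for c, n in counts.items() if c not in EUROPE}
    europe = {c: n for c, n in counts.items() if c in EUROPE}
    if europe:
        world["Europe"] = sum(europe.values())
    return world, europe


world, europe = split_for_maps({"USA": 573, "Germany": 278, "France": 82})
print(world)
print(europe)
```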

Attached please find the latest export files. Please kindly update the map of papers included based on "OAICCdb_20211203". Export 20211208.zip

PBrockmann commented 2 years ago

Notebook updated. https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_Country_affiliation_01.ipynb

1) Just do not include the map 2) Done 3) Done

image

image

Yan-yang35 commented 2 years ago

Thanks a lot! Only one issue: the circle for Turkey is missing in the Europe map. Could you add it? Many thanks.

PBrockmann commented 2 years ago

Just had to change the limits of the map. image
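
One way to avoid this kind of omission is to derive the map limits from the points to be plotted instead of hard-coding them. The sketch below is independent of any plotting library, since the one used in the notebook is not stated here. The coordinates are rough illustrative values and `old_extent` is a hypothetical earlier setting.

```python
def extent_for(points, margin=2.0):
    """Return (lon_min, lon_max, lat_min, lat_max) enclosing all points.

    `points` is an iterable of (latitude, longitude) pairs in degrees.
    """
    points = list(points)
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (min(lons) - margin, max(lons) + margin,
            min(lats) - margin, max(lats) + margin)


def inside(extent, point):
    """Tell whether a (latitude, longitude) point falls within an extent."""
    lon_min, lon_max, lat_min, lat_max = extent
    lat, lon = point
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


# Rough, illustrative coordinates (latitude, longitude).
spain, finland, turkey = (40.0, -4.0), (64.0, 26.0), (39.0, 35.0)

old_extent = (-12.0, 30.0, 34.0, 72.0)  # hypothetical fixed limits
new_extent = extent_for([spain, finland, turkey])

print(inside(old_extent, turkey), inside(new_extent, turkey))
```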

Yan-yang35 commented 2 years ago

Great, thanks! I just found that some references are missing in the OAICCdb file; I am asking Carolina to update this file. I will send you the updated version later.

Yan-yang35 commented 2 years ago

Unfortunately Carolina is not able to update the OAICCdb file now. She can only do it when she comes back from holiday on 17 Jan. I manually added 12 "included" papers to the attached CSV file, which you produced from the exported files. Could you please update the maps and the figure of the cumulative number of papers based on this latest version of the CSV file? Thanks a lot! OAICC_20211210.csv

PBrockmann commented 2 years ago

I do not have time today... Could you please check the encoding of your file? It should be encoded with 'utf_8'. For now, reading it raises an error: UnicodeDecodeError: 'utf-8' codec can't decode bytes ...

Try reading this file with:

import pandas as pd
df = pd.read_csv('OAICC_20211210.csv', encoding='utf_8')

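
If the read fails, the file can be re-encoded before it is sent again. This is a minimal sketch that works on raw bytes. The fallback encoding `cp1252` is a guess, since the thread does not say which encoding the file was actually saved in, and the sample row is invented.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return `raw` as UTF-8 bytes, decoding with `fallback` if needed."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # Assumes the file was written in the fallback encoding.
        return raw.decode(fallback).encode("utf-8")


# Invented sample row saved in a legacy encoding.
legacy = "Günther;2019;Germany\n".encode("cp1252")
fixed = to_utf8(legacy)
print(fixed.decode("utf-8"))
```

In practice the bytes would be read with `open(path, "rb")` and the result written back with `open(path, "wb")`.
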
Yan-yang35 commented 2 years ago

OAICC_20211210.csv

I changed the file's encoding to 'utf_8'. Please kindly try again, thanks.

PBrockmann commented 2 years ago

Done from your file. Do I have to expose it from oa-icc.ipsl.fr? Does it replace the export.zip files (4 files) you produced before?

image

image