brieaspasia / mp-diagnostics

Initial analysis of marine pollution research

mp-mapping #5

Open brieaspasia opened 4 years ago

brieaspasia commented 4 years ago

Create a spatial map of global mp research using the tmap library.

Data: I have isolated the MostProdCountries df from the biblioAnalysis summary and will use the Articles field for symbology.

mlagisz commented 4 years ago

S <- summary(object = results, k = 50, pause = FALSE) # set k to 200 to get all countries
MostProdCountries <- S$MostProdCountries

MostProdCountries <- print(S$MostProdCountries)

MostProdCountries$Articles <- as.numeric(MostProdCountries$Articles)
MostProdCountries$Country <- gsub("^\\s+|\\s+$", "", MostProdCountries$Country) # strip leading and trailing white space
str(MostProdCountries)
MostProdCountries$Country <- as.character(MostProdCountries$Country)

library(tmap)
data("World") # assumption: World is the example data set shipped with tmap
str(World)
Country <- toupper(as.character(World$name))
World <- cbind(Country, World)
World$Country <- as.character(World$Country)

intersect(MostProdCountries$Country, World$Country)
setdiff(MostProdCountries$Country, World$Country) # these were not matched to countries in the World data set
MostProdCountries$Country <- gsub("USA", "UNITED STATES", MostProdCountries$Country)
MostProdCountries$Country <- gsub("HONG KONG", "CHINA", MostProdCountries$Country)

MostProdCountries$Country <- gsub("BAHRAIN", "", MostProdCountries$Country)

MostProdCountries$Country <- gsub("MONACO", "", MostProdCountries$Country)

MostProdCountries$Country <- gsub("BARBADOS", "", MostProdCountries$Country)

MostProdCountries_World <- dplyr::left_join(MostProdCountries, World, by = "Country") # merge data from the World data frame onto matching records in MostProdCountries
str(MostProdCountries_World)
head(MostProdCountries_World)

mlagisz commented 4 years ago

Please modify the code above. You can run this for a larger number of top countries (I did 50). I think setting the initial number to 200 will cover all countries. You need to use the left_join function rather than full_join, and it helps to have a "Country" column in both data frames being merged: the matching country names in this column are used to merge info from the World data frame onto the list of most productive countries from bibliometrix.
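A minimal sketch of the difference between the two joins, using small made-up tables rather than the real bibliometrix and tmap data. The data frames `counts` and `world` and all their values are hypothetical stand-ins for MostProdCountries and World.

```r
library(dplyr)

# Hypothetical stand-ins for MostProdCountries and World (values are invented)
counts <- data.frame(
  Country  = c("AUSTRALIA", "CHINA", "MONACO"),
  Articles = c(120, 300, 2),
  stringsAsFactors = FALSE
)
world <- data.frame(
  Country   = c("AUSTRALIA", "CHINA", "BRAZIL"),
  continent = c("Oceania", "Asia", "South America"),
  stringsAsFactors = FALSE
)

# left_join keeps every row of the first table and adds matching columns
# from the second; MONACO has no match, so its continent is NA
left <- left_join(counts, world, by = "Country")

# full_join keeps rows from both tables, so BRAZIL also appears,
# with NA in the Articles column
full <- full_join(counts, world, by = "Country")

nrow(left)
nrow(full)
```

Here the left join has one row per country in `counts`, while the full join also carries the unmatched row from `world`.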

brieaspasia commented 4 years ago

I've added the code and it worked perfectly to create the MostProdCountries_World df. However, it throws an error message when I try to map it. I'm wondering if it's because there are now duplicate rows for China and USA (since I've renamed Hong Kong and Guam), which might be messing with the spatial data? See rows 96-99: the code creates a map for the original World file, but the code for MostProdCountries_World throws the error in the screenshot below.

[Screenshot of the error message: Screen Shot 2020-07-17 at 11.05.11 AM]

mlagisz commented 4 years ago

Hi, you can test it by removing all rows that contain China and USA, and then trying to plot. If it works, then indeed it is the duplication that causes the problem. You can resolve this by calculating the sums of the numeric fields for the duplicated rows (basically adding the counts and creating a single row for each country).

There are some good examples of using the dplyr functions group_by and summarise here: https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group
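The suggestion above (one row per country, with counts added up) could look roughly like this with dplyr. This is a sketch on invented numbers: `counts` is a hypothetical stand-in for MostProdCountries after the renaming step, and only the Articles column is summed.

```r
library(dplyr)

# Hypothetical table after renaming: HONG KONG became CHINA, so CHINA appears twice
counts <- data.frame(
  Country  = c("CHINA", "CHINA", "UNITED STATES", "AUSTRALIA"),
  Articles = c(300, 12, 450, 120),
  stringsAsFactors = FALSE
)

# Collapse duplicated countries into a single row by summing their counts
collapsed <- counts %>%
  group_by(Country) %>%
  summarise(Articles = sum(Articles)) %>%
  ungroup()

collapsed
```

Any other numeric columns would need the same treatment inside summarise; ratio-type columns are better recomputed from the summed counts than summed directly.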

brieaspasia commented 4 years ago

I solved the duplicate rows using the aggregate function, and the join was solved by switching the order: joining MostProdCountries to the World file instead of the other way around, which lost the spatial data for the unmatched countries. I still have a small issue between lines 110 and 113, where the sum function isn't working properly.
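For reference, the two steps described in this comment can be sketched in base R as below. The tables and values are invented, `world` is a plain data frame standing in for the spatial World object, and base `merge` is swapped in for the join so that the example needs no packages. With the real sf object, `dplyr::left_join(World, collapsed, by = "Country")` is assumed to play the same role while keeping the geometry.

```r
# Hypothetical stand-ins for MostProdCountries and World (values are invented)
counts <- data.frame(
  Country  = c("CHINA", "CHINA", "AUSTRALIA"),
  Articles = c(300, 12, 120),
  stringsAsFactors = FALSE
)
world <- data.frame(
  Country   = c("AUSTRALIA", "BRAZIL", "CHINA"),
  continent = c("Oceania", "South America", "Asia"),
  stringsAsFactors = FALSE
)

# Step 1: sum the counts of duplicated countries with aggregate()
collapsed <- aggregate(Articles ~ Country, data = counts, FUN = sum)

# Step 2: join the counts onto the world table (world is the left table),
# so every world row survives and countries without articles get NA
world_counts <- merge(world, collapsed, by = "Country", all.x = TRUE)

world_counts
```

Keeping the world table on the left means countries with no publications still have a row to draw, just with a missing count.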

brieaspasia commented 4 years ago

There may be data missing somewhere -- when I sum the articles_prop column (lines 107-108) the total is only 76.6, which means that nearly 25% of the data is missing... Some of this is due to countries that didn't match the spatial data (lines 71-81), but that doesn't account for 25%. I'm not sure how to troubleshoot this.
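One way to start troubleshooting this kind of gap is to measure how many articles belong to countries that never matched the spatial data. The sketch below uses invented numbers; `counts` and `world_names` are hypothetical stand-ins for MostProdCountries and the Country column of World.

```r
# Hypothetical article counts and the country names available in the spatial data
counts <- data.frame(
  Country  = c("UNITED STATES", "CHINA", "MONACO", "BARBADOS"),
  Articles = c(450, 312, 2, 1),
  stringsAsFactors = FALSE
)
world_names <- c("UNITED STATES", "CHINA", "AUSTRALIA")

# Countries present in the counts but absent from the spatial data
unmatched <- setdiff(counts$Country, world_names)

# Articles attached to those unmatched countries, and their share of the total
lost_articles <- sum(counts$Articles[counts$Country %in% unmatched])
share_lost <- lost_articles / sum(counts$Articles)

unmatched
share_lost
```

If the share lost to unmatched names is much smaller than the gap, the denominator used for the proportion is the next thing to check, along with whether k in summary() was large enough to include every country.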

mlagisz commented 4 years ago

@brieaspasia, I currently cannot see your most up-to-date code version on GitHub (there are only 87 lines there at the moment), so I cannot test your issue. Let me know when you manage to push your code to GitHub, or send me your file via email.

mlagisz commented 4 years ago

You could also attach the code file to a comment in this issue.

mlagisz commented 4 years ago

@brieaspasia I ran this on scopus2015-2017.bib. I think the issue with the proportions not matching might have been due to giving it an explicit number of total articles, which could have been wrong. I replaced it with code that calculates the sum of the articles in the data frame, and it seems to work OK for me. Also, use proportion, not percentage. The map looks good :-)
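The fix described here, deriving the total from the data instead of typing it in, might look like the sketch below. The values are invented, and the column name articles_prop is taken from the earlier comment; the actual script may differ.

```r
# Hypothetical article counts per country (values are invented)
counts <- data.frame(
  Country  = c("UNITED STATES", "CHINA", "AUSTRALIA"),
  Articles = c(450, 312, 120),
  stringsAsFactors = FALSE
)

# Compute the total from the data frame itself rather than hard-coding it
total_articles <- sum(counts$Articles, na.rm = TRUE)

# Proportion (0 to 1) of all articles contributed by each country
counts$articles_prop <- counts$Articles / total_articles

sum(counts$articles_prop)
```

Because the denominator comes from the same column as the numerators, the proportions add up to 1 whatever subset of countries ends up in the table.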