andreaspacher / openeditors

Webscraping data about editors of scientific journals.
https://openeditors.ooir.org/
Creative Commons Zero v1.0 Universal

Non-ASCII character support #1

Open · yannnic opened this issue 3 years ago

yannnic commented 3 years ago

Hi, bravo for this great initiative! I suppose you already know that non-ASCII strings are not well supported in your data. They seem to be filtered out of the strings: erased or replaced. Examples from this search: https://openeditors.ooir.org/index.php?editor_query=Nantes

- Journal title: 'Archives de Pdiatrie' should be 'Archives de Pédiatrie' > character erased
- University name: 'Universit de Nantes; Nantes, France' should be 'Université de Nantes; Nantes, France' > character erased
- Editor name: 'Francois Galgani' should be 'François Galgani' > character 'ç' replaced by 'c'

If all characters could be preserved in Unicode, it would be perfect!

andreaspacher commented 3 years ago

Hello,

thank you for pointing this out.

It seems to be an encoding issue for which I cannot find a quick fix, but I will keep trying.

Just to make sure:

Anyway, I will continue looking for solutions; thank you again for pointing it out.

yannnic commented 3 years ago

Thanks a lot for your reply and your best efforts! Yann

bmkramer commented 3 years ago

@andreaspacher I ran into the same issue when working with the csv-files.

Thinking about a solution: could you perhaps try to specify the encoding as UTF-8 when writing the data to csv with write.csv?

As such:

```r
write.csv(df, "Output/editors.csv", fileEncoding = "UTF-8")
```

Also, when you read the current file(s) back into R on your system, do the special characters display correctly for you? E.g. does

```r
df <- read.csv("Output/editors1.csv")
print(df$affiliation[1])
```

result in "Children<U+0092>s Health, Dallas, United States" or "Children’s Health, Dallas, United States"?
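In case it helps, here is a minimal round-trip sketch of what I mean (assuming the files are intended to be UTF-8; the paths are the repo's):

```r
# Read and write with an explicit encoding so R does not fall back to the
# platform's native locale (e.g. Latin-1/CP1252 on Windows).
df <- read.csv("Output/editors1.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
print(df$affiliation[1])  # ideally "Children’s Health, Dallas, United States"

write.csv(df, "Output/editors.csv", fileEncoding = "UTF-8", row.names = FALSE)
```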

Happy to try and help troubleshoot this further, as your data is super useful!

andreaspacher commented 3 years ago

In the CSV-files, most of the encoding problems should be fixed now (with a few exceptions, e.g. some Chinese characters - I will look into these last few issues soon, too).

I added the fix for the wrongful hex-codes in d1fb71b68396c528d7f1b63ac961cfdfe5e4b059, and for most of the wrongful unicodes in e2448c177f97e20c28e95bfbe9d0657e432f6596.

I resorted to a rather manual cleaning, as iconv() or other approaches (e.g. from the stringi library) did not work. The whole encoding was probably already messed up during the scraping (?).
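For the record, this is roughly what I tried (the damaged string is illustrative; neither call can restore a character that was already dropped during scraping):

```r
# Once the accented byte has been dropped, re-encoding is a no-op:
x <- "Universit de Nantes"               # already-damaged string from the scrape
iconv(x, from = "latin1", to = "UTF-8")  # returns "Universit de Nantes" unchanged
stringi::stri_enc_toutf8(x)              # likewise unchanged
```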

Perhaps the fact that your code, @bmkramer, resulted in "Children<U+0092>s Health, Dallas, United States" indicates that there was too much "mojibake" to be fixed through automated means (if I am not mistaken - I am still a newbie with these matters).
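(Though if I read the encoding tables correctly, 0x92 is the right single quotation mark in Windows-1252, so a stray byte like that could in principle be reinterpreted as CP1252 - just a sketch, not something I have applied to the dataset:)

```r
# 0x92 is "’" (U+2019) in Windows-1252; treating the stray byte as CP1252
# recovers the intended apostrophe.
iconv("\x92", from = "windows-1252", to = "UTF-8")  # "’"
```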

And thank you, @bmkramer, for your suggestion regarding an explicit reading/writing of CSV-files in UTF-8. This is certainly helpful for the future - I integrated it (e.g. in 5bf111e7e61123e2924c6d3c34be25fd6d294e02).

As regards the online version at https://openeditors.ooir.org, I will correct the data in a few days.

andreaspacher commented 3 years ago

I fixed most of the issues in both the CSV and the online-web version.

A few unicodes that I could not properly identify remained in the dataset; the same applies to names in Chinese characters, of which there were a few (though these usually come with pinyin transcriptions anyway). Most of them appear in the journals Bamboo and Silk, The China Nonprofit Review, and Rural China (all at Brill), as well as in some of the Frontiers journals.


As a note to myself, here is the code I used (as an example) to fix the wrongful hex-codes in the web version (in MySQL):

```r
library(DBI)       # dbConnect(), dbExecute()
library(RMariaDB)  # MariaDB() driver

# Connect to the production database (credentials redacted).
dbcon <- dbConnect(MariaDB(), user = "AAXYZ", password = "AAXYZ", dbname = "AAXYZ", host = "AAXYZ")

# Lookup table pairing each literal "<hex>" placeholder left by the broken
# scrape with the Latin-1 character it stands for ("SHY" = soft hyphen).
ascii <- structure(list(Hex = c("<a0>", "<a1>", "<a2>", "<a3>", "<a4>", 
                                "<a5>", "<a6>", "<a7>", "<a8>", "<a9>", "<aa>", "<ab>", "<ac>", 
                                "<ad>", "<ae>", "<af>", "<b0>", "<b1>", "<b2>", "<b3>", "<b4>", 
                                "<b5>", "<b6>", "<b7>", "<b8>", "<b9>", "<ba>", "<bb>", "<bc>", 
                                "<bd>", "<be>", "<bf>", "<c0>", "<c1>", "<c2>", "<c3>", "<c4>", 
                                "<c5>", "<c6>", "<c7>", "<c8>", "<c9>", "<ca>", "<cb>", "<cc>", 
                                "<cd>", "<ce>", "<cf>", "<d0>", "<d1>", "<d2>", "<d3>", "<d4>", 
                                "<d5>", "<d6>", "<d7>", "<d8>", "<d9>", "<da>", "<db>", "<dc>", 
                                "<dd>", "<de>", "<df>", "<e0>", "<e1>", "<e2>", "<e3>", "<e4>", 
                                "<e5>", "<e6>", "<e7>", "<e8>", "<e9>", "<ea>", "<eb>", "<ec>", 
                                "<ed>", "<ee>", "<ef>", "<f0>", "<f1>", "<f2>", "<f3>", "<f4>", 
                                "<f5>", "<f6>", "<f7>", "<f8>", "<f9>", "<fa>", "<fb>", "<fc>", 
                                "<fd>", "<fe>", "<ff>"), Actual = c(" ", "¡", "¢", "£", "¤", 
                                                                    "¥", "¦", "§", "¨", "©", "ª", "«", "¬", "SHY", "®", "¯", "°", 
                                                                    "±", "²", "³", "´", "µ", "¶", "·", "¸", "¹", "º", "»", "¼", "½", 
                                                                    "¾", "¿", "À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", 
                                                                    "Ë", "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", 
                                                                    "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", 
                                                                    "å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", 
                                                                    "ò", "ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", 
                                                                    "ÿ")), row.names = c(NA, -96L), class = "data.frame")

# Run one UPDATE per placeholder, rewriting all four text columns at once.
for (i in 1:nrow(ascii)) {
  QUERY <- paste0("
  UPDATE openeditors SET
    journal = REPLACE(journal, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    editor = REPLACE(editor, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    role = REPLACE(role, '", ascii$Hex[i], "', '", ascii$Actual[i], "'),
    affiliation = REPLACE(affiliation, '", ascii$Hex[i], "', '", ascii$Actual[i], "');
  ")
  print(QUERY)

  dbExecute(dbcon, QUERY)

  Sys.sleep(3)  # brief pause between statements to keep the load low
}
```

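(Another note to myself: interpolating the characters directly into the SQL only stays safe because none of the replacement characters contain a single quote. For arbitrary replacements, a parameterized statement - e.g. DBI's dbExecute(dbcon, statement, params = list(...)) - would be the more robust route.)
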
bmkramer commented 3 years ago

Thanks @andreaspacher for fixing the encoding issues! Unfortunately, something apparently still happens along the way that causes the CSVs to open with the unicode/ASCII codes on my system [no idea why...], but the code you included makes it easy to redo the fixes and proceed :-)

I used this in af88e49905ae4ac2a200cd6a183e739157e40f18 as part of a workflow to match editor affiliations to ROR IDs.

jeroenbaas commented 3 years ago

There seems to be something going on with encoding detection upstream. For instance, the title Otolaryngology<U+0096>Head and Neck Surgery is spelled "Otolaryngology–Head and Neck Surgery" on the website, but that dash is not \u0096 in UTF-8. It looks like these encodings originate from the input journal list, so perhaps they should be flagged on https://github.com/andreaspacher/academic-publishers instead. See for instance: https://raw.githubusercontent.com/andreaspacher/academic-publishers/main/Output/alljournals-2021-02-05.csv

It may actually stem from the Scopus reader, as that loads an xlsx file with Latin-1 encoding rather than UTF-8 (although I don't see the em-dash in the Scopus list for this title, only the short ASCII dash). It is hard to tell where exactly it comes from, as the publishers repo doesn't store the individual CSV outputs, only the final merges.
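For what it's worth, 0x96 is the en dash in Windows-1252, which matches the title as displayed on the website. A quick sketch to check both points in R (stri_read_raw and stri_enc_detect are from the stringi package; the file name refers to the linked CSV):

```r
# 0x96 decodes to "–" (U+2013, en dash) when read as Windows-1252:
iconv("\x96", from = "windows-1252", to = "UTF-8")

# Rough encoding guess for the upstream file (ranked candidates with confidence):
raw_bytes <- stringi::stri_read_raw("alljournals-2021-02-05.csv")
stringi::stri_enc_detect(raw_bytes)
```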