dami82 / easyPubMed

easyPubMed package for R - dev version

Extracted address is truncated when it contains "&" #7

Closed: YeWang1576 closed this issue 3 years ago

YeWang1576 commented 3 years ago

Hi, this is really a great package and I'm using it for a research project now. I'm trying to understand the distribution of biology-related institutes around the world, so accurate addresses are really important. For most addresses the package does a fantastic job, but it runs into trouble when an address contains "&", as in "Department of biology & chemistry." In that case, the function article_to_df returns something like "Department of biology &amp." The same problem appears in journal names. Is there any way to fix this? I checked the code, and the problem seems to be caused by the function trim_address. Thanks in advance!

dami82 commented 3 years ago

Hello YeWang,

The goal of easyPubMed is to retrieve records as is. The issue you observed is not due to easyPubMed, but to the PubMed data. Indeed, &amp; is an HTML character entity that stands for &. In a browser, you won't see &amp; because it is automatically rendered as &. But your R console does not work like a browser, hence the issue you described. Please see this link for more info: https://www.w3schools.com/html/html_entities.asp

Apparently, some PubMed records include HTML character entities rather than the corresponding characters (for example &amp;, as you pointed out). In the future I may consider upgrading easyPubMed to take care of this issue, but at the moment I prefer the idea of retrieving a record as is. That said, you can easily fix this issue using gsub() to perform regular expression or fixed match substitutions in your strings. An example is shown below.

# custom f(x)
replace_html <- function(x, html_dict) {

    # expected types
    # x, a character vector
    # html_dict a data frame with at least two columns
    # named html and replacement
    for(i in seq_len(nrow(html_dict))){
        x <- gsub(pattern = html_dict$html[i], 
                  replacement = html_dict$replacement[i], 
                  x = x)

    }
    return(x)
}

# custom HTML conversion dictionary (data.frame)
html_dict <- data.frame(html = c("&nbsp;", "&lt;", "&gt;", 
                                 "&amp;", "&quot;", "&apos;"), 
           replacement = c(" ", "<", ">", "&", '"',"'" ), 
           stringsAsFactors = FALSE)

# Some strings including &amp; and co
x <- c("Hello world", "I'd like some ice cream and pizza", 
       "I&apos;d love to sing", "Peanut butter &amp; jelly", 
       "I&apos;ll be here &amp; there &amp; over there")

# Convert
y <- replace_html(x = x, html_dict = html_dict)

# Show before and after
data.frame(before = x, after = y)
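
As a variation on the example above, here is a minimal sketch of how the same idea could be applied to whole columns of a data frame, such as the journal and address fields. The helper name decode_entities and the sample records are made up for illustration and are not part of easyPubMed. The sketch uses fixed = TRUE so each entity is matched literally, and it replaces &amp; last, so that a string like "&amp;lt;" is decoded once to "&lt;" rather than twice to "<".

```r
# Hypothetical helper (not part of easyPubMed): decode a few common
# HTML character entities in a character vector.
decode_entities <- function(x) {
    # Names are the entities, values are their replacements.
    # "&amp;" is deliberately last to avoid decoding a string twice.
    entities <- c("&nbsp;" = " ", "&lt;" = "<", "&gt;" = ">",
                  "&quot;" = "\"", "&apos;" = "'", "&amp;" = "&")
    for (ent in names(entities)) {
        # fixed = TRUE treats the pattern as a literal string, not a regex
        x <- gsub(pattern = ent, replacement = entities[[ent]],
                  x = x, fixed = TRUE)
    }
    x
}

# Made-up records standing in for a data frame of extracted fields
records <- data.frame(
    journal = c("Biology &amp; Chemistry Letters", "Nature"),
    address = c("Department of Biology &amp; Chemistry, Example University",
                "Institute of Genetics"),
    stringsAsFactors = FALSE)

# Decode only the text columns of interest, leaving other columns untouched
text_cols <- c("journal", "address")
records[text_cols] <- lapply(records[text_cols], decode_entities)

records
```

Note that this only decodes entities that are still present in the strings; it cannot restore text that was already cut off during extraction.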

Hope this helps. Best regards. D.

YeWang1576 commented 3 years ago

Great. This is helpful. Thanks a lot!