Extracted address is truncated when it contains "&"

Hi, this is really a great package and I'm using it for a research project now. I am trying to understand the distribution of biology-related institutes around the world, so accurate addresses are really important. For most addresses the package does a fantastic job, but it runs into trouble when an address contains "&", such as in "Department of biology & chemistry." In that case, the function article_to_df returns something like "Department of biology &amp." The same problem appears for journal names. I was wondering whether there is any way to fix this. I checked the code and it seems that the problem is caused by the function trim_address. Thanks in advance!

Hello YeWang,

the goal of easyPubMed is to retrieve records as is. The issue you observed is not due to easyPubMed, but to the PubMed data. Indeed, &amp; is an HTML character entity that stands for &. In a browser, you won't see &amp; because it is automatically rendered as &. Your R console does not work like a browser, hence the issue you described. Please see this link for more info: https://www.w3schools.com/html/html_entities.asp

Apparently, some PubMed records include HTML character entities rather than the corresponding characters (for example &amp;, as you pointed out). In the future I may consider upgrading easyPubMed to take care of this issue, but at the moment I prefer the idea of retrieving a record as is. That said, you can easily fix this issue using gsub() to perform regular expression or fixed match substitutions in your strings. An example is shown below.

# custom f(x)
replace_html <- function(x, html_dict) {

    # expected types
    # x, a character vector
    # html_dict a data frame with at least two columns
    # named html and replacement
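    # loop over the dictionary rows; fixed = TRUE matches each
    # entity literally instead of treating it as a regular expression
    for(i in seq_len(nrow(html_dict))){
        x <- gsub(pattern = html_dict$html[i], 
                  replacement = html_dict$replacement[i], 
                  x = x, fixed = TRUE)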

    }
    return(x)
}

# custom HTML conversion dictionary (data.frame)
html_dict <- data.frame(html = c("&nbsp;", "&lt;", "&gt;", 
                                 "&amp;", "&quot;", "&apos;"), 
           replacement = c(" ", "<", ">", "&", '"',"'" ), 
           stringsAsFactors = FALSE)

# Some strings including &amp; and co
x <- c("Hello world", "I'd like some ice cream and pizza", 
       "I&apos;d love to sing", "Peanut butter &amp; jelly", 
       "I&apos;ll be here &amp; there &amp; over there")

# Convert
y <- replace_html(x = x, html_dict = html_dict)

# Show before and after
data.frame(before = x, after = y)
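
The same idea can be applied to whole columns of a table of records. The following is only a minimal sketch: the helper decode_entities and the mock data frame with address and journal columns are assumptions made for illustration and are not part of easyPubMed. It replaces "&amp;" last, so that a string such as "&amp;lt;" is decoded only once.

```r
# Hypothetical helper (not part of easyPubMed): decode a few common
# HTML entities in a character vector using literal (fixed) matching.
decode_entities <- function(x) {
    # "&amp;" is handled last so that text such as "&amp;lt;" is
    # decoded only once (to "&lt;") rather than twice (to "<")
    entities <- c("&nbsp;" = " ", "&lt;" = "<", "&gt;" = ">",
                  "&quot;" = '"', "&apos;" = "'", "&amp;" = "&")
    for (ent in names(entities)) {
        x <- gsub(pattern = ent, replacement = entities[[ent]],
                  x = x, fixed = TRUE)
    }
    x
}

# Mock records; the column names are assumptions for illustration
records <- data.frame(
    address = c("Department of Biology &amp; Chemistry",
                "Institute of Physics"),
    journal = c("Cell &amp; Bioscience", "Nature"),
    stringsAsFactors = FALSE)

# Decode only the text columns of interest, leaving the rest untouched
text_cols <- c("address", "journal")
records[text_cols] <- lapply(records[text_cols], decode_entities)
records
```

Restricting the conversion to selected columns keeps every other field exactly as it was retrieved.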

Hope this helps. Best regards. D.

Great. This is helpful. Thanks a lot!

dami82 / easyPubMed

Extracted address is truncated when it contains "&" #7