Need to address reviewers' concerns

anhtr commented 5 years ago

Reviewer 1: HPAnalyze package by Tran et al.

[x] I had trouble installing using the installer script provided. By a stroke of determined try, we removed the phrase "Version 3.9" then the installation worked! This needs to be checked. The package is working fine and functions are working. I have only minor comments.
[x] In the data under $Pathology, it is recommended that the units of the numbers be displayed as well such as FPKM values??

Reviewer 2: Review: HPAanalyze: An R Package that Facilitates the Retrieval and Analysis of The Human Protein Atlas Data

Authors: Anh Nhat Tran, Alex M. Dussaq, Timothy Kennell, Jr., Christopher D. Willey, Anita B. Hjelmeland

Review

Anh Nhat Tran et al. describe HPanalyze, an R package for downloading and facilitating analysis of data from the Human Protein Atlas (HPA). The package was easy to use and has some nice built in visualization functions for displaying information of broad interest from the HPA, making it likely useful to users with some R experience. Since the authors state in the "HPAanalyize Overview" section that it aims to serve users with little programming experience, there is some room to improve the usability of the package and clarity of the text in that regard.

Comments

[x] There are many instances of missing references that are labeled as "Error! Reference source not found."
[x] Page 7, line 39: After subsetting histology data for entries matching the gene MKI67 and breast tissue, the authors state: "These data are congruent with the expected pattern of Ki67 expression based on the literature and validate our approach." Comparing the displayed results with the HPA downloads shows that the hpaSubset command does appropriately subset their data, but the correctness of the localization of MKI67 and its expression in breast tissue has nothing to do with your package. This results from the methodology of the HPA.
[x] Currently the hpaExport command only supports exporting in Excel format. Having it support CSV and JSON as export formats would greatly enhance the compatibility of this package with downstream pipelines outside of R. As is, users would likely need to export using an alternate method or convert xlsx into their desired format after exporting which makes it less likely this function would be used.
[x] What is the evidence requirement for "Detected" status when using the hpaVisSubcell() function? Is it enhanced, supported, approved or uncertain? It would be nice to have an argument that the user could specify to control this, for example if I want anything with at least the approved status to be considered as "Detected", I could run hpaVisSubcell(data, status="approved"). I realize I could filter the data before passing it to this function but if this package is meant to be used by people with little programming experience, this would likely be very helpful for them. This issue becomes apparent in case study 2 when the authors state PTEN is present in the cytosol and nucleoplasm, but at what evidence level? According to the HPA it is with "supported" status. If this is the default I think it needs to be very explicit, otherwise people might assume it was the validated/enhanced status.
[x] The case studies presented in the manuscript are nice and very interesting, but since the authors are trying to demonstrate the utility of their package, it is critical that they show the exact commands they are using to subset the HPA data and generate their images. For example, I could reproduce figure 3A by entering the commands. It was nice to know that I understood how to generate the figure, but I think most people would find it more helpful to have it explicitly spelled out in the text somewhere.
```
data = hpaDownload()
case1Genes = c("GFAP", "EGFR", "PDGFRA", "PIK3CA", "PTEN", "BRAF", "MDM2", "MDM4", "CDK4")
case1Tissues = c("hippocampus", "cerebral cortex")
hpaVisTissue(data, targetGene = case1Genes, targetTissue = case1Tissues)
```
[x] In case study 3, the authors state "HPAanalyze indicated that GCH1 was not expressed in normal brain (cerebellum, cortex, hippocampus or lateral ventricle/caudate) except for in the neuronal cells of the lateral ventricle (Figure 4A)". Figure 4A seems to show no expression at all for GCH1 in the displayed cell lines. Should neuronal cells of the lateral ventricle be included here and labeled for non-specialists?
[x] It is unclear how HPAanalyze was used to generate figure 4C. I suspect it wasn't but then it is unclear where this data came from, as there are no accompanying methods for this sub figure.
[x] When I tried to use hpaXmlGet() I received the error that the "curl" package was not installed. I installed this package but it wasn't listed as a requirement in the manuscript. Should it (or can it be) installed as a dependency when issuing the command BiocManager::install("HPAanalyze")?
[x] It would be really helpful for most if the hpaXmlGet () command would also accept a gene name in addition to an Ensembl ID. I realize the HPA uses Ensembl as their primary gene identifier but their downloadable data always seems to list both the Ensembl ID and the gene name. It would be great if this could be implemented.
[x] It would also be nice if there were a relevant example on how to use the hpaXML functions.
[x] Figures 1, 2A and 2C don't seem to be referenced anywhere in the text.

Reviewer 3: The manuscript by Tran et al presents HPAanalyze, an R package simplifying access to data from the Human Protein Atlas (HPA).

The manuscript is written in clear English. HPAanalyze is a thin layer on top of existing R functionality. Given the popularity of HPA, the package is likely to be useful to the community. I'm not sure the presented functionality warrants a publication, but apart from that there are no major issues in the presented manuscript.

Minor points:

[x] Page 9, second paragraph: "tibble" is R jargon and should be explained.
[x] The case studies and figures 2-4 should have specific, complete R code associated with them where possible, either as supplementary material, or through references to GitHub.

anhtr commented 5 years ago

Currently the hpaExport command only supports exporting in Excel format. Having it support CSV and JSON as export formats would greatly enhance the compatibility of this package with downstream pipelines outside of R. As is, users would likely need to export using an alternate method or convert xlsx into their desired format after exporting which makes it less likely this function would be used.

Added options for .csv and .tsv to hpaExport. Added JSON support through vignette to reduce package dependencies.

anhtr commented 5 years ago

When I tried to use hpaXmlGet() I received the error that the "curl" package was not installed. I installed this package but it wasn't listed as a requirement in the manuscript. Should it (or can it be) installed as a dependency when issuing the command BiocManager::install("HPAanalyze")?

This resulted from the fact that the xml2::download_xml used curl to download xml file, but xml2 did not import the package. We have replace download_xml with a base r equivalent.

anhtr commented 5 years ago

It would be really helpful for most if the hpaXmlGet () command would also accept a gene name in addition to an Ensembl ID. I realize the HPA uses Ensembl as their primary gene identifier but their downloadable data always seems to list both the Ensembl ID and the gene name. It would be great if this could be implemented.

We have updated the package with the ability to automatically convert between HGNC gene symbols and ensembl ids in functions which use the argument "targetGene", "targetEnsemblId" and "inputXml". For functions which can process more than one gene at a time, users can now use a vector of mixed gene symbols and ensembl ids.

anhtr commented 5 years ago

What is the evidence requirement for "Detected" status when using the hpaVisSubcell() function? Is it enhanced, supported, approved or uncertain? It would be nice to have an argument that the user could specify to control this, for example if I want anything with at least the approved status to be considered as "Detected", I could run hpaVisSubcell(data, status="approved"). I realize I could filter the data before passing it to this function but if this package is meant to be used by people with little programming experience, this would likely be very helpful for them. This issue becomes apparent in case study 2 when the authors state PTEN is present in the cytosol and nucleoplasm, but at what evidence level? According to the HPA it is with "supported" status. If this is the default I think it needs to be very explicit, otherwise people might assume it was the validated/enhanced status.

In the original manuscript, the hpaVisSubcell function plot everything, including: enhanced, supported, approved and uncertain. We recognized that this was undesirable and had updated the function to include the argument reliability through which users could specify which evidences they prefer. For reproducibility purpose, the default is still to plot everything.

In the revised manuscript, we explicitly plotted with reliability = c("enhanced", "supported", "approved") in Figures 2C, 3C, 4D.

anhtr commented 5 years ago

I had trouble installing using the installer script provided. By a stroke of determined try, we removed the phrase "Version 3.9" then the installation worked! This needs to be checked. The package is working fine and functions are working. I have only minor comments.

During the previous release cycle, Bioconductor updated the installer script where BiocLite was replaced with BiocManager with versioning, which caused the issue. The issue should now be rectified, and the correct installer script are:

## for stable Bioconductor release
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("HPAanalyze")

## for developmental version
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("trannhatanh89/HPAanalyze")

anhtr commented 5 years ago

In the data under $Pathology, it is recommended that the units of the numbers be displayed as well such as FPKM values??

In the Pathology dataset, the numbers are defined as "the number of patients annotated for different staining levels". We have updated the "Working with HPA Downloadable Datasets" section to reflect that.

anhtr commented 5 years ago

There are many instances of missing references that are labeled as "Error! Reference source not found."

The issue was caused when we edited the draft Word document which contain Figures and Table cross-references to match the journal's formatting. We have corrected it in our revised manuscript.

anhtr commented 5 years ago

Page 7, line 39: After subsetting histology data for entries matching the gene MKI67 and breast tissue, the authors state: "These data are congruent with the expected pattern of Ki67 expression based on the literature and validate our approach." Comparing the displayed results with the HPA downloads shows that the hpaSubset command does appropriately subset their data, but the correctness of the localization of MKI67 and its expression in breast tissue has nothing to do with your package. This results from the methodology of the HPA.

We have removed the confusing statement.

anhtr commented 5 years ago

Page 9, second paragraph: "tibble" is R jargon and should be explained.

We have replaced "tibble" with "data frame", with a note at the first mentioning to clarify that we employed the “tibble”/tbl_df variant with citation.

anhtr commented 5 years ago

The case studies presented in the manuscript are nice and very interesting, but since the authors are trying to demonstrate the utility of their package, it is critical that they show the exact commands they are using to subset the HPA data and generate their images. For example, I could reproduce figure 3A by entering the commands. It was nice to know that I understood how to generate the figure, but I think most people would find it more helpful to have it explicitly spelled out in the text somewhere.

The case studies and figures 2-4 should have specific, complete R code associated with them where possible, either as supplementary material, or through references to GitHub.

Codes to generate figures are now available as supplementary material as well as a vignette bundled with the package.

anhtr commented 5 years ago

Figures 1, 2A and 2C don't seem to be referenced anywhere in the text.

Due to an error when the word file was reformatted, many of the figure references was changed to "Error! Reference source not found." This has been corrected in the revised manuscript.

anhtr commented 5 years ago

In case study 3, the authors state "HPAanalyze indicated that GCH1 was not expressed in normal brain (cerebellum, cortex, hippocampus or lateral ventricle/caudate) except for in the neuronal cells of the lateral ventricle (Figure 4A)". Figure 4A seems to show no expression at all for GCH1 in the displayed cell lines. Should neuronal cells of the lateral ventricle be included here and labeled for non-specialists?

Upon reanalysis, we found this statement to erroneous. We have removed this from the manuscript, and included "caudate" in figure 4A.

anhtr commented 5 years ago

It is unclear how HPAanalyze was used to generate figure 4C. I suspect it wasn't but then it is unclear where this data came from, as there are no accompanying methods for this sub figure.

Figure 4C was generated using the GlioVis portal. We have included this information in the figure caption and the details in the supplemental material. We also included the data used to create the plots.

anhtr commented 5 years ago

It would also be nice if there were a relevant example on how to use the hpaXML functions.

We have added Case study 4 which provide an example of how hpaXml functions work.

anhtr / HPAanalyze

Need to address reviewers' concerns #5