anhtr / HPAanalyze

A Bioconductor package to retrieve and analyze data from the Human Protein Atlas
GNU General Public License v3.0

Tissue RNA expression from XML file #23

Closed anhtr closed 7 months ago

anhtr commented 7 months ago

From an email request

Hope you are having a great week and thank you for making this amazing tool!

We had a query about the functions to get and parse XML data.

Specifically, we wanted to extract the Tissue RNA Expression data from the XML file, and would like to know if any built-in function can do that?

The XML tag for that section is: <rnaExpression source="HPA" technology="RNAseq" assayType="tissue">

If there is no built-in function, can you please suggest how to achieve this with xml2?

anhtr commented 7 months ago

The section that you are looking for in the xml file looks like this:

<rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
<data>
<tissue organ="Connective &amp; Soft tissue" ontologyTerms="UBERON:0001013">Adipose tissue</tissue>
<level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="3.9"/>
<level type="proteinCodingRNAExpression" unitRNA="pTPM" expRNA="5.4"/>
<level type="RNAExpression" unitRNA="TPM" expRNA="4.4"/>
<RNASample sampleId="86" unitRNA="nTPM" expRNA="6" sex="Female" age="80"/>
<RNASample sampleId="115" unitRNA="nTPM" expRNA="1.9" sex="Female" age="45"/>
<RNASample sampleId="137" unitRNA="nTPM" expRNA="4.7" sex="Female" age="57"/>
<RNASample sampleId="329" unitRNA="nTPM" expRNA="4.2" sex="Female" age="74"/>
<RNASample sampleId="331" unitRNA="nTPM" expRNA="2.4" sex="Female" age="59"/>
</data>
<data>
<tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
<level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="4.0"/>
<level type="proteinCodingRNAExpression" unitRNA="pTPM" expRNA="6.6"/>
<level type="RNAExpression" unitRNA="TPM" expRNA="5.2"/>
<RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
<RNASample sampleId="88" unitRNA="nTPM" expRNA="3.8" sex="Female" age="36"/>
<RNASample sampleId="89" unitRNA="nTPM" expRNA="3.6" sex="Female" age="63"/>
</data>
...

With xml2, we just need to construct the right xpath for xml_find_all to get to the desired location. Something like this would help:

library(xml2)

# Read the XML file
xml <- read_xml("https://www.proteinatlas.org/ENSG00000134057.xml")

# Extract the desired information
rna_tissue_exp <- xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data') |>
  as_list()

From there you can choose to extract what you want from the resulting list.
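
As a rough sketch of what "extracting from the resulting list" can look like: `as_list()` turns each `<data>` block into a named list whose entries are the child elements, with the XML attributes attached as R attributes. The example below uses a small inline XML string instead of the live file so it runs offline; the `<entry>` wrapper and the trimmed-down content are assumptions made for illustration, and the variable names are arbitrary.

```r
library(xml2)

# Inline stand-in for the HPA file (assumption: the <entry> wrapper is
# illustrative only; the <data> content mirrors the excerpt above).
xml_txt <- '<entry>
  <rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
    <data>
      <tissue organ="Connective &amp; Soft tissue" ontologyTerms="UBERON:0001013">Adipose tissue</tissue>
      <level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="3.9"/>
      <RNASample sampleId="86" unitRNA="nTPM" expRNA="6" sex="Female" age="80"/>
      <RNASample sampleId="115" unitRNA="nTPM" expRNA="1.9" sex="Female" age="45"/>
    </data>
    <data>
      <tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
      <level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="4.0"/>
      <RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
    </data>
  </rnaExpression>
</entry>'

rna_tissue_exp <- read_xml(xml_txt) |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data') |>
  as_list()

# One list element per <data> block
first <- rna_tissue_exp[[1]]

# Element text is stored as the first entry; XML attributes become R attributes
tissue_name <- first$tissue[[1]]
organ <- attr(first$tissue, "organ")

# Keep only the <RNASample> children and read their expRNA attribute
samples <- first[names(first) == "RNASample"]
sample_exp <- as.numeric(
  vapply(samples, function(s) attr(s, "expRNA"), character(1), USE.NAMES = FALSE)
)
```

Filtering by `names(first) == "RNASample"` is used here because a `<data>` block holds several children with the same name, so `first$RNASample` alone would only return the first one.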

> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xml2_1.3.5

loaded via a namespace (and not attached):
 [1] compiler_4.3.1    magrittr_2.0.3    cli_3.6.1         tools_4.3.1      
 [5] pillar_1.9.0      glue_1.6.2        rstudioapi_0.15.0 curl_5.0.2       
 [9] utf8_1.2.3        fansi_1.0.4       vctrs_0.6.3       lifecycle_1.0.3  
[13] rlang_1.1.1       purrr_1.0.2    

SuhasSrinivasan commented 7 months ago

Thank you again for providing a solution to reach the XML tags!

We would greatly appreciate guidance on how to extract the data from each gene's XML rnaExpression section as a data frame, for example:

tissue | sampleId | expRNA | sex | age
Adipose tissue | 86 | 6 | Female | 80
. . .
Adrenal gland | 87 | 4.7 | Female | 62
. . .

anhtr commented 7 months ago

I think something like this may work for your case. It's not a pretty pipe but it gets the work done.

library(xml2)
# library(dplyr)

# Read the XML file
xml <- read_xml("https://www.proteinatlas.org/ENSG00000134057.xml")

# Extract the desired information
rna_tissue_exp <- xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data')

# Initialize empty lists to store data
tissue_list <- list()
sampleId_list <- list()
expRNA_list <- list()
sex_list <- list()
age_list <- list()

# Loop through each <data> element
for (data_node in rna_tissue_exp) {
  # Extract tissue
  tissue <- xml_text(xml_find_first(data_node, ".//tissue"))

  # Extract sample information (look up the <RNASample> nodes once, then read each attribute)
  samples <- xml_find_all(data_node, ".//RNASample")
  sampleId <- xml_attr(samples, "sampleId")
  expRNA <- xml_attr(samples, "expRNA")
  sex <- xml_attr(samples, "sex")
  age <- xml_attr(samples, "age")

  # Append to lists
  tissue_list <- c(tissue_list, rep(tissue, length(sampleId)))
  sampleId_list <- c(sampleId_list, sampleId)
  expRNA_list <- c(expRNA_list, expRNA)
  sex_list <- c(sex_list, sex)
  age_list <- c(age_list, age)
}

# Create data frame
df <- data.frame(
  tissue = unlist(tissue_list),
  sampleId = unlist(sampleId_list),
  expRNA = unlist(expRNA_list),
  sex = unlist(sex_list),
  age = unlist(age_list)
)

# Print the data frame
print(df)
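
Since the request mentioned doing this for each gene's XML, the same steps could also be wrapped in a function. The sketch below is only one possible way to do that: `tissue_rna_samples()` is a hypothetical helper name, not part of HPAanalyze or xml2, and the inline XML (including its `<entry>` wrapper) is a made-up stand-in so the example runs without network access. It also converts `expRNA` and `age` to numbers, which the loop above leaves as character strings.

```r
library(xml2)

# Hypothetical helper (not an HPAanalyze function): returns one row per
# <RNASample> found in the tissue RNA-seq section of a parsed HPA XML document.
tissue_rna_samples <- function(doc) {
  data_nodes <- xml_find_all(
    doc,
    '//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]/data'
  )
  rows <- lapply(data_nodes, function(node) {
    samples <- xml_find_all(node, "./RNASample")
    data.frame(
      tissue   = rep(xml_text(xml_find_first(node, "./tissue")), length(samples)),
      sampleId = xml_attr(samples, "sampleId"),
      expRNA   = as.numeric(xml_attr(samples, "expRNA")),
      sex      = xml_attr(samples, "sex"),
      age      = as.integer(xml_attr(samples, "age")),
      stringsAsFactors = FALSE
    )
  })
  # Stack the per-tissue data frames into a single data frame
  do.call(rbind, rows)
}

# Inline stand-in document (assumption: trimmed and wrapped in <entry> for illustration)
demo_doc <- read_xml('<entry>
  <rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
    <data>
      <tissue organ="Connective &amp; Soft tissue" ontologyTerms="UBERON:0001013">Adipose tissue</tissue>
      <RNASample sampleId="86" unitRNA="nTPM" expRNA="6" sex="Female" age="80"/>
      <RNASample sampleId="115" unitRNA="nTPM" expRNA="1.9" sex="Female" age="45"/>
    </data>
    <data>
      <tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
      <RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
    </data>
  </rnaExpression>
</entry>')

samples_df <- tissue_rna_samples(demo_doc)
print(samples_df)
```

For a real gene, the function should work the same way on `read_xml("https://www.proteinatlas.org/ENSG00000134057.xml")`. For several genes, one option is to call it on each document with `lapply()` and combine the results with `rbind()`.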

> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.2 xml2_1.3.5 

loaded via a namespace (and not attached):
 [1] utf8_1.2.3        R6_2.5.1          tidyselect_1.2.0  magrittr_2.0.3   
 [5] glue_1.6.2        tibble_3.2.1      pkgconfig_2.0.3   generics_0.1.3   
 [9] lifecycle_1.0.3   cli_3.6.1         fansi_1.0.4       vctrs_0.6.3      
[13] compiler_4.3.1    rstudioapi_0.15.0 tools_4.3.1       curl_5.0.2       
[17] pillar_1.9.0      rlang_1.1.1 

SuhasSrinivasan commented 7 months ago

Thank you so much! Works great and is elegant enough for our use case :)

I believe dplyr is not needed for this

rna_tissue_exp = xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data')

anhtr commented 7 months ago

Thank you. That's what I get for copy-pasting partial code.