Closed anhtr closed 7 months ago
The section that you are looking for in the xml file looks like this:
<rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
<data>
<tissue organ="Connective &amp; Soft tissue" ontologyTerms="UBERON:0001013">Adipose tissue</tissue>
<level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="3.9"/>
<level type="proteinCodingRNAExpression" unitRNA="pTPM" expRNA="5.4"/>
<level type="RNAExpression" unitRNA="TPM" expRNA="4.4"/>
<RNASample sampleId="86" unitRNA="nTPM" expRNA="6" sex="Female" age="80"/>
<RNASample sampleId="115" unitRNA="nTPM" expRNA="1.9" sex="Female" age="45"/>
<RNASample sampleId="137" unitRNA="nTPM" expRNA="4.7" sex="Female" age="57"/>
<RNASample sampleId="329" unitRNA="nTPM" expRNA="4.2" sex="Female" age="74"/>
<RNASample sampleId="331" unitRNA="nTPM" expRNA="2.4" sex="Female" age="59"/>
</data>
<data>
<tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
<level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="4.0"/>
<level type="proteinCodingRNAExpression" unitRNA="pTPM" expRNA="6.6"/>
<level type="RNAExpression" unitRNA="TPM" expRNA="5.2"/>
<RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
<RNASample sampleId="88" unitRNA="nTPM" expRNA="3.8" sex="Female" age="36"/>
<RNASample sampleId="89" unitRNA="nTPM" expRNA="3.6" sex="Female" age="63"/>
</data>
...
With xml2, we just need to construct the right XPath for xml_find_all to get to the desired location. Something like this would help:
library(xml2)
# Read the XML file
xml <- read_xml("https://www.proteinatlas.org/ENSG00000134057.xml")
# Extract the desired information
rna_tissue_exp <- xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data') |>
  as_list()
From there you can choose to extract what you want from the resulting list.
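As a rough sketch of what "extracting from the resulting list" can look like: as_list() keeps the element text as the first child of each entry and stores the XML attributes as R attributes. The example below uses a small inline excerpt wrapped in a hypothetical `<root>` element instead of the downloaded file, so the variable names and the wrapper are assumptions, not part of the real document.

```r
library(xml2)

# Inline excerpt standing in for the downloaded file.
# Assumption: <root> is a made-up wrapper; the real file has more levels.
doc <- read_xml('<root>
  <rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
    <data>
      <tissue organ="Connective &amp; Soft tissue" ontologyTerms="UBERON:0001013">Adipose tissue</tissue>
      <level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="3.9"/>
      <RNASample sampleId="86" unitRNA="nTPM" expRNA="6" sex="Female" age="80"/>
      <RNASample sampleId="115" unitRNA="nTPM" expRNA="1.9" sex="Female" age="45"/>
    </data>
    <data>
      <tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
      <level type="normalizedRNAExpression" unitRNA="nTPM" expRNA="4.0"/>
      <RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
    </data>
  </rnaExpression>
</root>')

data_list <- doc |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data') |>
  as_list()

# The text of <tissue> is the first child of the "tissue" entry.
tissues <- vapply(data_list, function(d) d$tissue[[1]], character(1))

# Several children share the name "RNASample", so select them by name.
first_data <- data_list[[1]]
first_samples <- first_data[names(first_data) == "RNASample"]

# XML attributes are stored as R attributes on each entry.
sample_ids <- unname(vapply(
  first_samples,
  function(s) attr(s, "sampleId", exact = TRUE),
  character(1)
))

print(tissues)
print(sample_ids)
```

Working on the list is fine for a quick look, but because children with the same name repeat, querying the nodes directly with xml_attr() is usually less fiddly when you need every sample.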
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xml2_1.3.5
loaded via a namespace (and not attached):
[1] compiler_4.3.1 magrittr_2.0.3 cli_3.6.1 tools_4.3.1
[5] pillar_1.9.0 glue_1.6.2 rstudioapi_0.15.0 curl_5.0.2
[9] utf8_1.2.3 fansi_1.0.4 vctrs_0.6.3 lifecycle_1.0.3
[13] rlang_1.1.1 purrr_1.0.2
Thank you again for providing a solution to reach the XML tags!
Could you also show how to extract the data from each XML/gene's rnaExpression section as a data frame, like this?
tissue | sampleId | expRNA | sex | age |
---|---|---|---|---|
Adipose tissue | 86 | 6 | Female | 80 |
... | ... | ... | ... | ... |
Adrenal gland | 87 | 4.7 | Female | 62 |
... | ... | ... | ... | ... |
I think something like this may work for your case. It's not a pretty pipe but it gets the work done.
library(xml2)
# library(dplyr)
# Read the XML file
xml <- read_xml("https://www.proteinatlas.org/ENSG00000134057.xml")
# Extract the desired information
rna_tissue_exp <- xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data')
# Initialize empty lists to store data
tissue_list <- list()
sampleId_list <- list()
expRNA_list <- list()
sex_list <- list()
age_list <- list()
# Loop through each <data> element
for (data_node in rna_tissue_exp) {
  # Extract tissue
  tissue <- xml_text(xml_find_first(data_node, ".//tissue"))
  # Extract sample information (look up the <RNASample> nodes once)
  samples <- xml_find_all(data_node, ".//RNASample")
  sampleId <- xml_attr(samples, "sampleId")
  expRNA <- xml_attr(samples, "expRNA")
  sex <- xml_attr(samples, "sex")
  age <- xml_attr(samples, "age")
  # Append to lists
  tissue_list <- c(tissue_list, rep(tissue, length(sampleId)))
  sampleId_list <- c(sampleId_list, sampleId)
  expRNA_list <- c(expRNA_list, expRNA)
  sex_list <- c(sex_list, sex)
  age_list <- c(age_list, age)
}
# Create data frame
df <- data.frame(
  tissue = unlist(tissue_list),
  sampleId = unlist(sampleId_list),
  expRNA = unlist(expRNA_list),
  sex = unlist(sex_list),
  age = unlist(age_list)
)
# Print the data frame
print(df)
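Note that xml_attr() returns character vectors, so expRNA and age in the data frame above are strings. If numeric columns are wanted, a more compact variant could build one small data frame per <data> node and bind them. This is only a sketch on an inline excerpt: the `<root>` wrapper and the names `data_nodes` and `rows` are assumptions for illustration.

```r
library(xml2)

# Inline excerpt standing in for the downloaded file.
# Assumption: <root> is a made-up wrapper element.
doc <- read_xml('<root>
  <rnaExpression source="HPA" technology="RNAseq" assayType="tissue">
    <data>
      <tissue organ="Connective &amp; Soft tissue" ontologyTerms="UBERON:0001013">Adipose tissue</tissue>
      <RNASample sampleId="86" unitRNA="nTPM" expRNA="6" sex="Female" age="80"/>
      <RNASample sampleId="115" unitRNA="nTPM" expRNA="1.9" sex="Female" age="45"/>
    </data>
    <data>
      <tissue organ="Endocrine tissues" ontologyTerms="UBERON:0002369">Adrenal gland</tissue>
      <RNASample sampleId="87" unitRNA="nTPM" expRNA="4.7" sex="Female" age="62"/>
      <RNASample sampleId="88" unitRNA="nTPM" expRNA="3.8" sex="Female" age="36"/>
    </data>
  </rnaExpression>
</root>')

data_nodes <- xml_find_all(
  doc,
  '//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]//data'
)

# One data frame per <data> node: the tissue name repeated for each sample.
rows <- lapply(seq_along(data_nodes), function(i) {
  node <- data_nodes[[i]]
  samples <- xml_find_all(node, "./RNASample")
  data.frame(
    tissue = rep(xml_text(xml_find_first(node, "./tissue")), length(samples)),
    sampleId = xml_attr(samples, "sampleId"),
    expRNA = as.numeric(xml_attr(samples, "expRNA")),  # character -> numeric
    sex = xml_attr(samples, "sex"),
    age = as.integer(xml_attr(samples, "age")),        # character -> integer
    stringsAsFactors = FALSE
  )
})

# Stack the per-tissue data frames into one.
df <- do.call(rbind, rows)
print(df)
```

Keeping sampleId as character is a deliberate choice here, since it is an identifier rather than a quantity.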
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.1.2 xml2_1.3.5
loaded via a namespace (and not attached):
[1] utf8_1.2.3 R6_2.5.1 tidyselect_1.2.0 magrittr_2.0.3
[5] glue_1.6.2 tibble_3.2.1 pkgconfig_2.0.3 generics_0.1.3
[9] lifecycle_1.0.3 cli_3.6.1 fansi_1.0.4 vctrs_0.6.3
[13] compiler_4.3.1 rstudioapi_0.15.0 tools_4.3.1 curl_5.0.2
[17] pillar_1.9.0 rlang_1.1.1
Thank you so much! Works great and is elegant enough for our use case :)
I believe dplyr is not needed for this:
rna_tissue_exp = xml |>
  xml_find_all('//rnaExpression[@source="HPA" and @technology="RNAseq" and @assayType="tissue"]') |>
  xml_find_all('.//data')
Thank you. That's what I get for copy-pasting partial code.
From an email request