Closed YonghuiDong closed 2 years ago
This appears to be an issue with applying classyfireR:::parse_json_output to query results, since the list structure of the returned object differs from that returned by get_classification.
We should discuss with the maintainers how they would like to handle query results moving forward (e.g., should the existing S4 methods be used, or should new ones be created?). @wilsontom @stanstrup @jasenfinch any thoughts or suggestions on how best to implement the query workflow?
tl;dr:
=> Does anyone know how to make the function passed to purrr::map ignore NULL?
=> Does anyone have information (or a guess) about the case(s) in which the identifier is not NULL?
Yours, Steffen
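On the first question, a minimal sketch of two common ways to keep a mapped function from tripping over NULL elements. The `entity` list is a toy stand-in for one parsed result, and `safe_name` is a hypothetical helper; neither comes from classyfireR.

```r
library(purrr)

# Toy stand-in for one parsed entity; `identifier` is NULL, as in the
# STRUCTURE query result shown in this thread. Values are illustrative.
entity <- list(
  identifier = NULL,
  kingdom    = list(name = "Organic compounds",
                    chemont_id = "CHEMONTID:0000000"),
  superclass = list(name = "Organic oxygen compounds",
                    chemont_id = "CHEMONTID:0004603")
)

# Option 1: drop NULL (and other zero-length) elements before mapping.
names_compact <- map_chr(compact(entity), "name")

# Option 2: keep every element, but let the mapped function return NA
# when it is handed NULL (hypothetical helper).
safe_name <- function(x) if (is.null(x)) NA_character_ else x$name
names_all <- map_chr(entity, safe_name)
```

Option 1 shortens the result, while Option 2 preserves positions, which matters if the output is later matched back to the input by index.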
Hm, I get a slightly different error (sessionInfo at end):
> submit_query(label = 'query_test', input = 'COC1=C(C=CC(=C1)C(=O)O)O', type = 'STRUCTURE')
Error: Argument 2 is a list, must contain atomic vectors
I don't think the result jsons are vastly different.
The results from a query with type = 'STRUCTURE' are:
> str(json_res[["entities"]][[1]], max.level = 1)
List of 18
$ identifier : NULL
$ smiles : chr "COC1=C(O)C=CC(=C1)C(O)=O"
$ inchikey : chr "InChIKey=WKOLLVMJNQIZCI-UHFFFAOYSA-N"
$ kingdom :List of 4
...
$ predicted_chebi_terms :List of 18
$ predicted_lipidmaps_terms: list()
$ classification_version : chr "2.1"
and the call in get_classification("BRMWTNUJHUMWMS-LURJTMIESA-N") gives:
> str(json_res, max.level = 1)
List of 17
$ smiles : chr "[H][C@](N)(CC1=CN(C)C=N1)C(O)=O"
$ inchikey : chr "InChIKey=BRMWTNUJHUMWMS-LURJTMIESA-N"
$ kingdom :List of 4
[...]
$ predicted_chebi_terms : chr [1:27] "L-alpha-amino acid (CHEBI:15705)" "imidazolyl carboxylic acid (CHEBI:38307)" "aralkylamine (CHEBI:18000)" "imidazoles (CHEBI:24780)" ...
$ predicted_lipidmaps_terms: list()
$ classification_version : chr "2.1"
The error is in bind_rows(.), which calls into bind_rows_(x, .id), where .id is NULL:
...
10: dplyr::bind_rows(.)
...
4: purrr::map(1:length(list_output), ~{
l <- list_output[[.]]
tibble::tibble(Level = names(list_output)[.], Classification = l$name,
CHEMONT = l$chemont_id)
}) %>% dplyr::bind_rows() %>% dplyr::filter(!duplicated(Classification)) at internals.R#32
3: .f(.x[[i]], ...)
2: purrr::map(json_res$entities, parse_json_output) at query.R#43
1: submit_query(label = "query_test", input = "COC1=C(C=CC(=C1)C(=O)O)O",
type = "STRUCTURE")
And my sessionInfo():
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] classyfireR_0.3.4 magrittr_1.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 fansi_0.4.1 utf8_1.1.4 assertthat_0.2.1
[5] dplyr_0.8.5 crayon_1.3.4 R6_2.4.1 lifecycle_0.2.0
[9] jsonlite_1.6.1 pillar_1.4.4 httr_1.4.1 cli_2.0.2
[13] rlang_0.4.6 curl_4.3 vctrs_0.2.4 ellipsis_0.3.0
[17] rjson_0.2.20 tools_4.0.2 glue_1.4.0 purrr_0.3.4
[21] compiler_4.0.2 pkgconfig_2.0.3 tidyselect_1.0.0 tibble_3.0.1
@YonghuiDong @gjgetzinger Apologies for taking so long to get to this issue.
I've not had a chance to have a proper run through of the new function that @gjgetzinger added yet, so I need to re-familiarise myself with the output later today. Ideally, if we can re-use the existing S4 methods where possible, then I think that would be best. If there are major differences between the outputs from InChIKey and SMILES submissions, then this can be handled by the parse functions. But, as @sneumann says, the outputs seem very similar.
I think both of these errors can be addressed by reworking parse_json_output, which currently expects only one element at a time and seems to fail when elements are missing from the list.
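One way such a rework might look, as a rough sketch only: look the taxonomy levels up by name rather than by position, and skip any that are NULL or empty. The function `parse_entity_sketch` and its `levels` argument are hypothetical and are not the package's actual implementation.

```r
library(purrr)
library(tibble)

# Hypothetical helper, not part of classyfireR: builds the
# Level / Classification / CHEMONT table from one entity, tolerating
# missing, NULL, or empty levels.
parse_entity_sketch <- function(entity,
                                levels = c("kingdom", "superclass", "class",
                                           "subclass", "direct_parent")) {
  # Keep only the requested levels that exist, then drop empty ones.
  present <- compact(entity[intersect(levels, names(entity))])
  tibble(
    Level          = as.character(names(present)),
    Classification = unname(map_chr(present, "name",
                                    .default = NA_character_)),
    CHEMONT        = unname(map_chr(present, "chemont_id",
                                    .default = NA_character_))
  )
}

# Toy entity: NULL identifier, an empty subclass, no superclass or class.
toy <- list(
  identifier    = NULL,
  kingdom       = list(name = "Organic compounds",
                       chemont_id = "CHEMONTID:0000000"),
  subclass      = list(),
  direct_parent = list(name = "Dialkyl ethers",
                       chemont_id = "CHEMONTID:0001167")
)
res <- parse_entity_sketch(toy)
```

Because the lookup is by name, an extra leading element such as identifier does not shift the levels, and the same function could in principle be mapped over json_res$entities.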
I suggest returning httr::content with as = "text" and parsing the result with tidyjson.
For example, with
input <- c(MOL1 = 'CCCOCC', MOL2 = 'COCC=CCC')
the value of json_res (i.e., without using parse_json_output) is:
[1] "{\"id\":4479666,\"label\":\"query_test\",\"classification_status\":\"Done\",\"number_of_elements\":2,\"number_of_pages\":1,\"invalid_entities\":[],\"entities\":[{\"identifier\":\"MOL1\",\"smiles\":\"CCCOCC\",\"inchikey\":\"InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N\",\"kingdom\":{\"name\":\"Organic compounds\",\"description\":\"Compounds that contain at least one carbon atom, excluding isocyanide/cyanide and their non-hydrocarbyl derivatives, thiophosgene, carbon diselenide, carbon monosulfide, carbon disulfide, carbon subsulfide, carbon monoxide, carbon dioxide, Carbon suboxide, and dicarbon monoxide.\",\"chemont_id\":\"CHEMONTID:0000000\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000000\"},\"superclass\":{\"name\":\"Organic oxygen compounds\",\"description\":\"Organic compounds that contain one or more oxygen atoms.\",\"chemont_id\":\"CHEMONTID:0004603\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0004603\"},\"class\":{\"name\":\"Organooxygen compounds\",\"description\":\"Organic compounds containing a bond between a carbon atom and an oxygen atom.\",\"chemont_id\":\"CHEMONTID:0000323\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000323\"},\"subclass\":{\"name\":\"Ethers\",\"description\":\"Compounds bearing an ether group with the formula Compounds ROR (R not H).\",\"chemont_id\":\"CHEMONTID:0000254\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000254\"},\"intermediate_nodes\":[],\"direct_parent\":{\"name\":\"Dialkyl ethers\",\"description\":\"Organic compounds containing the dialkyl ether functional group, with the formula ROR', where R and R' are alkyl groups.\",\"chemont_id\":\"CHEMONTID:0001167\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0001167\"},\"alternative_parents\":[{\"name\":\"Hydrocarbon derivatives\",\"description\":\"Derivatives of hydrocarbons obtained by substituting one or more carbon atoms by an heteroatom. 
They contain at least one carbon atom and heteroatom.\",\"chemont_id\":\"CHEMONTID:0004150\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0004150\"}],\"molecular_framework\":\"Aliphatic acyclic compounds\",\"substituents\":[\"Dialkyl ether\",\"Hydrocarbon derivative\",\"Aliphatic acyclic compound\"],\"description\":\"This compound belongs to the class of organic compounds known as dialkyl ethers. These are organic compounds containing the dialkyl ether functional group, with the formula ROR', where R and R' are alkyl groups.\",\"external_descriptors\":[],\"ancestors\":[\"Chemical entities\",\"Dialkyl ethers\",\"Ethers\",\"Hydrocarbon derivatives\",\"Organic compounds\",\"Organic oxygen compounds\",\"Organooxygen compounds\"],\"predicted_chebi_terms\":[\"chemical entity (CHEBI:24431)\",\"organic molecular entity (CHEBI:50860)\",\"ether (CHEBI:25698)\",\"organooxygen compound (CHEBI:36963)\",\"organic molecule (CHEBI:72695)\",\"oxygen molecular entity (CHEBI:25806)\"],\"predicted_lipidmaps_terms\":[],\"classification_version\":\"2.1\"},{\"identifier\":\"MOL2\",\"smiles\":\"CCC=CCOC\",\"inchikey\":\"InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N\",\"kingdom\":{\"name\":\"Organic compounds\",\"description\":\"Compounds that contain at least one carbon atom, excluding isocyanide/cyanide and their non-hydrocarbyl derivatives, thiophosgene, carbon diselenide, carbon monosulfide, carbon disulfide, carbon subsulfide, carbon monoxide, carbon dioxide, Carbon suboxide, and dicarbon monoxide.\",\"chemont_id\":\"CHEMONTID:0000000\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000000\"},\"superclass\":{\"name\":\"Organic oxygen compounds\",\"description\":\"Organic compounds that contain one or more oxygen atoms.\",\"chemont_id\":\"CHEMONTID:0004603\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0004603\"},\"class\":{\"name\":\"Organooxygen compounds\",\"description\":\"Organic compounds containing a bond between a carbon atom and an oxygen 
atom.\",\"chemont_id\":\"CHEMONTID:0000323\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000323\"},\"subclass\":{\"name\":\"Ethers\",\"description\":\"Compounds bearing an ether group with the formula Compounds ROR (R not H).\",\"chemont_id\":\"CHEMONTID:0000254\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000254\"},\"intermediate_nodes\":[],\"direct_parent\":{\"name\":\"Dialkyl ethers\",\"description\":\"Organic compounds containing the dialkyl ether functional group, with the formula ROR', where R and R' are alkyl groups.\",\"chemont_id\":\"CHEMONTID:0001167\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0001167\"},\"alternative_parents\":[{\"name\":\"Hydrocarbon derivatives\",\"description\":\"Derivatives of hydrocarbons obtained by substituting one or more carbon atoms by an heteroatom. They contain at least one carbon atom and heteroatom.\",\"chemont_id\":\"CHEMONTID:0004150\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0004150\"}],\"molecular_framework\":\"Aliphatic acyclic compounds\",\"substituents\":[\"Dialkyl ether\",\"Hydrocarbon derivative\",\"Aliphatic acyclic compound\"],\"description\":\"This compound belongs to the class of organic compounds known as dialkyl ethers. These are organic compounds containing the dialkyl ether functional group, with the formula ROR', where R and R' are alkyl groups.\",\"external_descriptors\":[],\"ancestors\":[\"Chemical entities\",\"Dialkyl ethers\",\"Ethers\",\"Hydrocarbon derivatives\",\"Organic compounds\",\"Organic oxygen compounds\",\"Organooxygen compounds\"],\"predicted_chebi_terms\":[\"chemical entity (CHEBI:24431)\",\"organic molecular entity (CHEBI:50860)\",\"ether (CHEBI:25698)\",\"organooxygen compound (CHEBI:36963)\",\"organic molecule (CHEBI:72695)\",\"oxygen molecular entity (CHEBI:25806)\"],\"predicted_lipidmaps_terms\":[],\"classification_version\":\"2.1\"}]}"
This can be coerced to the desired format using the following snippet:
# Flatten the "entities" array into one row per compound
json_tib <- json_res %>%
  tidyjson::enter_object(entities) %>%
  tidyjson::gather_array() %>%
  tidyjson::spread_all() %>%
  dplyr::as_tibble() %>%
  dplyr::group_by(inchikey)
# Long table of level names (kingdom.name, superclass.name, ...)
Classification <- json_tib %>%
  dplyr::select(inchikey, dplyr::ends_with(".name")) %>%
  tidyr::pivot_longer(-inchikey, values_to = "Classification") %>%
  tidyr::separate(col = name, into = c("Level", "TYPE")) %>%
  dplyr::select(-TYPE)
# Long table of the matching ChemOnt IDs
CHEMONT <- json_tib %>%
  dplyr::select(inchikey, dplyr::ends_with(".chemont_id")) %>%
  tidyr::pivot_longer(-inchikey, values_to = "CHEMONT") %>%
  tidyr::separate(col = name, into = c("Level", "TYPE")) %>%
  dplyr::select(-TYPE)
# Join names and IDs per compound and level
class_tibble <- dplyr::left_join(Classification, CHEMONT, by = c("inchikey", "Level")) %>%
  dplyr::ungroup()
Which gives:
# A tibble: 10 x 4
inchikey Level Classification CHEMONT
<chr> <chr> <chr> <chr>
1 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N kingdom Organic compounds CHEMONTID:0000000
2 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N superclass Organic oxygen compounds CHEMONTID:0004603
3 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N class Organooxygen compounds CHEMONTID:0000323
4 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N subclass Ethers CHEMONTID:0000254
5 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N direct Dialkyl ethers CHEMONTID:0001167
6 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N kingdom Organic compounds CHEMONTID:0000000
7 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N superclass Organic oxygen compounds CHEMONTID:0004603
8 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N class Organooxygen compounds CHEMONTID:0000323
9 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N subclass Ethers CHEMONTID:0000254
10 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N direct Dialkyl ethers CHEMONTID:0001167
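Note that the Level column above shows "direct" rather than "direct_parent". A likely explanation is that tidyr::separate splits on any run of non-alphanumeric characters by default, so the underscore is treated as a separator as well. The sketch below illustrates this on a toy tibble (`cols` is made up for the example) and shows that passing sep = "\\." keeps the full level name.

```r
library(tibble)
library(tidyr)

# Toy column names in the shape produced by tidyjson::spread_all().
cols <- tibble(name = c("kingdom.name", "direct_parent.name"))

# Default separator: "direct_parent.name" yields three pieces, and the
# surplus piece is dropped with a warning (suppressed here).
default_split <- suppressWarnings(
  separate(cols, col = name, into = c("Level", "TYPE"))
)

# Splitting on the literal dot only keeps "direct_parent" intact.
dot_split <- separate(cols, col = name, into = c("Level", "TYPE"),
                      sep = "\\.")
```

Whether "direct" or "direct_parent" is the desired label depends on what the existing S4 methods expect, so this is a choice for the maintainers rather than a definite fix.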
Hi,
I saw that @gjgetzinger has added a new function that allows compounds to be classified from their SMILES strings. Thanks a lot! It is a very helpful and much-needed function.
I have tested this function with a couple of SMILES strings, but for some of them it didn't work.
I got the following error:
It worked when I queried the same SMILES string on the ClassyFire website.
Could you please help?
Many thanks.
Dong