classification based on SMILES

YonghuiDong commented 4 years ago

Hi,

I saw that @gjgetzinger has added a new function for classifying compounds from their SMILES strings. Thanks a lot! It is a very helpful and much-needed function.

I have tested this function with a couple of SMILES strings, but for some of them it didn't work.

I got the following error:

submit_query(label = 'query_test', input = 'COC1=C(C=CC(=C1)C(=O)O)O', type = 'STRUCTURE')

Error in vec_rbind(!!!dots, .names_to = .id) : Internal error in vec_proxy_assign_opts(): proxy of type character incompatible with value proxy of type NULL.

It worked when I queried the same SMILES string on the ClassyFire website.

Could you please help?

Many thanks.

Dong

gjgetzinger commented 4 years ago

This appears to be an issue with applying classyfireR:::parse_json_output to query results, since the list structure of the returned object is different from that returned by get_classification.

We should discuss with the maintainers how they would like to handle query results moving forward (e.g., should the existing S4 methods be used, or should new ones be created?). @wilsontom @stanstrup @jasenfinch any thoughts or suggestions on how best to implement the query workflow?

sneumann commented 4 years ago

tl;dr: => Does anyone know how to make the function passed to purrr::map ignore NULL? => Does anyone have information, or a guess, about the case(s) in which the identifier is not NULL?

Yours, Steffen

Hm, I get a slightly different error (sessionInfo at end):

> submit_query(label = 'query_test', input = 'COC1=C(C=CC(=C1)C(=O)O)O', type = 'STRUCTURE')
Error: Argument 2 is a list, must contain atomic vectors

I don't think the JSON results are vastly different. The results from a query with type = 'STRUCTURE' are:

> str(json_res[["entities"]][[1]], max.level = 1)
List of 18
 $ identifier               : NULL
 $ smiles                   : chr "COC1=C(O)C=CC(=C1)C(O)=O"
 $ inchikey                 : chr "InChIKey=WKOLLVMJNQIZCI-UHFFFAOYSA-N"
 $ kingdom                  :List of 4
 ...
 $ predicted_chebi_terms    :List of 18
 $ predicted_lipidmaps_terms: list()
 $ classification_version   : chr "2.1"

and the call in get_classification("BRMWTNUJHUMWMS-LURJTMIESA-N") gives:

> str(json_res, max.level = 1)
List of 17
 $ smiles                   : chr "[H][C@](N)(CC1=CN(C)C=N1)C(O)=O"
 $ inchikey                 : chr "InChIKey=BRMWTNUJHUMWMS-LURJTMIESA-N"
 $ kingdom                  :List of 4
[...]
 $ predicted_chebi_terms    : chr [1:27] "L-alpha-amino acid (CHEBI:15705)" "imidazolyl carboxylic acid (CHEBI:38307)" "aralkylamine (CHEBI:18000)" "imidazoles (CHEBI:24780)" ...
 $ predicted_lipidmaps_terms: list()
 $ classification_version   : chr "2.1"

The error is in bind_rows(.), which calls into bind_rows_(x, .id) and .id is NULL

...
10: dplyr::bind_rows(.)
...
4: purrr::map(1:length(list_output), ~{
       l <- list_output[[.]]
       tibble::tibble(Level = names(list_output)[.], Classification = l$name, 
           CHEMONT = l$chemont_id)
   }) %>% dplyr::bind_rows() %>% dplyr::filter(!duplicated(Classification)) at internals.R#32
3: .f(.x[[i]], ...)
2: purrr::map(json_res$entities, parse_json_output) at query.R#43
1: submit_query(label = "query_test", input = "COC1=C(C=CC(=C1)C(=O)O)O", 
       type = "STRUCTURE")

And my sessionInfo():

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] classyfireR_0.3.4 magrittr_1.5     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     fansi_0.4.1      utf8_1.1.4       assertthat_0.2.1
 [5] dplyr_0.8.5      crayon_1.3.4     R6_2.4.1         lifecycle_0.2.0 
 [9] jsonlite_1.6.1   pillar_1.4.4     httr_1.4.1       cli_2.0.2       
[13] rlang_0.4.6      curl_4.3         vctrs_0.2.4      ellipsis_0.3.0  
[17] rjson_0.2.20     tools_4.0.2      glue_1.4.0       purrr_0.3.4     
[21] compiler_4.0.2   pkgconfig_2.0.3  tidyselect_1.0.0 tibble_3.0.1
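
On the tl;dr question about making the function passed to purrr::map ignore NULL, one possible approach is to filter each entity before building the tibble, keeping only elements that look like taxonomy levels. The following is a minimal sketch under stated assumptions: `entity` is a simplified, hypothetical stand-in for one element of json_res$entities (values copied from the example output in this thread), and `is_level` and `parse_levels` are illustrative helpers, not part of classyfireR.

```r
library(purrr)
library(tibble)

# Hypothetical, simplified stand-in for one element of json_res$entities.
entity <- list(
  identifier = NULL,
  smiles     = "CCCOCC",
  kingdom    = list(name = "Organic compounds",
                    chemont_id = "CHEMONTID:0000000"),
  superclass = list(name = "Organic oxygen compounds",
                    chemont_id = "CHEMONTID:0004603"),
  intermediate_nodes = list()
)

# TRUE only for lists that carry both a name and a ChemOnt ID;
# NULL, character vectors and empty lists are rejected.
is_level <- function(x) {
  is.list(x) && !is.null(x[["name"]]) && !is.null(x[["chemont_id"]])
}

# Hypothetical helper (assumption, not the package's parser):
# drop everything that is not a taxonomy level, then build one row per level.
parse_levels <- function(entity) {
  lv <- purrr::keep(entity, is_level)
  tibble::tibble(
    Level          = names(lv),
    Classification = unname(purrr::map_chr(lv, "name")),
    CHEMONT        = unname(purrr::map_chr(lv, "chemont_id"))
  )
}

res <- parse_levels(entity)
```

Because the filtering happens before any column is built, a NULL identifier or an empty intermediate_nodes list never reaches tibble::tibble. For several entities, something like purrr::map(json_res$entities, parse_levels) should then return one tibble per compound.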

wilsontom commented 4 years ago

@YonghuiDong @gjgetzinger Apologies for taking so long to get to this issue.

I've not had a chance to run through the new function that @gjgetzinger added properly yet, so I need to re-familiarise myself with the output later today. Ideally, if we can re-use the existing S4 methods where possible, I think that would be best. If there are major differences between the outputs from InChIKey and SMILES submissions, these can be handled by the parse functions. But, as @sneumann says, the outputs seem very similar.

gjgetzinger commented 4 years ago

I think both of these errors can be addressed by reworking parse_json_output, which currently expects only one element at a time and seems to fail when elements are missing from the list.

I suggest returning the result of httr::content with as = "text" and parsing it with tidyjson.

For example, with

input <- c(MOL1 = 'CCCOCC', MOL2 = 'COCC=CCC')

The value of json_res (i.e., without using parse_json_output) is

[1] "{\"id\":4479666,\"label\":\"query_test\",\"classification_status\":\"Done\",\"number_of_elements\":2,\"number_of_pages\":1,\"invalid_entities\":[],\"entities\":[{\"identifier\":\"MOL1\",\"smiles\":\"CCCOCC\",\"inchikey\":\"InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N\",\"kingdom\":{\"name\":\"Organic compounds\",\"description\":\"Compounds that contain at least one carbon atom, excluding isocyanide/cyanide and their non-hydrocarbyl derivatives, thiophosgene, carbon diselenide, carbon monosulfide, carbon disulfide, carbon subsulfide, carbon monoxide, carbon dioxide, Carbon suboxide, and dicarbon monoxide.\",\"chemont_id\":\"CHEMONTID:0000000\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000000\"},\"superclass\":{\"name\":\"Organic oxygen compounds\",\"description\":\"Organic compounds that contain one or more oxygen atoms.\",\"chemont_id\":\"CHEMONTID:0004603\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0004603\"},\"class\":{\"name\":\"Organooxygen compounds\",\"description\":\"Organic compounds containing a bond between a carbon atom and an oxygen atom.\",\"chemont_id\":\"CHEMONTID:0000323\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000323\"},\"subclass\":{\"name\":\"Ethers\",\"description\":\"Compounds bearing an ether group with the formula Compounds ROR (R not H).\",\"chemont_id\":\"CHEMONTID:0000254\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000254\"},\"intermediate_nodes\":[],\"direct_parent\":{\"name\":\"Dialkyl ethers\",\"description\":\"Organic compounds containing the dialkyl ether functional group, with the formula ROR', where R and R' are alkyl groups.\",\"chemont_id\":\"CHEMONTID:0001167\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0001167\"},\"alternative_parents\":[{\"name\":\"Hydrocarbon derivatives\",\"description\":\"Derivatives of hydrocarbons obtained by substituting one or more carbon atoms by an heteroatom. 
They contain at least one carbon atom and heteroatom.\",\"chemont_id\":\"CHEMONTID:0004150\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0004150\"}],\"molecular_framework\":\"Aliphatic acyclic compounds\",\"substituents\":[\"Dialkyl ether\",\"Hydrocarbon derivative\",\"Aliphatic acyclic compound\"],\"description\":\"This compound belongs to the class of organic compounds known as dialkyl ethers. These are organic compounds containing the dialkyl ether functional group, with the formula ROR', where R and R' are alkyl groups.\",\"external_descriptors\":[],\"ancestors\":[\"Chemical entities\",\"Dialkyl ethers\",\"Ethers\",\"Hydrocarbon derivatives\",\"Organic compounds\",\"Organic oxygen compounds\",\"Organooxygen compounds\"],\"predicted_chebi_terms\":[\"chemical entity (CHEBI:24431)\",\"organic molecular entity (CHEBI:50860)\",\"ether (CHEBI:25698)\",\"organooxygen compound (CHEBI:36963)\",\"organic molecule (CHEBI:72695)\",\"oxygen molecular entity (CHEBI:25806)\"],\"predicted_lipidmaps_terms\":[],\"classification_version\":\"2.1\"},{\"identifier\":\"MOL2\",\"smiles\":\"CCC=CCOC\",\"inchikey\":\"InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N\",\"kingdom\":{\"name\":\"Organic compounds\",\"description\":\"Compounds that contain at least one carbon atom, excluding isocyanide/cyanide and their non-hydrocarbyl derivatives, thiophosgene, carbon diselenide, carbon monosulfide, carbon disulfide, carbon subsulfide, carbon monoxide, carbon dioxide, Carbon suboxide, and dicarbon monoxide.\",\"chemont_id\":\"CHEMONTID:0000000\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000000\"},\"superclass\":{\"name\":\"Organic oxygen compounds\",\"description\":\"Organic compounds that contain one or more oxygen atoms.\",\"chemont_id\":\"CHEMONTID:0004603\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0004603\"},\"class\":{\"name\":\"Organooxygen compounds\",\"description\":\"Organic compounds containing a bond between a carbon atom and an oxygen 
atom.\",\"chemont_id\":\"CHEMONTID:0000323\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000323\"},\"subclass\":{\"name\":\"Ethers\",\"description\":\"Compounds bearing an ether group with the formula Compounds ROR (R not H).\",\"chemont_id\":\"CHEMONTID:0000254\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0000254\"},\"intermediate_nodes\":[],\"direct_parent\":{\"name\":\"Dialkyl ethers\",\"description\":\"Organic compounds containing the dialkyl ether functional group, with the formula ROR', where R and R' are alkyl groups.\",\"chemont_id\":\"CHEMONTID:0001167\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0001167\"},\"alternative_parents\":[{\"name\":\"Hydrocarbon derivatives\",\"description\":\"Derivatives of hydrocarbons obtained by substituting one or more carbon atoms by an heteroatom. They contain at least one carbon atom and heteroatom.\",\"chemont_id\":\"CHEMONTID:0004150\",\"url\":\"http://classyfire.wishartlab.com/tax_nodes/C0004150\"}],\"molecular_framework\":\"Aliphatic acyclic compounds\",\"substituents\":[\"Dialkyl ether\",\"Hydrocarbon derivative\",\"Aliphatic acyclic compound\"],\"description\":\"This compound belongs to the class of organic compounds known as dialkyl ethers. These are organic compounds containing the dialkyl ether functional group, with the formula ROR', where R and R' are alkyl groups.\",\"external_descriptors\":[],\"ancestors\":[\"Chemical entities\",\"Dialkyl ethers\",\"Ethers\",\"Hydrocarbon derivatives\",\"Organic compounds\",\"Organic oxygen compounds\",\"Organooxygen compounds\"],\"predicted_chebi_terms\":[\"chemical entity (CHEBI:24431)\",\"organic molecular entity (CHEBI:50860)\",\"ether (CHEBI:25698)\",\"organooxygen compound (CHEBI:36963)\",\"organic molecule (CHEBI:72695)\",\"oxygen molecular entity (CHEBI:25806)\"],\"predicted_lipidmaps_terms\":[],\"classification_version\":\"2.1\"}]}"

This can be nicely coerced to the desired format using the following snippet:

library(magrittr)  # provides %>%

# Flatten each entry of "entities" in the raw JSON text into one row
json_tib <- json_res %>%
  tidyjson::enter_object(entities) %>%
  tidyjson::gather_array() %>%
  tidyjson::spread_all() %>%
  dplyr::as_tibble() %>%
  dplyr::group_by(inchikey)

# One row per compound and level, holding the class name
Classification <- json_tib %>%
  dplyr::select(inchikey, dplyr::ends_with(".name")) %>%
  tidyr::pivot_longer(-inchikey, values_to = "Classification") %>%
  tidyr::separate(col = name, into = c("Level", "TYPE")) %>%
  dplyr::select(-TYPE)

# Same reshaping for the ChemOnt identifiers
CHEMONT <- json_tib %>%
  dplyr::select(inchikey, dplyr::ends_with(".chemont_id")) %>%
  tidyr::pivot_longer(-inchikey, values_to = "CHEMONT") %>%
  tidyr::separate(col = name, into = c("Level", "TYPE")) %>%
  dplyr::select(-TYPE)

# Combine names and identifiers for each compound and level
class_tibble <- dplyr::left_join(Classification, CHEMONT, by = c("inchikey", "Level")) %>%
  dplyr::ungroup()

Which gives:

# A tibble: 10 x 4
   inchikey                             Level      Classification           CHEMONT          
   <chr>                                <chr>      <chr>                    <chr>            
 1 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N kingdom    Organic compounds        CHEMONTID:0000000
 2 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N superclass Organic oxygen compounds CHEMONTID:0004603
 3 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N class      Organooxygen compounds   CHEMONTID:0000323
 4 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N subclass   Ethers                   CHEMONTID:0000254
 5 InChIKey=NVJUHMXYKCUMQA-UHFFFAOYSA-N direct     Dialkyl ethers           CHEMONTID:0001167
 6 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N kingdom    Organic compounds        CHEMONTID:0000000
 7 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N superclass Organic oxygen compounds CHEMONTID:0004603
 8 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N class      Organooxygen compounds   CHEMONTID:0000323
 9 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N subclass   Ethers                   CHEMONTID:0000254
10 InChIKey=YCVHIAQANWEUFE-UHFFFAOYSA-N direct     Dialkyl ethers           CHEMONTID:0001167
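
Note that the Level column in the output above reads `direct` rather than `direct_parent`. This is most likely because tidyr::separate, by default, splits on any run of non-alphanumeric characters, so a column name such as `direct_parent.name` is cut at the underscore as well as at the dot. A small sketch of the difference, assuming the column names have the `<level>.name` form produced by the snippet above (the `cols` tibble here is a hypothetical two-row example):

```r
library(tidyr)
library(tibble)

# Hypothetical column names of the kind produced by tidyjson::spread_all()
cols <- tibble::tibble(name = c("kingdom.name", "direct_parent.name"))

# Default separator: "direct_parent.name" yields three pieces, the third is
# dropped (with a warning, suppressed here), and Level becomes "direct".
default_split <- suppressWarnings(
  tidyr::separate(cols, col = name, into = c("Level", "TYPE"))
)

# Splitting on the literal dot only keeps the full level name.
dot_split <- tidyr::separate(
  cols, col = name, into = c("Level", "TYPE"), sep = "\\."
)
```

If the full level name is wanted in the final table, passing sep = "\\." to both tidyr::separate calls in the snippet would be one way to get it.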

aberHRML / classyfireR

classification based on SMILES #40