Chemical-Curation / chemcurator_django

Backend services for chemical curation
https://api.chemreg.epa.gov
MIT License
1 stars 0 forks source link

Fix test_defined_compound_filters and inchikey filter #224

Closed michael-barbour closed 3 years ago

michael-barbour commented 3 years ago

This is a working branch for fixing test_defined_compound_filters.

It appears as if this test is failing due for two reasons:

Issue one was resolved in this branch by converting SMILES to molfiles before every search.

Issue two is incredibly infrequent and would probably be much harder to track down. I ran each one of the params tests 5,000 times and had 6 failures. Each run through currently generates 2 compounds so the chance of failure per each gen is probably around .02%. Probably not worth much dev time fixing but worth being aware of.

I used pytest-repeat to rerun tests.

14994 passed
6 failed
         - chemreg/compound/tests/test_filters.py:6 test_defined_compound_filters[V3000-2104-5000]
         - chemreg/compound/tests/test_filters.py:6 test_defined_compound_filters[V3000-2512-5000]
         - chemreg/compound/tests/test_filters.py:6 test_defined_compound_filters[V3000-4251-5000]
         - chemreg/compound/tests/test_filters.py:6 test_defined_compound_filters[V2000-3358-5000]
         - chemreg/compound/tests/test_filters.py:6 test_defined_compound_filters[SMILES-1967-5000]
         - chemreg/compound/tests/test_filters.py:6 test_defined_compound_filters[SMILES-3181-5000]

Testing:

C([H])([H])([H])C(=O)N([H])C1C([H])=C([H])C(/C(/[H])=N/N([H])C(=S)N([H])C([H])([H])C([H])(C([H])([H])[H])C([H])([H])[H])=C([H])C=1[H] is an example of a SMILES that fails on staging, but passes on my branch. Create a defined compound with the above smile string then search for it.

Get:

curl --location --request GET '127.0.0.1:8000/definedCompounds?filter[smiles]=C([H])([H])([H])C(=O)N([H])C1C([H])=C([H])C(/C(/[H])=N/N([H])C(=S)N([H])C([H])([H])C([H])(C([H])([H])[H])C([H])([H])[H])=C([H])C=1[H]' \
--header 'Authorization: Basic a2FyeW46c3BlY2lhbFBANTV3b3Jk' \
--header 'Cookie: csrftoken=uYUMbbFCLDbtJE39gpJzQN5Ln1njZa3sp0JjtwdNHL4DVAnaiwa8jPp3CelFKFOS'

Create:

curl --location --request POST '127.0.0.1:8000/definedCompounds' \
--header 'Content-Type: application/vnd.api+json' \
--header 'Authorization: Basic a2FyeW46c3BlY2lhbFBANTV3b3Jk' \
--header 'Cookie: csrftoken=uYUMbbFCLDbtJE39gpJzQN5Ln1njZa3sp0JjtwdNHL4DVAnaiwa8jPp3CelFKFOS' \
--data-raw '{
    "data": {
        "type": "definedCompound",
        "attributes": {
            "smiles": "C([H])([H])([H])C(=O)N([H])C1C([H])=C([H])C(/C(/[H])=N/N([H])C(=S)N([H])C([H])([H])C([H])(C([H])([H])[H])C([H])([H])[H])=C([H])C=1[H]"
        }
    }
}'
michael-barbour commented 3 years ago

I don't think the test actually had to be changed. I first thought the issue was the way the url_encoding was happening regarding backslashes. If those can/should be rolled back, I would be fine with it.

debboutr commented 3 years ago

I like the idea of converting a SMILES to a molfileV3000 before making an inchikey cause that is how the inchikeys are created in the serializer to begin with, this begs the question though that we should also be converting the V2000 value in the filter to a V3000 before performing the search?

debboutr commented 3 years ago

@michael-barbour in your comment above, it seems like using the molfile string to be filtered that is provided in the URL doesn't need to be cleaned, i.e., mol.replace("-INDIGO-", "-CHEMREG-").replace(" ", "+").replace("\n", "%0A"), if we use it as a second argument to the client.get method? That's pretty sweet if I'm reading into this right.

michael-barbour commented 3 years ago

@michael-barbour in your comment above, it seems like using the molfile string to be filtered that is provided in the URL doesn't need to be cleaned, i.e., mol.replace("-INDIGO-", "-CHEMREG-").replace(" ", "+").replace("\n", "%0A"), if we use it as a second argument to the client.get method? That's pretty sweet if I'm reading into this right.

Yep! It extends Django's testing client. For GET requests the data dictionary is automatically parsed as query params.

The key-value pairs in the data dictionary are used to create a GET data payload. For example:

c = Client()
c.get('/customers/details/', {'name': 'fred', 'age': 7})

…will result in the evaluation of a GET request equivalent to:

/customers/details/?name=fred&age=7

Link to GET section

The one bit I was unsure about was the mol.replace("-INDIGO-", "-CHEMREG-") as that is an additional cleaning that we seem to be doing for test purposes.

debboutr commented 3 years ago

Rick Debbout 2:04 PM when we use a filter should we always convert to molfileV3000 before getting the inchikey? "/definedCompounds?filter[{field}]={molfile}" if the field above is V2000 or SMILES, we should convert to V3000 before getting the inchikey from Indigo, right?

Chris Grulke 2:13 PM Nope, it can be read in and searched without converting to v3000. Just directly export the inchikey.

I think this is going to prevent us from using: value = get_molfile_v3000(value)

in chemreg/compound/filters.py???

cmgrulke commented 3 years ago

You can convert it to a v3000 if it is easier, but it isn't required. Doing so involves two reads from the Indigo toolkit first from the v2000 or SMILES, then from the v3000 into the indigo chemical object model which does have minor differences from the structure storage formats (and therefore on the edges of chemistry can result in some minor inconsistencies in the understanding of the structure).