airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

ADC API junction_aa (and other AA) queries - force upper case? #528

Closed bcorrie closed 8 months ago

bcorrie commented 3 years ago

Currently the ADC API does not (nor does the AIRR spec) say anything about AA and case. Maybe it shouldn't, but we have seen some users typing lower case AA strings. They don't find anything because they are searching for an exact substring match and our AA fields are upper case.

Is it reasonable to force/suggest a standard here? We load AA strings as the annotation tools provide them, which is I think always upper case, so was going to make the Gateway convert all to upper case, but not sure that is the right thing to do...

We don't comment on this here: https://docs.airr-community.org/en/stable/datarep/rearrangements.html

Thoughts?

bcorrie commented 3 years ago

Maybe the better way to think of this is make it case independent... not force one or the other?

bcorrie commented 8 months ago

This makes it complicated for the repositories in a way, as string searchers become complicate, but perhaps not really. The repositories can force this one way or the other internally (store CAP AAs, convert any queries into CAPS) and can then search efficiently. Although we don't force this at the moment in the iReceptor Turnkey, we could, and this issue would I believe go away.

@schristley any objections to not forcing anything, realizing that we might want to optimize this on the repository side (if we aren't already)?

schristley commented 8 months ago

@schristley any objections to not forcing anything, realizing that we might want to optimize this on the repository side (if we aren't already)?

I'm tempted to force uppercase for the repository but allow flexibility for interface. By that I mean the repositories store as uppercase and return the data in uppercase, but queries can be mixed case but are turned into uppercase while performing the query. What I don't want is for data to be stored and/or returned in mixed case because that might cause problems down the road.

The convention is uppercase but there's bound to be some tool sooner or later that doesn't do that. Furthermore, you do see mixed case a lot as you get closer to genomic data.

bcorrie commented 8 months ago

@schristley any objections to not forcing anything, realizing that we might want to optimize this on the repository side (if we aren't already)?

I'm tempted to force uppercase for the repository but allow flexibility for interface. By that I mean the repositories store as uppercase and return the data in uppercase, but queries can be mixed case but are turned into uppercase while performing the query. What I don't want is for data to be stored and/or returned in mixed case because that might cause problems down the road.

100% agree. That is what I think we would do as well. An impact on data loading into a repository, as the AA's would need to be converted, but that is a one time cost and makes query optimization much easier. Repository then returns what it has. Repository also converts any AA queries that it gets that are not in the correct case for the repository into the correct case.

I am not 100% sure but I think the Gateway does this translation already, so any queries from the Gateway will be uppercase, even if the user types lower or mixed case.

bcorrie commented 8 months ago

So I think there is no change required to the spec.

Although I don't think this needs to be documented for the end user, this decision should probably be documented for repository and API builders. I will add a comment in the docs...

schristley commented 8 months ago

Interesting enough, LinkML allows patterns to be defined, and this is a situation where that would be useful because besides wanting to enforce the uppercase convention, only certain letters should be in junction_aa, specifically the letters for amino acids.

JSON schema also supports patterns as well, so we could technically add a pattern to the spec to define this constraint. We've not been using them in the AIRR spec though, so we might want to be cautious about adding before we do some testing with our tool chains. I can think of two scenarios that we need to think more

  1. unproductive rearrangements sometimes have a junction_aa provided with invalid letters like stop codon.
  2. some tools might produce junction_aa going beyond the standard letters and use the letters that cover multiple amino acids. For example, B for Aspartic acid D or Asparagine N. if a user gives a query using B then they might expect it to return results with D and N as well.

Putting my AKC hat on, this falls under both the data modeling work and the validation work. We will want these stricter validation rules in the AKC and as part of the iterative process, and push stuff back into AIRR spec that make sense.

Also the use case of querying with B as described above is something that I intend the AKC to support, but note that will be on Receptors, not on Rearrangement's junction_aa. I don't think the ADC needs to necessarily support that.

schristley commented 8 months ago

@bcorrie this issue is specifically about uppercase and documenting the API, and what you've done is probably sufficient for the V2 and closing it. I'd like not to lose some of my thoughts though, what would you suggest as a process? I'm thinking that we create AK issues to track and later create new AIRR issues when we are ready to push back proposed changes.

bcorrie commented 8 months ago

Docs added, closing this issue.