airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Expected behavior of `facets` on array of string #617

Closed bussec closed 6 months ago

bussec commented 2 years ago

What is the expected behavior when an ADC API query requests aggregation (via the facets request parameter) on a field that holds an array of strings, e.g., study.keywords_study? For example, if such a field holds the array ['A', 'B', 'D'], should the aggregation

  1. Increase the count for each of the strings independently {'A':1, 'B':1, 'D':1}, or
  2. Count the joint occurrence of the strings {'A,B,D':1}?

Note that the example provided in the docs does not match this case 1:1 as pcr_target is an array of objects, and pcr_target_locus contains only a single string.

schristley commented 2 years ago

> Note that the example provided in the docs does not match this case 1:1 as pcr_target is an array of objects, and pcr_target_locus contains only a single string.

Conceptually they are the same. In the example, because it is always the same field of pcr_target_locus, the rest of the object is irrelevant, and it essentially collapses to an array of strings. Another way to answer the question though is:

> 1. Increase the count for each of the strings independently {'A':1, 'B':1, 'D':1}, or

This one, because it is the most useful and most common statistic, and it is easy for a client program to parse.

> 2. Count the joint occurrence of the strings {'A,B,D':1}?

This is interesting too but more specialized and not easy for a client to parse. This is better handled by adding a filter to the facet query, i.e. (A and B and D).
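Both options can be sketched in a few lines of Python. This is only an illustration, assuming repertoires are plain dicts and that `keywords` stands in for an array-of-strings field such as study.keywords_study; the function names and the sample records are hypothetical and not part of the ADC API.

```python
from collections import Counter

# Hypothetical repertoire records; "keywords" stands in for an
# array-of-strings field such as study.keywords_study.
repertoires = [
    {"id": 1, "keywords": ["A", "B", "D"]},
    {"id": 2, "keywords": ["A", "D"]},
    {"id": 3, "keywords": ["A", "B", "D"]},
]


def facet_independent(records, field):
    """Option 1: each distinct string in the array counts once per record."""
    counts = Counter()
    for record in records:
        # set() so a string repeated within one array is not counted twice
        counts.update(set(record.get(field) or []))
    return dict(counts)


def facet_joint(records, field):
    """Option 2: the whole array is counted as one joint value."""
    counts = Counter()
    for record in records:
        counts[",".join(record.get(field) or [])] += 1
    return dict(counts)


independent = facet_independent(repertoires, "keywords")
joint = facet_joint(repertoires, "keywords")
print(independent)
print(joint)
```

With option 1 the client gets one count per value and can read it directly; with option 2 it would first have to split the joined keys again.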

bussec commented 2 years ago

> In the example, because it is always the same field of pcr_target_locus, the rest of the object is irrelevant, and it essentially collapses to an array of strings.

Where is the collapsing performed? What happens if there are multiple pcr_target_locus records? It is nice if Mongo handles these things automatically, but for sciReptor we need to recreate this behavior explicitly.

> This is interesting too but more specialized and not easy for a client to parse. This is better handled by adding a filter to the facet query, i.e. (A and B and D).

According to the GDC documentation, this is not possible (see limitation 2):

https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#facets

Otherwise I am fine with the behavior described in option 1.

schristley commented 2 years ago

> > In the example, because it is always the same field of pcr_target_locus, the rest of the object is irrelevant, and it essentially collapses to an array of strings.

> Where is the collapsing performed? What happens if there are multiple pcr_target_locus records? It is nice if Mongo handles these things automatically, but for sciReptor we need to recreate this behavior explicitly.

Let me explain with an SQL example, if that helps. It depends on how you store pcr_target records, but let's assume the standard relational design of having them in their own table. Thus, the pcr_target table has some identifier fields, the pcr_target_locus field and the forward/reverse primer location fields. A query asking whether there is a TRB locus:

SELECT * FROM samples s, pcr_target p WHERE s.id = p.id AND p.pcr_target_locus = 'TRB'

I'm ignoring the specifics about the id for joining the tables and restricting to a specific repertoire/sample. If this query returns no records for a specific repertoire/sample, then there is no TRB locus for the sample. If this query returns one or more records, then there is a TRB locus for the sample, no matter how many pcr_target records match.

Combining a query like that with GROUP BY, DISTINCT and COUNT can get you pretty close to the facets result.
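As a rough sketch of that idea, here is one way it could look with SQLite from Python. The table layout, the column names (`samples.repertoire_id`, `pcr_target.sample_id`) and the rows are assumptions made up for illustration; they are not the sciReptor schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per sample, one row per pcr_target record.
    CREATE TABLE samples (id INTEGER PRIMARY KEY, repertoire_id TEXT);
    CREATE TABLE pcr_target (sample_id INTEGER, pcr_target_locus TEXT);
    INSERT INTO samples VALUES (1, 'r1'), (2, 'r2'), (3, 'r3');
    INSERT INTO pcr_target VALUES
        (1, 'TRB'), (1, 'TRB'), (1, 'TRA'),  -- duplicate TRB on purpose
        (2, 'TRB'),
        (3, 'IGH');
""")

# GROUP BY gives one row per locus value; COUNT(DISTINCT ...) makes sure a
# repertoire with several matching pcr_target records is counted only once.
rows = conn.execute("""
    SELECT p.pcr_target_locus, COUNT(DISTINCT s.repertoire_id) AS n
    FROM samples s
    JOIN pcr_target p ON p.sample_id = s.id
    GROUP BY p.pcr_target_locus
    ORDER BY p.pcr_target_locus
""").fetchall()

facet = [{"pcr_target_locus": locus, "count": n} for locus, n in rows]
print(facet)
```

Repertoire r1 has two TRB records but contributes only 1 to the TRB count, which is the "collapsing" asked about above.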

> > This is interesting too but more specialized and not easy for a client to parse. This is better handled by adding a filter to the facet query, i.e. (A and B and D).

> According to the GDC documentation, this is not possible (see limitation 2):

> https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/#facets

I was never sure why the GDC had that limitation, but the ADC does not have it. You should be able to use any filter together with facets. All the filter does is restrict the query to a subset of repertoire records, so it is essentially independent of the facets operation.
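For illustration, a request body that combines a filter with facets might be built like this. It is a sketch only: the `op`/`content`/`field`/`value` layout follows my reading of the ADC API docs, the helper `make_facets_query` is hypothetical, the values A, B and D are the placeholders from this thread, and whether `=` or another operator is the right one for matching a single element of an array field is an assumption to check against the docs.

```python
import json


def make_facets_query(facet_field, filters=None):
    """Build a request body with a facets field and an optional filter."""
    query = {"facets": facet_field}
    if filters is not None:
        query["filters"] = filters
    return query


# (A and B and D) on the keywords field. The "=" operator for matching one
# element of an array field is an assumption, not confirmed by the docs.
keyword_filter = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "study.keywords_study", "value": value}}
        for value in ("A", "B", "D")
    ],
}

query = make_facets_query("sample.pcr_target.pcr_target_locus", keyword_filter)
body = json.dumps(query, indent=2)
print(body)
```

The filter and the facets field sit side by side in the body, which reflects that the filter only selects the repertoires and the facets operation then runs on that subset.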

In the SQL world, that may mean you need to chain SELECT statements, i.e. one SELECT to do the filtering and another SELECT which operates on the first's results to do the facets.
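A minimal sketch of such chaining, again with SQLite from Python. The three tables and their contents are hypothetical and only meant to show the shape of the query: the inner SELECT applies the filter (A and B and D), and the outer SELECT computes the facets on whatever repertoires survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema, not taken from any real repository.
    CREATE TABLE repertoire (id TEXT PRIMARY KEY);
    CREATE TABLE keyword (repertoire_id TEXT, keyword TEXT);
    CREATE TABLE pcr_target (repertoire_id TEXT, pcr_target_locus TEXT);
    INSERT INTO repertoire VALUES ('r1'), ('r2'), ('r3');
    INSERT INTO keyword VALUES
        ('r1', 'A'), ('r1', 'B'), ('r1', 'D'),
        ('r2', 'A'), ('r2', 'D'),
        ('r3', 'A'), ('r3', 'B'), ('r3', 'D');
    INSERT INTO pcr_target VALUES
        ('r1', 'TRB'), ('r2', 'TRB'), ('r3', 'IGH'), ('r3', 'TRB');
""")

rows = conn.execute("""
    -- Outer SELECT: the facets step, one count per locus value.
    SELECT p.pcr_target_locus, COUNT(DISTINCT p.repertoire_id) AS n
    FROM pcr_target p
    WHERE p.repertoire_id IN (
        -- Inner SELECT: the filter step, repertoires having A and B and D.
        SELECT k.repertoire_id
        FROM keyword k
        WHERE k.keyword IN ('A', 'B', 'D')
        GROUP BY k.repertoire_id
        HAVING COUNT(DISTINCT k.keyword) = 3
    )
    GROUP BY p.pcr_target_locus
    ORDER BY p.pcr_target_locus
""").fetchall()

facet = dict(rows)
print(facet)
```

Repertoire r2 lacks keyword B, so it is dropped by the inner SELECT and does not contribute to the TRB count.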

scharch commented 1 year ago

@bussec can this be closed?

bussec commented 1 year ago

No, this information needs to be included in the docs (especially the difference from the GDC).

bcorrie commented 6 months ago

I have updated the Docs to reflect the GDC difference. Closing this issue.