Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License
4 stars 1 forks source link

Add proper support for many-to-many field searches #1268

Open hepcat72 opened 1 month ago

hepcat72 commented 1 month ago

FEATURE REQUEST

Inspiration

The changes I made to gather all the related mzXML files together (when dealing with placeholder records) makes the advanced search behave in unexpected ways. I was concerned about this... AND FIXED IT, but it took all day and it is somewhat of a hack. Let me describe the problem (which is the result of a lack of support for many-to-many searches).

This all stems from the fact that to get the neighboring MSRunSample records of a placeholder, I am traversing the same table twice... And when I display the MSRunSample.ms_data_file.filename, I just skip the null one from the placeholder.

My testing on my sandbox DB is more discrete, and is easier to understand... It has 408 peak groups and 2 mzXMLs per group and no placeholders. Sorting doubles the rows to 816, so I'm pretty sure that I understand what's going on. Django requires sorting to be preceded by the fields involved being made "distinct", so however many files are involved, there will be that many rows, and since the relationship is 1:M, you get "more rows". I can "solve" that by just removing the ability to sort the results by that column, as sorting by other fields doesn't cause this.

So the problem is then just searching for "MZ Data Filename does not have a value". You get a result row for every peak group that links to a placeholder MSRunSample record, and since we have placeholders linked to the same sample as ones with mzxml files, you get all peakgroups with no mzxml plus all with a placeholder next to ones with mzxmls.

So essentially "does not have a value" doesn't work right for the mz data files. A work-around is to search for "MZ Data Filename does not end with .mzXML". Then you get 72070 records (which is correct: 72070 + 32637 = 104707).

The problem is that when there are multiple mzXML files, there will be a placeholder, so searching for peakgroups linked to an MSRunSample record that has no mzXML, you will always get those that have a placeholder...

I "fixed" this by editing the DataRepo.formats.dataformat_group_query.constructAdvancedQueryHelper method:

        if cmp == "isnull":
            if negate:
                negate = False
                val = False
            else:
                # val = False
                # TODO: The above was commented and changed to the below to handle many-to-many relationships This is a
                # case-specific fix (not comprehensive) for MSRunSample.ms_data_file.  The table is traversed twice to
                # get neighbor records, one of which can be a single null placeholder and the others can be numerous
                # non-null files.  And it is specific for when split_rows iin the format is False.  This basically says,
                # "find me all the non-null records and inverse it to return the complement".  This might be a durable
                # solution, but it is complex and probably *should* look for val = False and have a means to specify ALL
                # linked records are null.
                negate = True
                val = False

So that now:

If I just "showed" the None value in the list of mzXML files in each row, the original code would actually make sense. But what we really want here are MSRunSample records that only have a placeholder and no neighbors, but there's no way to create that query.

Description

What we need here is a way to query for a number of M:M related records using annotations. For fields that are M:M related, as is the MSRunSample.ms_data_file.filename, we can accomplish this by looking for "MZ Data Filename does not have a value and number of MSRunSamples = 1". Or, even better, would be something like "number of MSRunSamples whose MZ Data Filename has a value = 0".

Neither of these suggestions seems ideal in terms of design, but I think it conveys the idea. This needs an elegant understandable design.

Alternatives

None

Dependencies

None

Comment

None


ISSUE OWNER SECTION

Assumptions

  1. List of assumptions that the code will not explicitly address/check
  2. E.g. We will assume input is correct (explaining why there is no validation)

Limitations

  1. A list of things this work will specifically not do
  2. E.g. This feature will only handle the most frequent use case X

Affected Components

Requirements

DESIGN

Interface Change description

None provided

Code Change Description

None provided

Tests

mneinast commented 1 month ago

this is definitely complicated but I think I follow, and the hack to get it working is smart.

How much can this be avoided if we create the dedicated mzXMLs Advanced Search? Like maybe we tolerate the hack on the assumption that most users will use the dedicated mzXMLs Advanced Search to access them?

hepcat72 commented 1 month ago

this is definitely complicated but I think I follow, and the hack to get it working is smart.

How much can this be avoided if we create the dedicated mzXMLs Advanced Search? Like maybe we tolerate the hack on the assumption that most users will use the dedicated mzXMLs Advanced Search to access them?

Let me clarify... that this issue is not here to solve the mzXML search/sort problem. I just use that as an example of why I should make the advanced search support "annotations", e.g. the ability to query the number of related records (and to be able to sort without changing the number of rows). So this is a low-priority issue about underlying generic functionality that applies to all search formats and fields. It would also allow us to include columns that are annotations as a base feature, instead of that being just an aspect of the template that renders the tables.

That aside, I think that the hack is good because we don't have any other instances where we would want (for example) to see the placeholder records.

And incidentally, the "mzXMLs" "format"/search would not be affected by this issue.