Match accucor/isocor sample names with supplied sample sheet

lparsons commented 1 year ago

FEATURE REQUEST

Inspiration

When loading an Accucor/Isocor file, we only have a sample name to lookup. While we require sample names to be unique in the database, there is no check that the sample details match since we don't have those details, so there is a risk that we will associate the data with wrong sample.

Description

As a best practice, we should require that accucor/isocor submissions be accompanied by an animals/sample sheet. We should then ensure that the samples indicated in the accucor/isocor files match those in the sample sheet and that the sample sheet details are either new samples or match existing ones exactly (all fields, including the animal fields).

Alternatives

A brief description of any alternative features that could accomplish the same ultimate goal, either for consideration or considered and rejected.

Dependencies

This issue cannot be started until the completion of the following issue(s):

<issue number 1>
<issue number 2>

Comment

Add any other context or screenshots about the feature request here.

ISSUE OWNER SECTION

Assumptions

List of assumptions that the code will not explicitly address/check
E.g. We will assume input is correct (explaining why there is no validation)

Requirements

[ ] 1. List of numbered conditions to be met for the feature
[ ] 2. E.g. Every column/row must display a value, i.e. cannot be empty
[ ] 3. Numbers for reference & checkboxes for progress tracking

Limitations

A list of things this work will specifically not do
E.g. This feature will only handle the most frequent use case X

Affected Components

A tentative list of anticipated repository items that will be changed, labeled with "add", "delete", or "change". One item per line. (Mostly, this will be a list of files.)

change: File path or DB table ...
add: Environment variable or server setting
delete: External executable or cron job

DESIGN

Interface Change description

Describe changes to usage. E.g. GUI/command-line changes

Code Change Description

Describe code changes planned for the feature. (Pseudocode encouraged)

Tests

[ ] 1. A description of at least one test for each requirement above.
[ ] 2. E.g. Test for req 2 that there's an exception when display value is ''
[ ] 3. Numbers for reference & checkboxes for progress tracking

hepcat72 commented 1 year ago

Just noting this comment, where I have 2 concerns:

Conflating 2 different samples that were from the same researcher/day/protocol that was previously caught and verified via an exception (case 1.1.1. in the outline I created)
It is and has been possible we have silently conflated different samples from different days/researchers/protocols as the same sample, because there's no check that the sample in an accucor file corresponds to a sample in the sample sheet

I should also note that we should correct previously fudged dates in the cold-exposure data.

hepcat72 commented 10 months ago

I think that there are 2 distinct issues here, one of which should be refined. The 2 issues are:

Sample names in accucor/isocorr column headers must be matched to the sample names in the sample sheet [this is technically already happening - the problem is the duplicate samples with different names, and that originates from the fact that the name in the *cor column headers must be unique in a mass spec instrument run and when a sample is run multiple times in different modes in the same instrument run, they add those modes to the sample names to make them unique so that the same samples run in different modes can be distinguished. This requires that we have to match a modified sample name to a sample sheet that represents true/actual samples without those modifications]
Check that the samples in the sample sheet are truly unique and don't allow a sample differing only by name to exist without error.

The problem is that I don't think we can implement both 1 and 2 at the same time with only 1 sample name column. Let me explain... We're not keen on modifying the accucor/isocorr files, so we can't edit those to remove the trailing "_pos" (for example). What I think we need is something like an extra column in the sample sheet to associate 1 sample with multiple accucor/isocorr column headers (in different accucor/isocorr files, but coming possibly from the same instrument run). Those names are coming from the same raw file and I think that the sample names are modified (appended to) in those instances to be able to perform a single instrument run with unique names.

So either item 1 should be modified to be able to associate multiple differing names with a single sample row, or modify item 2 to allow multiple rows that only differ by the accucor/isocorr sample name from the column header (by adding another column to have both "sample name" and "accucor/isocorr sample name column header" (or whatever succinct name makes sense).

hepcat72 commented 10 months ago

This seems to have some overlap with #687. That would be item 2 from above. I think I see the distinction now, upon re-reading...

ensure [...] that the sample sheet details are either new samples or match existing ones exactly (all fields, including the animal fields).

This issue refers to previously loaded samples, as if a separate load loaded them, but that issue is essentially the same. The load script loads as it goes, so conflicts should always be caught in the same way whether there are duplicates in the supplied sheet or previously loaded by a separate load run (though it helps to know the difference). That's the overlap with #687.

lparsons commented 10 months ago

I agree that this issue might be better broken into multiple issues and needs some implementation details added. I can also see that the issues could be clarified a bit. The overall process as I see it should be:

The load_animals_and_samples command should ensure that we don't create records that are identical except for the name attribute. This is covered in #688 and #687.
The load_accucor_msruns command should include a new required argument of the associate animal/sample sheet. A few checks should be made. This is the part covered by this issue (at least that was my intent).
- Ensure that each sample in the accucor file can be matched to a sample in the sample sheet. This may require some "fuzzy matching" of sorts, but there should be no ambiguity as to which row in the sample sheet is associated with the column in the accocor file.
- Ensure that the samples in the sample sheet match existing samples that were previously loaded with load_animals_and_samples. Any mismatch at this point should throw and error.

hepcat72 commented 10 months ago

The load_accucor_msruns command should include a new required argument of the associate animal/sample sheet.

I have a few things to note about this. And my assumption is that the intention of requiring the sample sheet with the loader is to make the association between the samples and the accucor sample headers.

If subsequent instrument runs re-analyze samples used in a previous instrument run / data load, but also add samples, do we duplicate the sample sheet to edit in more samples or add to the previous loaded sample sheet and re-use it?
This problem (the association of new data with previously loaded or "dependent" data) is not restricted/inherent to the sample/header association. It also applies to sample and animal data associated with other load types (tissues, protocols, and compounds), and those are already solved in load_study using atomic transactions. Do we want to require those other load files with the sample table load script? If not, then I think we should solve the problem in a consistent manner.
If we add the "sample alias" proposed in comments in #687 to the MSRun(*) record in the database, then we can avoid solving the same association issue in multiple different ways.

So IMHO, let's add a name field to MSRun(/or equivalent table) for the sample header value. That way, the accucor load script can remain decoupled from the parsing and loading of other file types.

lparsons commented 10 months ago

Yes, I think requiring a copy of the sample sheet for each separate submission of an accucor file is reasonable and worth the price to ensure we associate things with the correct sample.
I believe that sample naming is much more difficult that compound, tissue, or protocol naming. There are a limited set of those items and they are carefully named. Researchers create new sample frequently and use little to guide them in naming the samples. Since accucor files contain only a sample name, it seems prudent to take extra steps to ensure accuracy.
I'm not sure I would like to rely on a previously entered alias name to ensure the sample association is correct. I would rather check to ensure that the sample the accucor file refers to matches an existing record, including the extra attributes.

I'm uncertain about the utility of the sample alias field in the MSRun record, especially if we add the column to the samplesheet with accucor sample name.

hepcat72 commented 10 months ago

Yes, I think requiring a copy of the sample sheet for each separate submission of an accucor file is reasonable and worth the price to ensure we associate things with the correct sample.

I believe that sample naming is much more difficult that compound, tissue, or protocol naming. There are a limited set of those items and they are carefully named. Researchers create new sample frequently and use little to guide them in naming the samples. Since accucor files contain only a sample name, it seems prudent to take extra steps to ensure accuracy.

I'm not sure I would like to rely on a previously entered alias name to ensure the sample association is correct. I would rather check to ensure that the sample the accucor file refers to matches an existing record, including the extra attributes.

I'm uncertain about the utility of the sample alias field in the MSRun record, especially if we add the column to the samplesheet with accucor sample name.

I am having a hard time keeping the various overlapping discussion threads straight. My comment on the submission form PR partially relates to your comment here. It took me 5 minutes to find it so I could link a comment above in that comment.

With my design proposal for #706 (mentioned in that linked comment) proposes a separate LCMS metadata file and references this comment above as to why.
Sample naming is indeed more frequent and difficult, but I will point out that the name contained in the accucor file is not technically the sample name and there are numerous extant cases in the database now that demonstrate that. My proposal in #706 just changes the animal/sample table input requirement to a separate LCMS metadata file. Think of it as a linking table between samples and the accucor file. It associated those sample data headers (which can differ from the sample name) without cluttering the sample table and without need to duplicate sample data in a separate file or modify the sample table of a previous submission. I totally agree that the association needs to be accurate. I'm simply proposing we do it in a separate file to make those accommodations unnecessary.
This is already accomplished via load_study, and in fact avoids incorrect associations by not introducing the opportunity to modify data when duplicating sample data from a previous submission or modifying an old submission to add new data. I think perhaps, you're possibly only considering the initial submission and not the re-analaysis of previously unanalyzed data associated with samples in a previous submission, which causes the problems I'm trying to point out.

Regarding the utility of the sample alias field, I have made this point before. We don't have a succinct way of referring to an MSRun record. It serves the same utility that LCMethod.name serves. LCMethod.name doesn't contain data that doesn't exist in any other field, but it is a short way of referring to that record.

You may be thinking of uniqueness, which is unnecessary given the mxXML file record uniqueness. The sample data header is (almost?) always the same as the mxXML file name. It just doesn't have the extension.

But given my proposal in #706, which conveys the association via an LCMS metadata file, I think that having the header in the record is not essential. Plus, it's not necessarily globally unique either. I just think it's just a decent shorthad way of referring to an MSRun record.

Princeton-LSI-ResearchComputing / tracebase