Closed lparsons closed 8 months ago
Just noting this comment, where I have 2 concerns:
1.1.1. in the outline I created)
I should also note that we should correct previously fudged dates in the cold-exposure data.
I think that there are 2 distinct issues here, one of which should be refined. The 2 issues are:
The problem is that I don't think we can implement both 1 and 2 at the same time with only 1 sample name column. Let me explain... We're not keen on modifying the accucor/isocorr files, so we can't edit those to remove the trailing "_pos" (for example). What I think we need is something like an extra column in the sample sheet to associate 1 sample with multiple accucor/isocorr column headers (in different accucor/isocorr files, but coming possibly from the same instrument run). Those names are coming from the same raw file and I think that the sample names are modified (appended to) in those instances to be able to perform a single instrument run with unique names.
So either item 1 should be modified to be able to associate multiple differing names with a single sample row, or item 2 should be modified to allow multiple rows that differ only by the accucor/isocorr sample name from the column header (by adding another column, so that we have both "sample name" and "accucor/isocorr sample name column header", or whatever succinct name makes sense).
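To make the second option concrete, here is a minimal sketch of how a sheet with one extra column could map several accucor/isocorr headers onto a single sample. The column names `Sample Name` and `Accucor/Isocorr Sample Header` and the function `build_header_map` are hypothetical, chosen only for illustration; they are not existing loader code.

```python
# Hypothetical column names; the real sample sheet headers may differ.
SAMPLE_COL = "Sample Name"
HEADER_COL = "Accucor/Isocorr Sample Header"


def build_header_map(rows):
    """Map each accucor/isocorr column header to its sample name.

    A sample name may appear on several rows, as long as every header
    points to exactly one sample.
    """
    header_to_sample = {}
    for row in rows:
        sample = row[SAMPLE_COL].strip()
        # Fall back to the sample name when no separate header is given.
        header = (row.get(HEADER_COL) or sample).strip()
        existing = header_to_sample.get(header)
        if existing is not None and existing != sample:
            raise ValueError(
                f"Header {header!r} is assigned to both {existing!r} and {sample!r}"
            )
        header_to_sample[header] = sample
    return header_to_sample


rows = [
    {SAMPLE_COL: "liver_01", HEADER_COL: "liver_01_pos"},
    {SAMPLE_COL: "liver_01", HEADER_COL: "liver_01_neg"},
    {SAMPLE_COL: "serum_02", HEADER_COL: ""},
]
header_map = build_header_map(rows)
```

With this shape the accucor/isocorr files stay untouched, and the trailing "_pos"/"_neg" only ever appears in the extra column.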
This seems to have some overlap with #687. That would be item 2 from above. I think I see the distinction now, upon re-reading...
ensure [...] that the sample sheet details are either new samples or match existing ones exactly (all fields, including the animal fields).
This issue refers to previously loaded samples, as if a separate load loaded them, but that issue is essentially the same. The load script loads as it goes, so conflicts should always be caught in the same way whether there are duplicates in the supplied sheet or previously loaded by a separate load run (though it helps to know the difference). That's the overlap with #687.
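A rough sketch of the "load as it goes" behavior described above, assuming samples are compared as plain dictionaries. The field names (`name`, `animal`, `tissue`) and both functions are hypothetical stand-ins, not the actual load script:

```python
def classify_sample(record, known):
    """Return "new", "match", or "conflict" for one sample sheet record.

    `known` maps sample name -> previously seen record, whether that record
    came from an earlier row of the same sheet or from an earlier load.
    """
    existing = known.get(record["name"])
    if existing is None:
        return "new"
    return "match" if existing == record else "conflict"


def check_sheet(records, previously_loaded):
    """Return the names of sample sheet records that conflict."""
    known = {r["name"]: r for r in previously_loaded}
    conflicts = []
    for record in records:
        status = classify_sample(record, known)
        if status == "new":
            known[record["name"]] = record  # load as we go
        elif status == "conflict":
            conflicts.append(record["name"])
    return conflicts


loaded = [{"name": "s1", "animal": "a1", "tissue": "liver"}]
sheet = [
    {"name": "s1", "animal": "a1", "tissue": "liver"},  # matches earlier load
    {"name": "s2", "animal": "a1", "tissue": "serum"},  # new
    {"name": "s2", "animal": "a2", "tissue": "serum"},  # conflicts with row above
]
conflicts = check_sheet(sheet, loaded)
```

Because new records are added to `known` as they are seen, a duplicate within the sheet and a duplicate from a separate load run are caught by the same comparison.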
I agree that this issue might be better broken into multiple issues and needs some implementation details added. I can also see that the issues could be clarified a bit. The overall process as I see it should be:
- load_animals_and_samples command should ensure that we don't create records that are identical except for the name attribute. This is covered in #688 and #687.
- load_accucor_msruns command should include a new required argument of the associated animal/sample sheet. A few checks should be made. This is the part covered by this issue (at least that was my intent).
- load_animals_and_samples. Any mismatch at this point should throw an error.
The load_accucor_msruns command should include a new required argument of the associated animal/sample sheet.
I have a few things to note about this. And my assumption is that the intention of requiring the sample sheet with the loader is to make the association between the samples and the accucor sample headers.
- load_study using atomic transactions. Do we want to require those other load files with the sample table load script? If not, then I think we should solve the problem in a consistent manner.
- "sample alias" proposed in comments in #687 to the MSRun(*) record in the database, then we can avoid solving the same association issue in multiple different ways.
So IMHO, let's add a name field to MSRun (or equivalent table) for the sample header value. That way, the accucor load script can remain decoupled from the parsing and loading of other file types.
I'm uncertain about the utility of the sample alias field in the MSRun record, especially if we add the column to the samplesheet with accucor sample name.
- Yes, I think requiring a copy of the sample sheet for each separate submission of an accucor file is reasonable and worth the price to ensure we associate things with the correct sample.
- I believe that sample naming is much more difficult than compound, tissue, or protocol naming. There is a limited set of those items and they are carefully named. Researchers create new samples frequently and have little to guide them in naming the samples. Since accucor files contain only a sample name, it seems prudent to take extra steps to ensure accuracy.
- I'm not sure I would like to rely on a previously entered alias name to ensure the sample association is correct. I would rather check to ensure that the sample the accucor file refers to matches an existing record, including the extra attributes.
I'm uncertain about the utility of the sample alias field in the MSRun record, especially if we add the column to the samplesheet with accucor sample name.
I am having a hard time keeping the various overlapping discussion threads straight. My comment on the submission form PR partially relates to your comment here. It took me 5 minutes to find it so I could link a comment above in that comment.
load_study, and in fact avoids incorrect associations by not introducing the opportunity to modify data when duplicating sample data from a previous submission or modifying an old submission to add new data. I think perhaps you're only considering the initial submission and not the re-analysis of previously unanalyzed data associated with samples in a previous submission, which causes the problems I'm trying to point out.
Regarding the utility of the sample alias field, I have made this point before. We don't have a succinct way of referring to an MSRun record. It serves the same utility that LCMethod.name serves. LCMethod.name doesn't contain data that doesn't exist in any other field, but it is a short way of referring to that record.
You may be thinking of uniqueness, which is unnecessary given the mzXML file record uniqueness. The sample data header is (almost?) always the same as the mzXML file name. It just doesn't have the extension.
But given my proposal in #706, which conveys the association via an LCMS metadata file, I think that having the header in the record is not essential. Plus, it's not necessarily globally unique either. I just think it's a decent shorthand way of referring to an MSRun record.
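As an illustration of the header/file-name relationship mentioned above, a small sketch that checks whether a sample data header equals an mzXML file name minus its extension. The function name and the example path are made up for this sketch:

```python
from pathlib import PurePath


def header_matches_mzxml(header, mzxml_path):
    """True when the header equals the file name without its extension."""
    return PurePath(mzxml_path).stem == header


same = header_matches_mzxml("liver_01_pos", "runs/liver_01_pos.mzXML")
different = header_matches_mzxml("liver_01", "runs/liver_01_pos.mzXML")
```

Given the "(almost?)" caveat, a check like this would be better suited to a warning than to a hard error.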
FEATURE REQUEST
Inspiration
When loading an Accucor/Isocor file, we only have a sample name to look up. While we require sample names to be unique in the database, there is no check that the sample details match, since we don't have those details, so there is a risk that we will associate the data with the wrong sample.
Description
As a best practice, we should require that accucor/isocor submissions be accompanied by an animals/sample sheet. We should then ensure that the samples indicated in the accucor/isocor files match those in the sample sheet and that the sample sheet details are either new samples or match existing ones exactly (all fields, including the animal fields).
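The first half of that check (every sample in the accucor/isocor file must appear in the accompanying sample sheet) could look roughly like the sketch below. The function name and the example sample names are hypothetical, and reading the headers out of the actual files is left out:

```python
def missing_from_sheet(accucor_headers, sheet_sample_names):
    """Return accucor/isocor sample headers that have no sample sheet entry."""
    known = set(sheet_sample_names)
    return sorted(h for h in accucor_headers if h not in known)


headers = ["liver_01", "serum_02", "brain_03"]
sheet_names = ["liver_01", "serum_02"]
unmatched = missing_from_sheet(headers, sheet_names)
if unmatched:
    print("Samples missing from the sample sheet:", ", ".join(unmatched))
```

A loader would presumably raise an error instead of printing, so that nothing is associated with a sample the submitter did not describe.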
Alternatives
A brief description of any alternative features that could accomplish the same ultimate goal, either for consideration or considered and rejected.
Dependencies
This issue cannot be started until the completion of the following issue(s):
<issue number 1>
<issue number 2>
Comment
Add any other context or screenshots about the feature request here.
ISSUE OWNER SECTION
Assumptions
Requirements
Limitations
Affected Components
A tentative list of anticipated repository items that will be changed, labeled with "add", "delete", or "change". One item per line. (Mostly, this will be a list of files.)
DESIGN
Interface Change description
Describe changes to usage. E.g. GUI/command-line changes
Code Change Description
Describe code changes planned for the feature. (Pseudocode encouraged)
Tests