Do not allow duplicate samples with different names

lparsons commented 1 year ago

FEATURE REQUEST

Inspiration

Biological samples are defined by the metadata associated with them. If two samples are identical in every way except the name, they are very likely errant duplicates.

Description

When loading sample records, if an incoming sample matches an existing sample exactly but has a different name, the loader should raise an exception and fail to load the sample.

Alternatives

Dependencies

Comment

If, at some point in the future, we discover a need to handle biological replicate samples, a replicate field should be added to the sample model to allow researchers to explicitly define replicates.

ISSUE OWNER SECTION

Assumptions

List of assumptions that the code will not explicitly address/check
E.g. We will assume input is correct (explaining why there is no validation)

Requirements

[ ] 1. List of numbered conditions to be met for the feature
[ ] 2. E.g. Every column/row must display a value, i.e. cannot be empty
[ ] 3. Numbers for reference & checkboxes for progress tracking

Limitations

A list of things this work will specifically not do
E.g. This feature will only handle the most frequent use case X

Affected Components

A tentative list of anticipated repository items that will be changed, labeled with "add", "delete", or "change". One item per line. (Mostly, this will be a list of files.)

change: File path or DB table ...
add: Environment variable or server setting
delete: External executable or cron job

DESIGN

Interface Change description

Describe changes to usage. E.g. GUI/command-line changes

Code Change Description

Describe code changes planned for the feature. (Pseudocode encouraged)

Tests

[ ] 1. A description of at least one test for each requirement above.
[ ] 2. E.g. Test for req 2 that there's an exception when display value is ''
[ ] 3. Numbers for reference & checkboxes for progress tracking

hepcat72 commented 1 year ago

As related to my comments on #684, I don't think we can just raise an error here. The same samples have different/multiple names in the accucor/isocorr files for reasons outside of our control. What we need to do is implement a way to make the association of multiple different names (e.g. with or without various suffixes like "_pos") from the headers with the single sample representation in the sample sheet.

I think this issue should be refined to deal with this in a way other than raising an error. If all we do is raise an error, we would either have to modify the accucor/isocorr headers or add another column for those headers in the sample sheet (and change the rules for preventing/loading duplicates in issue #684).

lparsons commented 1 year ago

I think we should endeavor to keep each part of loading loosely coupled with one another. Each piece should be responsible for ensuring that the specified records already exist (and match correctly) or are new records created properly. So, first ensure all of the animal records are created, if there is a problem, we can throw and error there. Then, ensure the samples are created properly. Finally, we'll move on to the data files.

Once we are loading the data files, you are correct that we need to match the records. My proposal is to require that the load_accucor_msruns command accept a sample sheet and require a match between accucor sample columns and rows in the sample sheet. At this point, we'll have smaller set of possible samples to search through and we can implement some "fuzzy matching" if we need. See my comment here: https://github.com/Princeton-LSI-ResearchComputing/tracebase/issues/684#issuecomment-1737584110

hepcat72 commented 1 year ago

I think we agree on the spirit of the changes I suggest to the issue, so the rest is an implementation/design discussion...

The following isn't a proposal. No need to respond. I'm just thinking this through out loud...

I had thought about some matching heuristics (like remove some set of known suffixes), but my concern about that is that it's fragile. One idea I had, which would possibly be a bit more robust, (I documented it either in a comment in the linked issue or some other issue - don't recall offhand which), was to implement a business rule that enforces a sample name cannot either contain or start with an existing sample name. But even that is flawed. How would we know when that is intentional versus unintentional... and what if a sample name without suffixes isn't present (e.g. the associated accucor file was thrown out)?

I think that an ideal solution would be to make the association intentional/explicit, and while I lament the loss in simplicity of the compilation of the submission, introducing data that makes the association would add both intentionality and robustness.

I'll think about it. Perhaps there's a way to obfuscate the issue, such as adding the header to sample association in the yaml (when necessary), or adding an optional sheet that records the association. In fact, we can still do the fuzzy mapping, but have it result in a sheet/file that can be editable (so that it's intentional).

lparsons commented 1 year ago

I think that an ideal solution would be to make the association intentional/explicit, and while I lament the loss in simplicity of the compilation of the submission, introducing data that makes the association would add both intentionality and robustness.

My feeling is that doing this would be simple to implement and very flexible. I would propose that we first attempt to match on the sample name column, if that matches exactly, great. If not, we allow for an optional column in the sample table that indicates the sample identifier used in the data file (Accucor file).

hepcat72 commented 1 year ago

Agree.

hepcat72 commented 1 year ago

I have struggled with what header to use for that column in the sample sheet... sample alias, sample data header, accucor/isocorr sample header, sample peak annotation header, or as you used above: sample identifier... none seem either succinct or intuitive. Perhaps sample alias along with some documentation would be the best.

And I'll also point out that we can either make that cell have a delimiter with multiple headers for different *cor files OR we can have it duplicate the sample rows...

lparsons commented 1 year ago

A header name that helps identify the label as being associated with data files is good. Something like sample data header or sample datafile id

I would probably allow a delimiter in that cell to handle cases where the sample maps to different names in different data files.

hepcat72 commented 11 months ago

Just FYI... I'm working on this, but there appear to be a lot of tests that run afoul of this restriction that sample metadata must not differ only by name, so this is going to be a bigger issue than I'd anticipated.

hepcat72 commented 11 months ago

OK, @Princeton-LSI-ResearchComputing/developers - I ran into a problem with this issue. Some samples will have the same metadata and different sample names when the tissue is "tumor_nonspecific". It appears that at least 1 study encoded multiple tumor locations/characteristics in the sample name:

Tum_Half_a146
Tum_Perif_a146
Tumor_half_a147
Tumor_perif_a147
Tum_Half_a148
Tum_Perif_a148
Tum_Half_a149
Tum_Perif_a149
Tumor_half_a150
Tumor_perif_a150
Tumor_half_a151
Tumor_perif_a151

A few of these are from the same date, researcher, time collected, animal, and of course: tissue. But if appears that perhaps there were different tumors involved perhaps?

Maybe, assuming we don't need to support multiple tumors from the same tissue, we could add an is_tumor boolean field, or have the tissue loader always create a duplicate record of each tissue whose name has _tumor appended (except for the tumor_unspecified tissue)?

lparsons commented 11 months ago

Thanks for looking at this @hepcat72. Some quick comments:

Perhaps we should consider cleaning up the tests before tackling this issue.
Multiple tumors from the same sample was not something we considered when we designed this check. Essentially, these are different tissue samples that we are not tracking.

While implementing this will help to reduce the chance of sample entry error, it's not clear that it is truly needed to proceed. Given the issues mentioned above, I feel like this issue deserves more thought and design time before implementing.

Princeton-LSI-ResearchComputing / tracebase