DEpt-metagenom / MetagenoMongo


Check for empty and duplicate sample id #22

Closed gpetho closed 2 months ago

gpetho commented 2 months ago

As discussed in today's meeting, just checking the data type of the sample id field is not sufficient. The validation should fail with an appropriate error message if the sample id is empty or if it is not unique within the current file, i.e. if there is another record in the table that has the same sample id. There are many ways to achieve this. The most efficient in terms of computation is probably putting the sample ids into a list, sorting it, and then iterating over its elements pairwise to check whether the element at i is equal to the element at i+1. A slightly less efficient but more transparent way to do it is to create a Counter object c (from the collections module) using the list of sample ids. The list of duplicate sample_ids is [key for key in c if c[key] > 1]. Implement the error message according to this.
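A minimal sketch of both approaches described above, assuming the sample ids have already been read into a plain list of strings. The function names are hypothetical and not taken from the repository. Note that the Counter variant makes a single pass over the list, whereas the sorting variant has to sort first, so for realistic table sizes either should be fast enough.

```python
from collections import Counter


def find_empty_and_duplicate_ids(sample_ids):
    """Return (row indices with an empty id, sorted list of duplicated ids)."""
    # Treat None and whitespace-only strings as empty.
    cleaned = ["" if s is None else str(s).strip() for s in sample_ids]
    empty_rows = [i for i, s in enumerate(cleaned) if not s]
    # Count only non-empty ids; empty ones are reported separately.
    counts = Counter(s for s in cleaned if s)
    duplicates = sorted(key for key in counts if counts[key] > 1)
    return empty_rows, duplicates


def find_duplicates_sorted(sample_ids):
    """Sort-and-compare variant: equal ids end up adjacent after sorting."""
    ordered = sorted(s for s in sample_ids if s)
    dupes = []
    for a, b in zip(ordered, ordered[1:]):
        # Record each duplicated id once, even if it occurs three or more times.
        if a == b and (not dupes or dupes[-1] != a):
            dupes.append(a)
    return dupes


ids = ["S1", "S2", "", "S1", "S3", "S2", "S2"]
empty_rows, duplicates = find_empty_and_duplicate_ids(ids)
if empty_rows:
    print(f"Empty sample id in rows: {empty_rows}")
if duplicates:
    print(f"Duplicate sample ids: {duplicates}")
```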

gpetho commented 2 months ago

UPDATE: As Levente reminded me in today's weekly meeting, checking for a duplicate sampleID is not the right solution. There are cases where two records legitimately have the same sample id, namely when they describe the results of different sequencing runs. In these cases the fields describing the properties of the sample being sequenced, including the sample id, should be identical, but the concatenation of run_directory and barcode needs to be different. (The run_directory may be the same for the two lines while the barcode differs, if the sample is sequenced twice during the same run. It is also possible that the run_directory differs while the barcode is the same, if the two samples happen to get the same barcode in different runs, since the barcode is only supposed to be unique within a run.)

So what should be passed to the counter is not the list of sample ids, but rather [s + d + b for s, d, b in zip(sample_ids, run_directories, barcodes)], assuming that sample_ids etc. are lists, or more generally iterables, over the elements of the sampleID, run_directory and barcode columns of the pandas df.
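A sketch of the composite-key check, assuming the three columns are available as plain lists of strings. The function name is hypothetical. One deliberate deviation from the expression above: the key is a tuple rather than a concatenated string, because plain concatenation can make different records look identical (for example run_directory "run1" + barcode "2" and run_directory "run" + barcode "12" both give "run12").

```python
from collections import Counter


def duplicate_records(sample_ids, run_directories, barcodes):
    """Return the (sampleID, run_directory, barcode) triples that occur more than once."""
    # Tuples keep the field boundaries, so ("S1", "run1", "2") and
    # ("S1", "run", "12") stay distinct, unlike their concatenations.
    counts = Counter(zip(sample_ids, run_directories, barcodes))
    return [key for key, n in counts.items() if n > 1]


sample_ids = ["S1", "S1", "S2", "S1"]
run_directories = ["run_a", "run_b", "run_a", "run_a"]
barcodes = ["bc01", "bc01", "bc02", "bc01"]

dupes = duplicate_records(sample_ids, run_directories, barcodes)
for sample_id, run_dir, barcode in dupes:
    print(f"Duplicate record: sampleID={sample_id}, "
          f"run_directory={run_dir}, barcode={barcode}")
```

Rows 0 and 1 share a sampleID but come from different runs, so they are accepted; rows 0 and 3 agree on all three fields and are reported.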

Since you're doing this anyway, please also check whether there are duplicates of the concatenation of run_directory and barcode when they are not empty, since these should not be possible either. So when adding elements to the counter, ignore empty fields. Furthermore, check whether there are records that contain a run_directory but no barcode, or a barcode but no run_directory. These shouldn't be possible either, so we need a separate error message for each of these situations.
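The three checks requested here could look roughly like this. It is only a sketch: the function name and the wording of the messages are made up for illustration, and empty cells are assumed to be empty strings rather than NaN.

```python
from collections import Counter


def check_run_and_barcode(run_directories, barcodes):
    """Return a list of error messages for the run_directory/barcode columns."""
    errors = []
    pairs = list(zip(run_directories, barcodes))

    # A record must have both fields or neither.
    for row, (run_dir, barcode) in enumerate(pairs):
        if run_dir and not barcode:
            errors.append(f"row {row}: run_directory '{run_dir}' has no barcode")
        elif barcode and not run_dir:
            errors.append(f"row {row}: barcode '{barcode}' has no run_directory")

    # Count only complete pairs; empty or half-filled rows are ignored here.
    counts = Counter((r, b) for r, b in pairs if r and b)
    for (run_dir, barcode), n in counts.items():
        if n > 1:
            errors.append(
                f"run_directory '{run_dir}' + barcode '{barcode}' occurs {n} times"
            )
    return errors


run_directories = ["run_a", "run_a", "", "run_b", "", "run_a"]
barcodes = ["bc01", "bc02", "bc03", "", "", "bc01"]

errors = check_run_and_barcode(run_directories, barcodes)
for message in errors:
    print(message)
```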

iwmstjp commented 2 months ago

Are sampleID, run_directory and barcode necessary in all cases? Or if there are duplicates of sampleID, run_directory and barcode are necessary?

iwmstjp commented 2 months ago

I added two checks. Check the combination of sampleID, run_directory and barcode is unique. Check sampleID, run_directory and barcode are not empty.

gpetho commented 2 months ago

> Are sampleID, run_directory and barcode necessary in all cases?

sampleID is necessary. run_directory and barcode can be empty. However, there is one other thing to be checked: if either of the last two is specified, then they both must be specified. So either the concatenation of run_directory + barcode is the empty string, or if not, then they both need to be non-empty strings.

> Or if there are duplicates of sampleID, run_directory and barcode are necessary?

This is correct. If sampleID is duplicated, run_directory and barcode should not be empty. As I said, just verify whether the concatenation of sampleID + run_directory + barcode is unique to make sure that this condition is satisfied. (But correct me if I'm wrong about this.)
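On the "correct me if I'm wrong" point, one case may be worth testing explicitly: if a sampleID appears twice, once with run_directory and barcode filled in and once with both empty, the three-field combination is still unique, so the uniqueness check alone would not flag it. A sketch of an additional check, with a hypothetical function name and empty cells assumed to be empty strings:

```python
from collections import Counter


def duplicated_ids_missing_run_info(sample_ids, run_directories, barcodes):
    """Return (row, sampleID) for rows whose sampleID is repeated but whose
    run_directory or barcode is empty."""
    id_counts = Counter(sample_ids)
    rows = zip(sample_ids, run_directories, barcodes)
    return [
        (row, s)
        for row, (s, r, b) in enumerate(rows)
        if id_counts[s] > 1 and not (r and b)
    ]


sample_ids = ["S1", "S1", "S2"]
run_directories = ["", "run_a", ""]
barcodes = ["", "bc01", ""]

# The combined key is unique for every row ...
triples = list(zip(sample_ids, run_directories, barcodes))
all_unique = len(set(triples)) == len(triples)

# ... yet row 0 repeats sampleID S1 without any run information.
problems = duplicated_ids_missing_run_info(sample_ids, run_directories, barcodes)
for row, sample_id in problems:
    print(f"row {row}: sampleID '{sample_id}' is duplicated "
          f"but run_directory/barcode is empty")
```

Row 2 (S2 with no run information) is not reported, since sampleID alone is allowed when it is not duplicated.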

iwmstjp commented 2 months ago

This issue was resolved by the following commits:
fix: check sampleID duplicates -> check sampleID + run_directory + barcode, check those are not empty
fix: check empty and duplicate sampleID

iwmstjp commented 2 months ago

This issue was resolved by the following commits: fix: fix check for sampleID, run_directory and barcode