gpetho closed this issue 2 months ago
UPDATE: As Levente reminded me in today's weekly meeting, checking for a duplicate `sampleID` is not the right solution. Two records can legitimately have the same sample id if they describe the results of different sequencing runs. In these cases, the fields describing the properties of the sample being sequenced, including the sample id, should be identical, but the concatenation of the `run_directory` plus the `barcode` needs to be different. (The `run_directory` may be the same for the two lines while the `barcode` differs, if the sample is sequenced twice during the same run. It is also possible that the `run_directory` differs while the `barcode` is the same, if the two samples happen to get the same barcode during different runs, since the barcode is only supposed to be unique within a run.)
So what should be passed to the counter is not the list of sample ids, but rather `[s + d + b for s, d, b in zip(sample_ids, run_directories, barcodes)]`, assuming that `sample_ids` etc. are lists or, more generally, iterables over the elements of the `sampleID`, `run_directory` and `barcode` columns of the pandas df.
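A minimal sketch of the composite-key check described above. The function name `find_duplicate_records` and the sample values are hypothetical. The sketch also swaps the string concatenation `s + d + b` for a tuple key, which is an assumption on my part rather than what the comment specifies: plain concatenation could treat `"ab" + "c"` and `"a" + "bc"` as the same key.

```python
from collections import Counter


def find_duplicate_records(sample_ids, run_directories, barcodes):
    """Return the (sampleID, run_directory, barcode) keys that occur more than once."""
    # Tuples are used as keys instead of s + d + b so that field boundaries
    # are preserved (an assumption, not part of the original suggestion).
    counts = Counter(zip(sample_ids, run_directories, barcodes))
    return [key for key, n in counts.items() if n > 1]


# Hypothetical column values: S1 was sequenced in two runs (allowed),
# but the first and last rows are identical in all three fields.
sample_ids = ["S1", "S1", "S2", "S1"]
run_directories = ["run_A", "run_B", "run_A", "run_A"]
barcodes = ["bc01", "bc01", "bc02", "bc01"]

duplicates = find_duplicate_records(sample_ids, run_directories, barcodes)
print(duplicates)
```

With a pandas df, the three arguments could simply be the corresponding columns, since `zip` accepts any iterable.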
Since you're doing this anyway, please also check whether there are duplicates of the concatenation of `run_directory` and `barcode` where these are not empty, since such duplicates should not be possible either. So when adding elements to the counter, ignore empty fields. Furthermore, check whether there are records that contain a `run_directory` but no `barcode`, or a `barcode` but no `run_directory`. These shouldn't be possible either, so we need a separate error message for each of these situations.
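The three checks requested here could look roughly like the sketch below. The function name `check_run_fields` and the message wording are hypothetical, and the sketch assumes that missing values are already empty strings (with pandas, something like `fillna("")` would be needed first, because NaN is truthy in Python).

```python
from collections import Counter


def check_run_fields(run_directories, barcodes):
    """Return a list of error messages for the run_directory/barcode columns."""
    errors = []
    pairs = []
    for row, (d, b) in enumerate(zip(run_directories, barcodes)):
        if d and not b:
            errors.append(f"row {row}: run_directory is set but barcode is empty")
        elif b and not d:
            errors.append(f"row {row}: barcode is set but run_directory is empty")
        elif d and b:
            # Only fully specified pairs go into the counter;
            # rows where both fields are empty are ignored.
            pairs.append((d, b))
    for (d, b), n in Counter(pairs).items():
        if n > 1:
            errors.append(f"run_directory {d!r} with barcode {b!r} occurs {n} times")
    return errors


# Hypothetical column values covering each situation.
run_directories = ["run_A", "run_A", "", "run_B", "", "run_A"]
barcodes = ["bc01", "", "bc02", "bc03", "", "bc01"]

errors = check_run_fields(run_directories, barcodes)
for message in errors:
    print(message)
```

Row numbers here are zero-based positions in the input, not line numbers of the original file.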
Are sampleID, run_directory and barcode necessary in all cases? Or are run_directory and barcode necessary only if there are duplicates of sampleID?
I added two checks: one that the combination of sampleID, run_directory and barcode is unique, and one that sampleID, run_directory and barcode are not empty.
Are sampleID, run_directory and barcode necessary in all cases?
sampleID is necessary. run_directory and barcode can be empty. However, there is one other thing to be checked: if either of the last two is specified, then they both must be specified. So either the concatenation of run_directory + barcode is the empty string, or if not, then they both need to be non-empty strings.
Or are run_directory and barcode necessary only if there are duplicates of sampleID?
This is correct. If sampleID is duplicated, run_directory and barcode should not be empty. As I said, just verify whether the concatenation of sampleID + run_directory + barcode is unique to make sure that this condition is satisfied. (But correct me if I'm wrong about this.)
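One possible edge case, offered as a sketch rather than a correction: if a sampleID occurs twice and only one of the two records has an empty run_directory and barcode, the triple is still unique, so the uniqueness check alone would pass. An explicit check might look like this (the function name `duplicated_ids_missing_run_info` and the sample values are hypothetical, and empty fields are assumed to be empty strings).

```python
from collections import Counter


def duplicated_ids_missing_run_info(sample_ids, run_directories, barcodes):
    """Return sampleIDs that occur more than once and have a record without run info."""
    counts = Counter(sample_ids)
    flagged = []
    for s, d, b in zip(sample_ids, run_directories, barcodes):
        if counts[s] > 1 and not (d and b) and s not in flagged:
            flagged.append(s)
    return flagged


# Hypothetical data: S1 appears twice, once without run information.
sample_ids = ["S1", "S1", "S2"]
run_directories = ["", "run_A", ""]
barcodes = ["", "bc01", ""]

triples = list(zip(sample_ids, run_directories, barcodes))
triples_unique = len(set(triples)) == len(triples)
flagged = duplicated_ids_missing_run_info(sample_ids, run_directories, barcodes)

print(triples_unique)  # the uniqueness check passes for this data
print(flagged)         # but S1 is duplicated and one record lacks run info
```

S2 is not flagged because a unique sampleID is allowed to have empty run fields.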
This issue was resolved by the following commits:
- fix: check sampleID duplicates -> check sampleID + run_directory + barcode, check those are not empty
- fix: check empty and duplicate sampleID
This issue was resolved by the following commits: fix: fix check for sampleID, run_directory and barcode
As discussed in today's meeting, just checking the data type of the sample id field is not sufficient. The validation should fail with an appropriate error message if the sample id is empty or if it is not unique within the current file, i.e. if there is another record in the table that has the same sample id. There are many ways to achieve this. The most efficient in terms of computation is probably putting the sample ids into a list, sorting it, and then iterating over its elements pairwise to check whether the element at `i` is equal to the element at `i+1`. A slightly less efficient but more transparent way to do it is to create a `Counter` object `c` (from the `collections` module) using the list of sample ids. The list of duplicate sample_ids is `[key for key in c if c[key] > 1]`. Implement the error message according to this.
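Both approaches mentioned above can be sketched as follows. The function names and the error message wording are hypothetical, and empty sample ids are assumed to be empty strings. The sorted variant needs the ids to be mutually comparable, while the `Counter` variant only needs them to be hashable and makes a single pass over the list.

```python
from collections import Counter


def duplicates_sorted(sample_ids):
    """Sort, then compare each element with its successor."""
    ordered = sorted(sample_ids)
    dups = []
    for current, following in zip(ordered, ordered[1:]):
        # Record each duplicated value only once, even if it occurs 3+ times.
        if current == following and (not dups or dups[-1] != current):
            dups.append(current)
    return dups


def duplicates_counter(sample_ids):
    """Count occurrences and keep the keys seen more than once."""
    c = Counter(sample_ids)
    return [key for key in c if c[key] > 1]


def validate_sample_ids(sample_ids):
    """Return error messages for empty and duplicated sample ids."""
    errors = []
    empty_rows = [i for i, s in enumerate(sample_ids) if not s]
    if empty_rows:
        errors.append(f"empty sampleID in rows: {empty_rows}")
    dups = duplicates_counter([s for s in sample_ids if s])
    if dups:
        errors.append(f"duplicate sampleID values: {dups}")
    return errors


# Hypothetical sample ids with one empty entry and two duplicated values.
ids = ["S3", "S1", "", "S1", "S2", "S3", "S1"]
for message in validate_sample_ids(ids):
    print(message)
```

The two helpers return the same set of duplicates but in different orders: sorted order for the first, first-seen order for the second.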