Open philerooski opened 6 years ago
This seems to be a common issue. In some cases the rows are completely identical. In other cases there is a difference in, e.g., the ObjectLabelsFound column.
See this gist for all Experiment/Well/ObjectTrackID/TimePoint combinations that contain duplicates: https://gist.github.com/philerooski/f3554b61ef8778d382ed9c561073dab1
Corrected indices for AB-CS47iTDP-Survival are here: https://www.synapse.org/#!Synapse:syn17087846/tables/
Corrected indices for LINCS2016B are here: https://www.synapse.org/#!Synapse:syn17087896/tables/
Thanks @jaslinkalra!
@philerooski is this enough for you to make corrections?
I'm still seeing some duplicates in the corrected LINCS2016B file, but let's try something different...
Of the measurements with a non-NA ObjectTrackID, we have 1254 duplicated indices. But of those duplicated indices, only 23 actually differ in some other, non-index column. Meaning most of the duplicates can be dropped consequence-free.
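As a rough sketch of how counts like these could be reproduced: the snippet below counts indices that occur more than once, then repeats the count after collapsing byte-for-byte identical rows, so that only indices with a real conflict remain. The `cells` tibble is toy data invented for illustration, and the helper `count_repeated_indices` is a hypothetical name, not something from the project code.

```r
library(dplyr)

# Toy stand-in for the curated cell data (values are made up).
cells <- tibble(
  Experiment    = rep("E1", 5),
  Well          = rep("A1", 5),
  ObjectTrackID = c(1L, 1L, 2L, 2L, NA),
  TimePoint     = rep(0L, 5),
  Lost_Tracking = c(TRUE, FALSE, FALSE, FALSE, FALSE)
)

index_cols <- c("Experiment", "Well", "ObjectTrackID", "TimePoint")

# Only measurements with a non-NA ObjectTrackID are considered.
tracked <- cells %>% filter(!is.na(ObjectTrackID))

# Hypothetical helper: indices that appear on more than one row.
count_repeated_indices <- function(df) {
  df %>%
    group_by(across(all_of(index_cols))) %>%
    summarise(n_rows = n(), .groups = "drop") %>%
    filter(n_rows > 1)
}

# Every duplicated index, identical rows included.
duplicated_indices <- count_repeated_indices(tracked)

# After distinct() removes exact duplicate rows, anything still repeated
# must differ in at least one non-index column.
conflicting_indices <- count_repeated_indices(distinct(tracked))

cat(nrow(duplicated_indices), "duplicated indices,",
    nrow(conflicting_indices), "with conflicting values\n")
```

The difference between the two counts is the number of duplicated indices that can be dropped without losing information.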
@jaslinkalra I've included the 23 duplicated indices which have column value conflicts here: https://www.synapse.org/#!Synapse:syn17093258
The table is arranged so that columnName.x is the value in the first instance of the duplicated index, and columnName.y is the value in the second instance. These values won't match for at least one combination of columnName.x/columnName.y. For example, the very first row (Experiment = AB-SOD1-KW4-WTC11-Survival, ObjectTrackID = 1, Well = A1, TimePoint = 0) has a conflict in Lost_Tracking: Lost_Tracking.x is TRUE but Lost_Tracking.y is FALSE.
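One way a table with this .x/.y layout could be built is to number the instances of each index and join the first instance to the second. This is only a sketch on invented toy data, and it assumes at most two rows per index; the actual table on Synapse may have been produced differently.

```r
library(dplyr)

# Toy data: track 1 has a conflict in Lost_Tracking, track 2 is an exact duplicate.
cells <- tibble(
  Experiment    = rep("E1", 5),
  Well          = rep("A1", 5),
  ObjectTrackID = c(1L, 1L, 2L, 2L, 3L),
  TimePoint     = rep(0L, 5),
  Lost_Tracking = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  Live_Cells    = rep(TRUE, 5)
)

index_cols <- c("Experiment", "Well", "ObjectTrackID", "TimePoint")

# Number the rows that share an index: 1 = first instance, 2 = second.
numbered <- cells %>%
  group_by(across(all_of(index_cols))) %>%
  mutate(instance = row_number()) %>%
  ungroup()

first_instance  <- numbered %>% filter(instance == 1) %>% select(-instance)
second_instance <- numbered %>% filter(instance == 2) %>% select(-instance)

# inner_join() appends the default suffixes ".x" and ".y" to the non-key columns.
# Note: `!=` yields NA when either side is NA, and filter() drops those rows,
# so NA-vs-value conflicts would need separate handling.
conflicts <- inner_join(first_instance, second_instance, by = index_cols) %>%
  filter(Lost_Tracking.x != Lost_Tracking.y | Live_Cells.x != Live_Cells.y)

print(conflicts)
```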
If we can address these inconsistencies, then we can simply drop the duplicated indices to resolve this issue.
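Once the conflicts are resolved, dropping the duplicates might look like the sketch below. It first checks that no index still carries conflicting values, then keeps one row per index. The data is a made-up example, not the real curated table.

```r
library(dplyr)

# Toy data with one exact duplicate (track 1, TimePoint 0) and no conflicts.
cells <- tibble(
  Experiment    = rep("E1", 5),
  Well          = rep("A1", 5),
  ObjectTrackID = c(1L, 1L, 2L, 3L, 3L),
  TimePoint     = c(0L, 0L, 0L, 0L, 1L),
  Lost_Tracking = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Rows that are unique across every column.
exact_unique <- distinct(cells)

# One row per index, keeping the first instance and all other columns.
index_unique <- distinct(cells, Experiment, Well, ObjectTrackID, TimePoint,
                         .keep_all = TRUE)

# If these differ, some index still has conflicting values and dropping
# duplicates would silently discard information.
stopifnot(nrow(exact_unique) == nrow(index_unique))

deduplicated <- index_unique
print(deduplicated)
```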
I sorted through the curation and uploaded the corrected LINCS062016B curation here: https://www.synapse.org/#!Synapse:syn17096732/tables/
I'm still sorting through the other datasets.
The correct curation for the AB-SOD1-KW4-WTC11 dataset is here: https://www.synapse.org/#!Synapse:syn17933845/tables/
Great! These should be enough to resolve this issue (once I push the de-duplicated table to Synapse)
Some of these corrected files introduced more issues.
For example, the original curated cell data:
# A tibble: 14 x 7
   Experiment   Well  TimePoint ObjectTrackID ObjectLabelsFound Live_Cells Mistracked
   <chr>        <chr>     <int>         <int>             <int> <lgl>      <lgl>
 1 LINCS062016B A3            1            52                52 TRUE       FALSE
 2 LINCS062016B A3            2            52                52 TRUE       FALSE
 3 LINCS062016B A3            3            52                52 TRUE       FALSE
 4 LINCS062016B A3            4            52                52 TRUE       FALSE
 5 LINCS062016B A3            5            52                52 TRUE       FALSE
 6 LINCS062016B A3            6            52               137 TRUE       FALSE
 7 LINCS062016B A3            7            52                52 TRUE       FALSE
 8 LINCS062016B A3            8            52                52 TRUE       TRUE
 9 LINCS062016B A3            8            52               165 TRUE       FALSE
10 LINCS062016B A3            9            52               165 TRUE       FALSE
11 LINCS062016B A3           11            52               195 TRUE       FALSE
12 LINCS062016B A3           12            52               205 TRUE       FALSE
13 LINCS062016B A3           13            52               205 TRUE       FALSE
14 LINCS062016B A3           14            52               205 TRUE       FALSE
And the "corrected" rows provided to fix the repeated TimePoint at TimePoint = 8:
# A tibble: 10 x 7
   Experiment   Well  TimePoint ObjectTrackID ObjectLabelsFound Live_Cells Mistracked
   <chr>        <chr>     <int>         <int>             <int> <lgl>      <lgl>
 1 LINCS062016B A3            1            52                52 TRUE       FALSE
 2 LINCS062016B A3            2            52                52 TRUE       FALSE
 3 LINCS062016B A3            3            52                52 TRUE       FALSE
 4 LINCS062016B A3            4            52                52 TRUE       FALSE
 5 LINCS062016B A3            5            52                52 TRUE       FALSE
 6 LINCS062016B A3            6            52                52 TRUE       FALSE
 7 LINCS062016B A3            7            52                52 FALSE      FALSE
 8 LINCS062016B A3            8            52                52 FALSE      FALSE
 9 LINCS062016B A3            9            52                52 FALSE      FALSE
10 LINCS062016B A3           10            52                52 FALSE      FALSE
The corrected rows are incomplete (they do not cover all of the TimePoint values), and replacing the rows in the curated cell data with them introduces a zombie track ( #8 ).
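A check along these lines could flag the problem before the corrected rows are merged. The sketch assumes that a "zombie track" means a track whose Live_Cells value is FALSE at some TimePoint and TRUE again at a later one; that definition, and the assumption that the original rows for TimePoints 11 to 14 are kept after the replacement, are both guesses rather than something confirmed in this thread. Track 7 is an invented healthy track added for contrast.

```r
library(dplyr)

# Track 52: corrected rows for TimePoints 1-10 followed by the original
# rows for TimePoints 11-14 (assumed merge result). Track 7 is made up.
merged <- tibble(
  Experiment    = "LINCS062016B",
  Well          = "A3",
  ObjectTrackID = c(rep(52L, 14), rep(7L, 3)),
  TimePoint     = c(1:14, 1:3),
  Live_Cells    = c(rep(TRUE, 6), rep(FALSE, 4), rep(TRUE, 4),
                    TRUE, TRUE, FALSE)
)

zombie_tracks <- merged %>%
  arrange(Experiment, Well, ObjectTrackID, TimePoint) %>%
  group_by(Experiment, Well, ObjectTrackID) %>%
  # cumany(!Live_Cells) becomes TRUE from the first dead TimePoint onward,
  # so a live row with that flag set means the cell "came back".
  summarise(is_zombie = any(Live_Cells & cumany(!Live_Cells)),
            .groups = "drop") %>%
  filter(is_zombie)

print(zombie_tracks)
```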
This table: https://www.synapse.org/#!Synapse:syn17087846/tables/ contained a track with a gap, so I removed that track from the final curated dataset for now.
The track with Experiment == "AB-CS47iTDP-Survival", Well == "D11", ObjectTrackID == 21 is missing TimePoint = 21.
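Gaps like this could be found systematically with something like the following sketch. It assumes a track should contain every TimePoint between its first and last observation, which may not match the actual spec. The data is invented; only the missing TimePoint of 21 mirrors the case above.

```r
library(dplyr)

# Toy data: track 21 skips TimePoint 21, track 5 is complete.
cells <- tibble(
  Experiment    = "E1",
  Well          = "D11",
  ObjectTrackID = c(21L, 21L, 21L, 21L, 5L, 5L, 5L),
  TimePoint     = c(19L, 20L, 22L, 23L, 0L, 1L, 2L)
)

tracks_with_gaps <- cells %>%
  # Collapse duplicated indices first so they do not affect the check.
  distinct(Experiment, Well, ObjectTrackID, TimePoint) %>%
  group_by(Experiment, Well, ObjectTrackID) %>%
  summarise(
    # TimePoints expected between the first and last observation but absent.
    missing = list(setdiff(seq(min(TimePoint), max(TimePoint)), TimePoint)),
    .groups = "drop"
  ) %>%
  filter(lengths(missing) > 0)

print(tracks_with_gaps)
```

The `missing` list column holds the absent TimePoints for each flagged track, which makes it easy to decide whether to drop the track or request the missing rows.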
There are duplicates of these rows.
See output of
curated_cell_data %>% filter(Experiment == "AB-CS47iTDP-Survival", Well == "D1", ObjectTrackID == 15) %>% arrange(TimePoint)
Or on Synapse: https://www.synapse.org/#!Synapse:syn11378063/tables/query/eyJzcWwiOiJTRUxFQ1QgKiBGUk9NIHN5bjExMzc4MDYzIFdIRVJFICggKCBcIkV4cGVyaW1lbnRcIiA9ICdBQi1DUzQ3aVREUC1TdXJ2aXZhbCcgKSBBTkQgKCBcIldlbGxcIiA9ICdEMScgKSBBTkQgKFwiT2JqZWN0VHJhY2tJRFwiID0gMTUpICkgT1JERVIgQlkgVGltZVBvaW50IiwgImluY2x1ZGVFbnRpdHlFdGFnIjp0cnVlLCAiaXNDb25zaXN0ZW50Ijp0cnVlLCAib2Zmc2V0IjowLCAibGltaXQiOjI1fQ==
Lost_Tracking == true at TimePoint 2 followed by Lost_Tracking == false at TimePoint 6 also seems unusual. I'm not sure if it's within spec.