Sage-Bionetworks / neurolincsdreamchallenge

Double recording of Experiment/Well/ObjectTrackID/TimePoint combination #5

Open philerooski opened 6 years ago

philerooski commented 6 years ago

There are duplicates of these rows.

See output of curated_cell_data %>% filter(Experiment == "AB-CS47iTDP-Survival", Well == "D1", ObjectTrackID == 15) %>% arrange(TimePoint)

1 AB-CS47iTDP-Survival            15   D1         0       true         false
2 AB-CS47iTDP-Survival            15   D1         0       true         false
3 AB-CS47iTDP-Survival            15   D1         1       true         false
4 AB-CS47iTDP-Survival            15   D1         1       true         false
5 AB-CS47iTDP-Survival            15   D1         2       true          true
6 AB-CS47iTDP-Survival            15   D1         2       true          true
7 AB-CS47iTDP-Survival            15   D1         6       true         false
8 AB-CS47iTDP-Survival            15   D1         6       true         false

Or on Synapse: https://www.synapse.org/#!Synapse:syn11378063/tables/query/eyJzcWwiOiJTRUxFQ1QgKiBGUk9NIHN5bjExMzc4MDYzIFdIRVJFICggKCBcIkV4cGVyaW1lbnRcIiA9ICdBQi1DUzQ3aVREUC1TdXJ2aXZhbCcgKSBBTkQgKCBcIldlbGxcIiA9ICdEMScgKSBBTkQgKFwiT2JqZWN0VHJhY2tJRFwiID0gMTUpICkgT1JERVIgQlkgVGltZVBvaW50IiwgImluY2x1ZGVFbnRpdHlFdGFnIjp0cnVlLCAiaXNDb25zaXN0ZW50Ijp0cnVlLCAib2Zmc2V0IjowLCAibGltaXQiOjI1fQ==

Lost_Tracking == true at TimePoint 2 followed by Lost_Tracking == false at TimePoint 6 also seems unusual. I'm not sure whether that is within spec.

philerooski commented 6 years ago

This seems to be a common issue. In some cases the rows are completely identical. In other cases there is a difference in, e.g., the ObjectLabelsFound column.

See this gist for all Experiment/Well/ObjectTrackID/TimePoint combinations that contain duplicates: https://gist.github.com/philerooski/f3554b61ef8778d382ed9c561073dab1
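A minimal sketch of how such a list of duplicated index combinations could be produced. It is written in Python with only the standard library (the thread itself uses R/dplyr), and the function name `duplicated_indices` and the dict-per-row layout are assumptions for illustration, not the code behind the gist.

```python
from collections import Counter

# The four columns that together should identify a row uniquely.
INDEX_COLS = ("Experiment", "Well", "ObjectTrackID", "TimePoint")


def duplicated_indices(rows):
    """Return {index_tuple: count} for every index that occurs more than once."""
    counts = Counter(tuple(row[c] for c in INDEX_COLS) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Illustrative rows modelled on the D1 / ObjectTrackID 15 example above
# (index columns only; the real table has more columns).
rows = [
    {"Experiment": "AB-CS47iTDP-Survival", "Well": "D1",
     "ObjectTrackID": 15, "TimePoint": t}
    for t in (0, 0, 1, 1, 2, 2, 6, 6)
]

dupes = duplicated_indices(rows)
for key, n in sorted(dupes.items()):
    print(key, "appears", n, "times")
```

Counting on the index tuple alone flags both fully identical rows and rows that differ in a non-index column; telling those two cases apart is a separate step.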

jaslinkalra commented 5 years ago

Corrected indices for AB-CS47iTDP-Survival are here: https://www.synapse.org/#!Synapse:syn17087846/tables/

Corrected indices for LINCS2016B are here: https://www.synapse.org/#!Synapse:syn17087896/tables/

kdaily commented 5 years ago

Thanks @jaslinkalra!

@philerooski is this enough for you to make corrections?

philerooski commented 5 years ago

I'm still seeing some duplicates in the corrected LINCS2016B file, but let's try something different...

Of the measurements with a non-NA ObjectTrackID, we have 1254 duplicated indices. Of those, only 23 actually differ in some other, non-index column, meaning most of the duplicates can be dropped without consequence.

@jaslinkalra I've included the 23 duplicated indices which have column value conflicts here: https://www.synapse.org/#!Synapse:syn17093258

The table is arranged so that columnName.x is the value in the first instance of the duplicated index and columnName.y is the value in the second instance. In every row, at least one columnName.x/columnName.y pair will not match. For example, the very first row (Experiment = AB-SOD1-KW4-WTC11-Survival, ObjectTrackID = 1, Well = A1, TimePoint = 0) has a conflict in Lost_Tracking: Lost_Tracking.x is true but Lost_Tracking.y is false.
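The .x/.y layout described above can be sketched as follows. This is a hedged, standard-library Python illustration rather than the code that built the Synapse table: `conflict_table` is a hypothetical helper, it compares only the first two rows of each duplicated index, and the Live_Cells values in the sample rows are invented for the example.

```python
from collections import defaultdict

INDEX_COLS = ("Experiment", "Well", "ObjectTrackID", "TimePoint")


def conflict_table(rows):
    """For each duplicated index, list non-index columns whose values disagree.

    Each output row holds the index columns plus `<col>.x` (first instance)
    and `<col>.y` (second instance) for every conflicting column.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in INDEX_COLS)].append(row)

    conflicts = []
    for key, group in groups.items():
        if len(group) < 2:
            continue  # index is unique, nothing to compare
        first, second = group[0], group[1]  # assumption: compare first two only
        differing = [c for c in first
                     if c not in INDEX_COLS and first[c] != second.get(c)]
        if differing:
            out = dict(zip(INDEX_COLS, key))
            for c in differing:
                out[c + ".x"] = first[c]
                out[c + ".y"] = second.get(c)
            conflicts.append(out)
    return conflicts


# Sample modelled on the first conflicting row mentioned above.
# Live_Cells is an illustrative value, not taken from the real table.
base = {"Experiment": "AB-SOD1-KW4-WTC11-Survival", "Well": "A1",
        "ObjectTrackID": 1, "TimePoint": 0, "Live_Cells": True}
sample = [dict(base, Lost_Tracking=True), dict(base, Lost_Tracking=False)]

table = conflict_table(sample)
print(table)
```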

If we can address these inconsistencies, then we can simply drop the duplicated indices to resolve this issue.
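As a sketch of the "simply drop" step, the following standard-library Python keeps the first occurrence of every fully identical row and leaves conflicting rows in place so they still need a manual decision. The helper name `drop_exact_duplicates` and the sample values are assumptions for illustration only.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        # Column names are unique, so sorting the (name, value) pairs
        # gives a stable, hashable fingerprint of the whole row.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(row)
    return kept


# Illustrative rows: two exact copies plus one copy that conflicts in Lost_Tracking.
row_a = {"Experiment": "E", "Well": "A1", "ObjectTrackID": 1,
         "TimePoint": 0, "Lost_Tracking": True}
row_b = dict(row_a, Lost_Tracking=False)

deduplicated = drop_exact_duplicates([row_a, dict(row_a), row_b])
print(len(deduplicated))
```

Rows that share an index but differ in another column survive this step, which is why the conflicting indices have to be resolved first.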

jaslinkalra commented 5 years ago

I sorted through the curation and uploaded the correct LINCS062016B curation here:

https://www.synapse.org/#!Synapse:syn17096732/tables/

I'm still sorting through the other datasets.

jaslinkalra commented 5 years ago

The correct curation for the AB-SOD1-KW4-WTC11 dataset is here: https://www.synapse.org/#!Synapse:syn17933845/tables/

philerooski commented 5 years ago

Great! These should be enough to resolve this issue (once I push the de-duplicated table to Synapse).

philerooski commented 4 years ago

Some of these corrected files introduced more issues.

For example, the original curated cell data:

# A tibble: 14 x 7
   Experiment   Well  TimePoint ObjectTrackID ObjectLabelsFound Live_Cells Mistracked
   <chr>        <chr>     <int>         <int>             <int> <lgl>      <lgl>     
 1 LINCS062016B A3            1            52                52 TRUE       FALSE     
 2 LINCS062016B A3            2            52                52 TRUE       FALSE     
 3 LINCS062016B A3            3            52                52 TRUE       FALSE     
 4 LINCS062016B A3            4            52                52 TRUE       FALSE     
 5 LINCS062016B A3            5            52                52 TRUE       FALSE     
 6 LINCS062016B A3            6            52               137 TRUE       FALSE     
 7 LINCS062016B A3            7            52                52 TRUE       FALSE     
 8 LINCS062016B A3            8            52                52 TRUE       TRUE      
 9 LINCS062016B A3            8            52               165 TRUE       FALSE     
10 LINCS062016B A3            9            52               165 TRUE       FALSE     
11 LINCS062016B A3           11            52               195 TRUE       FALSE     
12 LINCS062016B A3           12            52               205 TRUE       FALSE     
13 LINCS062016B A3           13            52               205 TRUE       FALSE     
14 LINCS062016B A3           14            52               205 TRUE       FALSE 

And the "corrected" rows provided to fix the repeated TimePoint at TimePoint = 8:

# A tibble: 10 x 7
   Experiment   Well  TimePoint ObjectTrackID ObjectLabelsFound Live_Cells Mistracked
   <chr>        <chr>     <int>         <int>             <int> <lgl>      <lgl>     
 1 LINCS062016B A3            1            52                52 TRUE       FALSE     
 2 LINCS062016B A3            2            52                52 TRUE       FALSE     
 3 LINCS062016B A3            3            52                52 TRUE       FALSE     
 4 LINCS062016B A3            4            52                52 TRUE       FALSE     
 5 LINCS062016B A3            5            52                52 TRUE       FALSE     
 6 LINCS062016B A3            6            52                52 TRUE       FALSE     
 7 LINCS062016B A3            7            52                52 FALSE      FALSE     
 8 LINCS062016B A3            8            52                52 FALSE      FALSE     
 9 LINCS062016B A3            9            52                52 FALSE      FALSE     
10 LINCS062016B A3           10            52                52 FALSE      FALSE   

The corrected rows are incomplete (they do not have all the TimePoint values: they stop at TimePoint 10, while the original track runs through TimePoint 14), and replacing the rows in the curated cell data with these introduces a zombie track (#8).
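A small check along these lines could catch the problem before replacement rows are merged in. The sketch below is standard-library Python and rests on an assumption: a "zombie" track is taken to mean one where Live_Cells is FALSE at some TimePoint and TRUE again at a later one (the precise definition lives in #8). `is_zombie` is a hypothetical helper name.

```python
def is_zombie(track):
    """track: iterable of (TimePoint, Live_Cells) pairs for one object.

    Returns True if the cell is recorded dead and later recorded alive again
    (assumed definition of a zombie track).
    """
    seen_dead = False
    for _, alive in sorted(track):
        if not alive:
            seen_dead = True
        elif seen_dead:
            return True
    return False


# Live_Cells values taken from the two tables above (LINCS062016B, A3, track 52).
corrected = [(t, t <= 6) for t in range(1, 11)]     # TRUE for 1-6, FALSE for 7-10
original_tail = [(t, True) for t in (11, 12, 13, 14)]  # original rows left in place

merged = corrected + original_tail
print(is_zombie(corrected), is_zombie(merged))
```

On their own the corrected rows are consistent; the zombie only appears once they are combined with the original rows at TimePoints 11 to 14 that have no replacement.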

philerooski commented 4 years ago

This table: https://www.synapse.org/#!Synapse:syn17087846/tables/ contained a track with a gap, so I removed that track from the final curated dataset for now.

The track with Experiment == "AB-CS47iTDP-Survival", Well == "D11", ObjectTrackID == 21 is missing TimePoint = 21.
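Gaps like this can be found mechanically. The following standard-library Python sketch assumes a "gap" means an integer TimePoint missing between a track's first and last recorded TimePoint; `missing_timepoints` is a hypothetical helper, and the example uses the TimePoint values from the original LINCS062016B / A3 / ObjectTrackID 52 rows shown earlier in this thread.

```python
def missing_timepoints(timepoints):
    """Return the integer TimePoints absent between the first and last observed."""
    present = set(timepoints)
    if not present:
        return []
    return [t for t in range(min(present), max(present) + 1) if t not in present]


# TimePoint column of the original 14-row tibble above (8 appears twice).
observed = [1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 11, 12, 13, 14]
gaps = missing_timepoints(observed)
print(gaps)
```

Whether every gap is an error depends on the tracking spec; this only lists candidates for review.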