Sage-Bionetworks / neurolincsdreamchallenge

adjust data to account for all time points and appropriate labeling of Lost_Tracking #3

Closed kdaily closed 4 years ago

kdaily commented 6 years ago

Current data (https://www.synapse.org/#!Synapse:syn11378063/tables/) has missing time points per Experiment + Well + Object. The Lost_Tracking column is inappropriately used to indicate that the next timepoint (and subsequent ones until a manually curated track comes back) is lost.

We need to transform this to:

  1. Fill in missing time points. This table (https://www.synapse.org/#!Synapse:syn11817859/tables/) has the start and end time point for each experiment that can be used to determine how many time points there should be for each object.
  2. Where Lost_Tracking is currently True, change it to False; the filled-in missing time points get Lost_Tracking = True instead.

Task:

Make a new table with these changes in R - do not modify existing table.
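
The two transformation steps above can be sketched for a single object as follows. This is a minimal illustration in plain Python rather than the R the task asks for, and it rests on assumptions not stated in the issue: rows are dictionaries keyed by the column names of the curated table, the start and end time points come from the experiment table, and Live_Cells is left empty (`None`) for filled-in time points. The function name `fill_and_relabel` and the sample object are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def fill_and_relabel(rows: List[Row], start: int, end: int) -> List[Row]:
    """Return NEW rows for one object covering every time point in [start, end].

    Observed rows are copied, with Lost_Tracking reset to False where it was True.
    Time points without a row are added with Lost_Tracking = True.
    The input rows are never modified.
    """
    by_time_point = {r["TimePoint"]: r for r in rows}
    template = rows[0]  # supplies the identifying columns for filled-in rows
    filled = []
    for tp in range(start, end + 1):
        if tp in by_time_point:
            new_row = dict(by_time_point[tp])
            if new_row["Lost_Tracking"] is True:
                new_row["Lost_Tracking"] = False
        else:
            new_row = {
                "Experiment": template["Experiment"],
                "Well": template["Well"],
                "ObjectTrackID": template["ObjectTrackID"],
                "TimePoint": tp,
                "Live_Cells": None,  # assumption: unknown while tracking is lost
                "Lost_Tracking": True,
            }
        filled.append(new_row)
    return filled


# Hypothetical object: tracked at time points 0, 1 and 4 of an experiment
# that runs from time point 0 to 4.
base = {"Experiment": "EXAMPLE", "Well": "A1", "ObjectTrackID": 1}
rows = [
    {**base, "TimePoint": 0, "Live_Cells": True, "Lost_Tracking": False},
    {**base, "TimePoint": 1, "Live_Cells": True, "Lost_Tracking": True},
    {**base, "TimePoint": 4, "Live_Cells": True, "Lost_Tracking": False},
]
filled = fill_and_relabel(rows, start=0, end=4)
print([r["Lost_Tracking"] for r in filled])
# [False, False, True, True, False]
```

Because the function builds a new list, the existing rows stay untouched, which mirrors the requirement to create a new table instead of editing the current one.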

philerooski commented 6 years ago

I don't have access to syn11817859

kdaily commented 6 years ago

Hi @philerooski, where does this stand?

philerooski commented 6 years ago

I'll have some time later today to implement the changes described in the September 24th email chain.

philerooski commented 6 years ago

At least one object has Live_Cells = T, Lost_Tracking = F at some timepoint, then NA for both values at the next timepoint, then Live_Cells = T, Lost_Tracking = F again for the next few timepoints, until supposedly the cell dies and the rest of the time points are missing.

# A tibble: 25 x 6
   Experiment           TimePoint ObjectTrackID Well  Live_Cells Lost_Tracking
   <chr>                    <int>         <int> <chr> <chr>      <chr>        
 1 AB-CS47iTDP-Survival         0             7 A1    true       false        
 2 AB-CS47iTDP-Survival         1             7 A1    NA         NA           
 3 AB-CS47iTDP-Survival         2             7 A1    true       false        
 4 AB-CS47iTDP-Survival         3             7 A1    true       false        
 5 AB-CS47iTDP-Survival         4             7 A1    true       false        
 6 AB-CS47iTDP-Survival         5             7 A1    true       false        
 7 AB-CS47iTDP-Survival         6             7 A1    true       false        
 8 AB-CS47iTDP-Survival         7             7 A1    true       false        
 9 AB-CS47iTDP-Survival         8             7 A1    NA         NA           
10 AB-CS47iTDP-Survival         9             7 A1    NA         NA           
# ... with 15 more rows

What to do here? Sometimes the gap in data is larger than a single timepoint.

philerooski commented 6 years ago

Another strange case. Lost_Tracking is labeled as true but all the timepoints are present. Maybe the Lost_Tracking label at timepoint 7 is incorrect?

     Experiment ObjectTrackID Well TimePoint Live_Cells Lost_Tracking
1  LINCS062016B            38  A11         1       true         false
2  LINCS062016B            38  A11         2       true         false
3  LINCS062016B            38  A11         3       true         false
4  LINCS062016B            38  A11         4       true         false
5  LINCS062016B            38  A11         5       true         false
6  LINCS062016B            38  A11         6       true         false
7  LINCS062016B            38  A11         7       true          true
8  LINCS062016B            38  A11         8      false         false
9  LINCS062016B            38  A11         9      false         false
10 LINCS062016B            38  A11        10      false         false
11 LINCS062016B            38  A11        11      false         false
12 LINCS062016B            38  A11        12      false         false
13 LINCS062016B            38  A11        13      false         false
14 LINCS062016B            38  A11        14      false         false

On Synapse: https://www.synapse.org/#!Synapse:syn11378063/tables/query/eyJzcWwiOiJTRUxFQ1QgKiBGUk9NIHN5bjExMzc4MDYzIFdIRVJFICggKCBcIkV4cGVyaW1lbnRcIiA9ICdMSU5DUzA2MjAxNkInICkgQU5EICggXCJXZWxsXCIgPSAnQTExJyApIEFORCAoXCJPYmplY3RUcmFja0lEXCIgPSAzOCkgKSBPUkRFUiBCWSBUaW1lUG9pbnQiLCAiaW5jbHVkZUVudGl0eUV0YWciOnRydWUsICJpc0NvbnNpc3RlbnQiOnRydWUsICJvZmZzZXQiOjAsICJsaW1pdCI6MjV9

Another confusing example of this from the same well/experiment:

    Experiment ObjectTrackID Well TimePoint Live_Cells Lost_Tracking
1 LINCS062016B            58  A11         1       true         false
2 LINCS062016B            58  A11         2       true         false
3 LINCS062016B            58  A11         3       true         false
4 LINCS062016B            58  A11         4       true         false
5 LINCS062016B            58  A11         5       true         false
6 LINCS062016B            58  A11         6       true          true
7 LINCS062016B            58  A11         7      false         false

Seems to me to be a data munging error caused by the cell's death.

kdaily commented 6 years ago

@philerooski the first example seems to be a manual error. The other two @jaslinkalra is still looking into.

Can we determine a filter or rule to identify anything else that looks like these?

  1. I think I actually have the code to do the first using a run length encoding strategy - I will commit what I have here and highlight it for you.
  2. The second case seems pretty easy to identify (has all timepoints but > 1 is Lost_Tracking = true).
  3. Third case would identify any object that has a timepoint with consecutive Live_Cells and Lost_Tracking switching from both true to both false. Not trivial but doable?
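
The three rules above could be sketched roughly like this. It is an illustration in plain Python, not the code referred to in the comment, and it makes several assumptions: each object is given as two lists ordered by TimePoint (Live_Cells and Lost_Tracking, with `None` standing in for NA), an "interior gap" means a missing run with observed data on both sides, and the threshold `min_true` for the second rule defaults to 1 because the examples in this thread show a single flagged timepoint. The names `run_lengths` and `flag_object` are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple


def run_lengths(values: List[object]) -> List[Tuple[object, int]]:
    """Run-length encode a sequence into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def flag_object(
    live: List[Optional[bool]],
    lost: List[Optional[bool]],
    min_true: int = 1,
) -> Dict[str, bool]:
    """Apply the three candidate rules to one object's ordered time series."""
    observed = [v is not None for v in live]
    runs = run_lengths(observed)
    # Rule 1: a run of missing values that is neither the first nor the last run.
    interior_gap = any(not is_observed for is_observed, _ in runs[1:-1])
    # Rule 2: every timepoint present, yet Lost_Tracking is true somewhere.
    complete_but_lost = all(observed) and sum(v is True for v in lost) >= min_true
    # Rule 3: consecutive timepoints going from both true to both false.
    states = list(zip(live, lost))
    true_to_false = any(
        before == (True, True) and after == (False, False)
        for before, after in zip(states, states[1:])
    )
    return {
        "interior_gap": interior_gap,
        "complete_but_lost": complete_but_lost,
        "true_to_false": true_to_false,
    }


# First ten timepoints of object 7 (AB-CS47iTDP-Survival, well A1) from this thread.
live_7 = [True, None, True, True, True, True, True, True, None, None]
lost_7 = [False, None, False, False, False, False, False, False, None, None]
flags_7 = flag_object(live_7, lost_7)

# Object 38 (LINCS062016B, well A11), timepoints 1 to 14, from this thread.
live_38 = [True] * 7 + [False] * 7
lost_38 = [False] * 6 + [True] + [False] * 7
flags_38 = flag_object(live_38, lost_38)

print(flags_7)
print(flags_38)
```

Under these assumptions object 7 trips only the first rule, while object 38 trips the second and third, which matches how the two examples were described earlier in the thread.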

philerooski commented 6 years ago

I'm going to comb by hand through this table of different Live_Cells/Lost_Tracking combinations and create separate issues for each type of anomalous record I come across.

   Live_Cells Lost_Tracking previous_live_cells previous_lost_tracking
1       FALSE         FALSE                  NA                     NA
2          NA            NA               FALSE                  FALSE
3          NA            NA                  NA                     NA
4        TRUE          TRUE                  NA                     NA
5          NA            NA                TRUE                   TRUE
6        TRUE         FALSE                  NA                     NA
7        TRUE         FALSE                TRUE                  FALSE
8          NA            NA                TRUE                  FALSE
9        TRUE          TRUE                TRUE                  FALSE
10      FALSE         FALSE               FALSE                  FALSE
11       TRUE          TRUE                TRUE                   TRUE
12       TRUE         FALSE               FALSE                  FALSE
13      FALSE         FALSE                TRUE                  FALSE
14      FALSE         FALSE                TRUE                   TRUE
15      FALSE          TRUE               FALSE                  FALSE
16         NA            NA               FALSE                   TRUE
17      FALSE          TRUE                  NA                     NA
18      FALSE         FALSE               FALSE                   TRUE
19       TRUE         FALSE                TRUE                   TRUE
20      FALSE            NA               FALSE                  FALSE
21         NA            NA               FALSE                     NA
22       TRUE            NA                TRUE                  FALSE
23         NA            NA                TRUE                     NA
24      FALSE            NA                  NA                     NA
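
A table of this shape can be produced by pairing each timepoint's state with the state at the previous timepoint of the same object and counting the distinct combinations. The thread does not say how the table above was generated, so the following is only a sketch under assumptions: each object is a list of (Live_Cells, Lost_Tracking) pairs ordered by TimePoint, `None` stands in for NA, and the first timepoint of every object gets NA for both "previous" columns. The name `transition_counts` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

State = Tuple[Optional[bool], Optional[bool]]  # (Live_Cells, Lost_Tracking)


def transition_counts(objects: Dict[str, List[State]]) -> Counter:
    """Count (Live_Cells, Lost_Tracking, previous_live_cells, previous_lost_tracking)."""
    counts: Counter = Counter()
    for states in objects.values():
        previous: State = (None, None)  # no earlier timepoint for the first row
        for current in states:
            counts[current + previous] += 1
            previous = current
    return counts


# Object 58 (LINCS062016B, well A11) from this thread, timepoints 1 to 7.
objects = {
    "LINCS062016B/A11/58": [(True, False)] * 5 + [(True, True), (False, False)],
}
counts = transition_counts(objects)
for combination, n in counts.items():
    print(combination, n)
```

Keeping the counts alongside the combinations shows how common each pattern is, which may help decide which ones deserve their own issue.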

jaslinkalra commented 5 years ago

Is this still an issue that I can resolve? Are there more cases like this where the curation logic failed?

philerooski commented 5 years ago

I was working on a script to fix this, but the logic is incomplete: https://github.com/philerooski/neurolincsdreamchallenge/blob/fix-lost-tracking/R/fix_lost_tracking.R

Maybe it's time to make this issue a priority again.

jaslinkalra commented 5 years ago

Okay, I suggest we first apply the manual-error corrections I provided as a CSV file, since I found lost tracking cases similar to this issue that stem from manual errors in reporting the correct object labels. Let me know if I need to update the curation table in Synapse.

philerooski commented 5 years ago

Note to self (correct me if I'm wrong): Update the relevant rows in this

https://www.synapse.org/#!Synapse:syn11378063/tables/

with this

https://www.synapse.org/#!Synapse:syn18134075

before fixing Lost_Tracking labels as described above.

philerooski commented 4 years ago

Fixed by #17