RGLab / CytoML

A GatingML Interface for Cross Platform Cytometry Data Sharing
GNU Affero General Public License v3.0
29 stars 14 forks source link

Update logic of search_for_fcs within FlowJo workspace parser #77

Closed jacobpwagner closed 4 years ago

jacobpwagner commented 4 years ago

You can see https://github.com/RGLab/CytoML/issues/76 for further discussion of the motivations for rewriting the logical flow of search_for_fcs. Changes made:

  1. Gather files first by whether base filename within directory matches sample name
  2. If none found, gather files with $FIL keyword matching sample name -- Note, the second commit reduces directory traversals and repeated file access by keeping track of matches during steps 1 and 2 and ensures $TOT matches for any prospective matched file. It also converts the ballooning nested if/else's to switches for (in my opinion) slightly improved readability. If this level of reworking isn't desired, the second commit can just be reverted as the end behavior is the same after the first commit.
  3. Attempt to filter down to single matched file.
    • If already single file, UID is just sample name. NOTE: this is different from prior behavior where $TOT was always automatically appended
    • First filter down to those files that match $TOT, and if that's not enough, try using the sequence of selected keywords + sampleID
    • Only append as much information to the UID as was necessary for the unambiguous match
    • If there are still multiple matched files, throw error with message containing all prospective matched full file paths and prompt user to resolve the ambiguity by moving files. NOTE: I kept this case an error because that was its prior behavior, but in this case should we just report a warning and skip the sample?

Testing: From the main CytoML tests, there are two expected failures:

Here, $TOT is no longer appended to the sample name because it is not necessary to determine the match

test-flowjo2gs.R:107: failure: use additional keywords for guid
sampleNames(gs2[[1]]) not equal to paste(sampleNames(gh), trimws(keyword(gh)[["$TOT"]]), sep = "_").
1/1 mismatches
x[1]: "CytoTrol_CytoTrol_1.fcs"
y[1]: "CytoTrol_CytoTrol_1.fcs_119531"

And here is the case I discussed in https://github.com/RGLab/CytoML/issues/76#issuecomment-578343876, which was only succeeding because of inconsistent logic before. Now it hits the new error message pointing out the ambiguously matched files:

test-GatingSet2flowJo.R:181: error: gatingset_to_flowjo: handle special encoding in keywords 
Multiple FCS files match sample A01 by filename, event count, and keywords.
Candidates are: 
/tmp/RtmpLSu2d1/s5a01.fcs
/tmp/RtmpLSu2d1/file759152cbdb51/s5a01.fcs
Please move incorrect files out of this directory or its subdirectories or this sample will be excluded.

If/when we decide this is good to merge, I can add a commit to update those two tests to reflect the change. I'll also be adding more test cases with problematic datasets that hit each of the subcases.

@mikejiang, I currently cannot run all of the tests in flowjo2gs_internalTestSuite.R because of access to some recently added files. Would you be able to run those tests before/after these commits to see if any new errors arise?

jacobpwagner commented 4 years ago

@mikejiang , after 1422a213ecee71ebdec52267d28ac1525f50508f, the only failing tests are ones that are appropriately failing due to the new behavior/rules. Specifically, this one is failing due to multiple matches as discussed before: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/tests/testthat/test-GatingSet2flowJo.R#L181

And these are failing because we are now strictly enforcing a total event count match between the workspace and the FCS file:

https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/tests/testthat/flowjo2gs_internalTestSuite.R#L275-L282

^ For that one, I'm not sure why it's not getting the appropriate total counts from the workspace, but it's v10.0.6

> fj_ws_get_samples(ws)
    sampleID                 name count pop.counts
1        299            39191.fcs    -1          5
2        230 CFSP7007V1_TCELL.fcs    -1         16
3        228    CFSP7007V1_DC.fcs    -1         11
4        227 CFSP7007V1_BCELL.fcs    -1         18
5        229    CFSP7007V1_NK.fcs    -1         14
6         33            35122.fcs    -1         15
7         32            35121.fcs    -1         11
8         35            35120.fcs    -1         18
...

https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/tests/testthat/flowjo2gs_internalTestSuite.R#L397-L410

^ The last case which relaxes the rules for matching $TOT is exactly what we are now blocking, so that last test should probably just be removed