RGLab / CytoML

A GatingML Interface for Cross Platform Cytometry Data Sharing
GNU Affero General Public License v3.0
29 stars 14 forks source link

Improve how flowjo_to_gatingset handles ambiguous sample/file mapping #76

Closed jacobpwagner closed 4 years ago

jacobpwagner commented 4 years ago

When determining the FCS file corresponding to a sample specified in a FlowJo workspace, parse_sample calls search_for_fcs: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L406-L463

which attempts to match the sample to a single FCS file by considering

Prior experience suggests that it is a good idea to check that the total number of events matches, regardless of what other keywords may also be checked. This is a good check that the particular file versions being read are actually corresponding to those analyzed in FlowJo to produce the workspace and serves to prevent silent but pernicious analysis errors.

So, the proposed new logic is instead:

Additionally, there is the question of when additional keyword information should be appended to the UIDs and/or sample names in the resulting GatingSet. Generally it seems users do not expect to have the total number of events appended to the sample names (like some_cells.fcs_303254), unless it is necessary to resolve ambiguity or explicitly requested. We should think about what default and optional behavior should be in that regard.

mikejiang commented 4 years ago

Alternatively, we could modify the logic as:

This may potentially simplify the code, but you can decide implementation details

jacobpwagner commented 4 years ago

I like that flow, and an unambiguous filename+$TOT match should (ideally) be the most common case.

jacobpwagner commented 4 years ago

I think I've worked out almost all of the wrinkles, except for one last design question. Currently, the logic of search_for_fcs will gather all files from the entire tree at and below the working directory (or directory specified by the path arg) with base filename matching the sample name: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L413-L419 which means that if there is a matching file in a subdirectory, it will later cause flowjo_to_gatingset to error out here: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L444-L456

For example, this tree structure will cause the "Duplicated FCS" error if you attempt to parse fj_ws.xml starting from that top directory:

.
├── file1.fcs
├── fj_ws.xml
└── sub_dir
    └── file1.fcs

This is actually circumvented only in the case where the filename doesn't match, but $FIL does, because of the break statement here: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L421-L440 That is actually all that keeps this test passing, as by this point the directory is cluttered with another subdirectory with s5a01.fcs (but the sample name in manual.xml is A01 so it hits the break): https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/tests/testthat/test-GatingSet2flowJo.R#L181

Getting to the point, do we want to actually make the logic give priority to the top directory so that even if there is a file found in a subdirectory, the top-level file will be used and prevent the error? @mikejiang ? @gfinak ?

jacobpwagner commented 4 years ago

Another idea from offline discussion with Mike: We could add a "greedy_match" flag that would dictate that the behavior should be to just take the first good match (sample name, $TOT, and any additional keywords) to avoid unnecessary directory traversal and prioritize completion. That's pretty easy to add by just wrapping break statements in greedy_match checks.