Improve how flowjo_to_gatingset handles ambiguous sample/file mapping

jacobpwagner commented 4 years ago

When determining the FCS file corresponding to a sample specified in a FlowJo workspace, parse_sample calls search_for_fcs: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L406-L463

which attempts to match the sample to a single FCS file by considering

filename matching the sample name
additional keywords specified by additional.keys arg passed down from the R level. The default is to check for the matching total number of events ($TOT keyword)

Prior experience suggests that it is a good idea to check that the total number of events matches, regardless of what other keywords may also be checked. This is a good check that the particular file versions being read are actually corresponding to those analyzed in FlowJo to produce the workspace and serves to prevent silent but pernicious analysis errors.

So, the proposed new logic is instead:

check for a filename match
If there's ambiguity, use $TOT and (optionally) additional.keys to resolve it
Either way, when there is a prospective unambiguous match, make sure $TOT from the FCS matches the number of events from the workspace

Additionally, there is the question of when additional keyword information should be appended to the UIDs and/or sample names in the resulting GatingSet. Generally it seems users do not expect to have the total number of events appended to the sample names (like some_cells.fcs_303254), unless it is necessary to resolve ambiguity or explicitly requested. We should think about what default and optional behavior should be in that regard.

mikejiang commented 4 years ago

Alternatively, we could modify the logic as:

check for a filename+$TOT match
If there's ambiguity, use additional.keys to resolve it (and append them to sample names)
If no ambiguity, try to drop $TOT to see if filename alone is sufficient

This may potentially simplify the code, but you can decide implementation details

jacobpwagner commented 4 years ago

I like that flow, and an unambiguous filename+$TOT match should (ideally) be the most common case.

jacobpwagner commented 4 years ago

I think I've worked out almost all of the wrinkles, except for one last design question. Currently, the logic of search_for_fcs will gather all files from the entire tree at and below the working directory (or directory specified by the path arg) with base filename matching the sample name: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L413-L419 which means that if there is a matching file in a subdirectory, it will later cause flowjo_to_gatingset to error out here: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L444-L456

For example, this tree structure will cause the "Duplicated FCS" error if you attempt to parse fj_ws.xml starting from that top directory:

.
├── file1.fcs
├── fj_ws.xml
└── sub_dir
    └── file1.fcs

This is actually circumvented only in the case where the filename doesn't match, but $FIL does, because of the break statement here: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L421-L440 That is actually all that keeps this test passing, as by this point the directory is cluttered with another subdirectory with s5a01.fcs (but the sample name in manual.xml is A01 so it hits the break): https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/tests/testthat/test-GatingSet2flowJo.R#L181

Getting to the point, do we want to actually make the logic give priority to the top directory so that even if there is a file found in a subdirectory, the top-level file will be used and prevent the error? @mikejiang ? @gfinak ?

jacobpwagner commented 4 years ago

Another idea from offline discussion with Mike: We could add a "greedy_match" flag that would dictate that the behavior should be to just take the first good match (sample name, $TOT, and any additional keywords) to avoid unnecessary directory traversal and prioritize completion. That's pretty easy to add by just wrapping break statements in greedy_match checks.

RGLab / CytoML

Improve how flowjo_to_gatingset handles ambiguous sample/file mapping #76