Closed jacobpwagner closed 4 years ago
Alternatively, we could modify the logic as:
This may potentially simplify the code, but you can decide implementation details
I like that flow, and an unambiguous filename+$TOT
match should (ideally) be the most common case.
I think I've worked out almost all of the wrinkles, except for one last design question. Currently, the logic of search_for_fcs
will gather all files from the entire tree at and below the working directory (or directory specified by the path
arg) with base filename matching the sample name:
https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L413-L419
which means that if there is a matching file in a subdirectory, it will later cause flowjo_to_gatingset
to error out here:
https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L444-L456
For example, this tree structure will cause the "Duplicated FCS" error if you attempt to parse fj_ws.xml
starting from that top directory:
.
├── file1.fcs
├── fj_ws.xml
└── sub_dir
└── file1.fcs
This is actually circumvented only in the case where the filename doesn't match, but $FIL
does, because of the break
statement here:
https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L421-L440
That is actually all that keeps this test passing, as by this point the directory is cluttered with another subdirectory with s5a01.fcs
(but the sample name in manual.xml
is A01
so it hits the break
):
https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/tests/testthat/test-GatingSet2flowJo.R#L181
Getting to the point, do we want to actually make the logic give priority to the top directory so that even if there is a file found in a subdirectory, the top-level file will be used and prevent the error? @mikejiang ? @gfinak ?
Another idea from offline discussion with Mike: We could add a "greedy_match" flag that would dictate that the behavior should be to just take the first good match (sample name, $TOT
, and any additional keywords) to avoid unnecessary directory traversal and prioritize completion. That's pretty easy to add by just wrapping break
statements in greedy_match
checks.
When determining the FCS file corresponding to a sample specified in a FlowJo workspace,
parse_sample
callssearch_for_fcs
: https://github.com/RGLab/CytoML/blob/b8e14a2a85ad2c6ee03977da482a329631d5ba74/inst/include/CytoML/flowJoWorkspace.hpp#L406-L463which attempts to match the sample to a single FCS file by considering
additional.keys
arg passed down from the R level. The default is to check for the matching total number of events ($TOT
keyword)Prior experience suggests that it is a good idea to check that the total number of events matches, regardless of what other keywords may also be checked. This is a good check that the particular file versions being read are actually corresponding to those analyzed in FlowJo to produce the workspace and serves to prevent silent but pernicious analysis errors.
So, the proposed new logic is instead:
$TOT
and (optionally)additional.keys
to resolve it$TOT
from the FCS matches the number of events from the workspaceAdditionally, there is the question of when additional keyword information should be appended to the UIDs and/or sample names in the resulting
GatingSet
. Generally it seems users do not expect to have the total number of events appended to the sample names (likesome_cells.fcs_303254
), unless it is necessary to resolve ambiguity or explicitly requested. We should think about what default and optional behavior should be in that regard.