RGLab / CytoML

A GatingML Interface for Cross Platform Cytometry Data Sharing
GNU Affero General Public License v3.0
30 stars 14 forks source link

Diva Gates are not imported / transformed correctly. #64

Closed JimboMahoney closed 5 years ago

JimboMahoney commented 5 years ago

Hi there!

Firstly, MANY thanks for this code! I've used it successfully to compare manual gates with automated clustering.

However, I've found a couple of issues, one of which I can work around and one I cannot.

Here is my code:

ws <- open_diva_xml(file) diva_get_sample_groups(ws) gs <- diva_to_gatingset(ws, name = 1, subset = 1, worksheet="global")

There is also this, but I'm not entirely sure it's necessary:

gh <- gs[[1]]

Issue 1:

The third line "breaks" if the directory containing the XML / FCS files contains subdirectories with files of the same name. For some reason, it also looks in the subdirectories, finds files with the same name and then reports that it cannot open the specified FCS file.

At first I was stumped, but after removing the subdirectories, it opens the file OK.

Issue 2:

I'm not sure if this is an issue with Diva (I'm using 8.0.2) or with CytoML. I'll try and explain.

I have an experiment with the original FCS files captured in Diva. This opens fine, as well as all gates and cell populations.

However, I get very strange results if I do the following:

1) Export a subset of that experiment (e.g. after doing basic gating, such as lymphocytes and singlets) to an FCS file (so that I have only the cells I'm interested in). 2) Open that FCS file as a new experiment in Diva. 3) Set my gates. Looks good in Diva, with all stats being as I expect. 4) Export the experiment. 5) Import using CytoML.

The gates look good using this code:

plot(gs)

and also this:

gs_get_pop_paths(gh,path="auto")

However, the stats are only correct for the first two gates:

gh_pop_compare_stats(gh)

Any further gates are unpopulated, except for one that only contains 4 events (it should contain 598).

In addition, the plotting of the graphs doesn't work:

plotGate(gh)

It gives the following error:

"Error in valid.viewport(x, y, width, height, just, gp, clip, xscale, yscale, : invalid 'yscale' in viewport"

And a bunch of the following warnings:

"3: In min(x) : no non-missing arguments to min; returning Inf 4: In max(x) : no non-missing arguments to max; returning -Inf"

Any ideas on what's going on would be appreciated! Let me know if you want / need the Diva experiments (they're only small).

Thanks!

jacobpwagner commented 5 years ago

Hi @JimboMahoney , I'll address the issues in order:

1) Yeah, I understand it's a little frustrating. The choice to use the recursive=TRUE option in list.files within diva_to_gatingset is intentional as it makes it easier for users to have the xml in a top directory and have the FCS files in a nested data directory, without having to explicitly point it there with the path arg. If that's a real problem, we could maybe allow the user to pass in a value to recursive that would be honored by list.files so recursive=FALSE would only look in the top level. Either way, it's dangerous to silently guess how users would want the issue of duplicate filenames resolved, so it's safer to just error and prompt them resolve it.

2) If you could attach the Diva experiments and FCS files or a location where I could access each, that would help. Until I have a chance to look at it, a few quick things to try/examine:

First, are the gate definitions for the new gates getting read in correctly? Is this successful (and new_gate looks as expected?:

new_gate <- gs_pop_get_gate(gs, "<new node name>")` 

Second, what is the result after recompute(gs)?

jacobpwagner commented 5 years ago

Also, I hope you don't mind if I remove the octothorps from your post to remove the (I assume) unintentional issue cross-linking.

JimboMahoney commented 5 years ago

Hi Jacob,

Hah, I didn't even know they were sometimes called octothorps! Cool!

Re: 1 - That's cool, but maybe the error message could be more helpful? My error "cannot find file" (or similar) made me think it wasn't actually finding the file, rather than actually finding multiple copies. I also didn't realise it was recursive.

Re: 2 - I'll give that a go and get back to you.

Thanks!

JimboMahoney commented 5 years ago

Experiment.zip Experiment_151035.zip

Hi Jacob,

Thanks for your help. Attached are the two Diva experiments I'm working with.

Luckily they are both really small.

"Experiment" works fine and was the "original" captured on an Aria.

"Experiment 151035" was created by importing a single FCS file from the original "Experiment". That FCS file was created by exporting a subset of cells from the original, using a gate. (I don't know if that is what makes the difference, but it feels like it might?). UPDATE - Actually, that's not it. I opened another experiment acquired on a Canto. The same issue is present there, even if I export all the data. The gates that existed when the experiment was first acquired are populated, but not any gates that I subsequently add in further analysis. The new gates show, but contain no cells.

It looks as I expect in Diva, with all the gates / statistics etc. However, once imported in R, only the first two gates give the correct data ("root" and "T Cells"). The next gates down ("CD8" and "CD4") are weird - one shows nothing and the other has 4 events. They should have 527 / 1224 respectively.

In both cases, I'm using the "Mixed Cells" FCS file and a global worksheet.

gs_pop_get_gate worked fine and tells me the X/Y parameters and the size/shape of the gate. recomputejust returns "done!" and doesn't seem to make any difference.

jacobpwagner commented 5 years ago

Thanks @JimboMahoney, I'm looking in to it now.

jacobpwagner commented 5 years ago

Hi @JimboMahoney . So, I think I found the proximate cause, but I'm still working on the root cause. One way or another, the gate bounds are not getting transformed correctly. This is the CD4 gate that is being parsed in

> g
Rectangular gate 'CD4' with dimensions:
  APC-H7-A: (0.0198189485818148,1.535968542099)
  BB515-A: (2.0697820186615,3.47838377952576)

which should overlay the CD4+ population here:

bad_gate

But as you can see by the bounds, and the next zoomed out plot, the y dimension of the gate is off so it's landing well above the population.

bad_gate_zoom_out I have some ideas about what is causing this, but it would help me if you could paste in the output of sessionInfo() so I can see the versions of all relevant packages.

JimboMahoney commented 5 years ago

Here's that particular plot & gate from Diva:

image

According to Diva, it contains 1224 cells.

Here's my sessionInfo():

R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 [4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252

attached base packages: [1] tcltk stats graphics grDevices utils datasets methods base

other attached packages: [1] tidyr_1.0.0 flowCore_1.51.7 BiocManager_1.30.4 openCyto_1.23.5 flowWorkspace_3.33.9 [6] devtools_2.2.1 usethis_1.5.1 CytoML_1.11.9 svDialogs_1.0.0

loaded via a namespace (and not attached): [1] ncdfFlow_2.31.3 matrixStats_0.55.0 fs_1.3.1 RColorBrewer_1.1-2 rprojroot_1.3-2 Rgraphviz_2.29.0
[7] backports_1.1.4 tools_3.6.1 R6_2.4.0 KernSmooth_2.23-15 lazyeval_0.2.2 BiocGenerics_0.31.6 [13] colorspace_1.4-1 withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3 prettyunits_1.0.2 mnormt_1.5-5
[19] processx_3.4.1 compiler_3.6.1 graph_1.63.0 cli_1.1.0 Biobase_2.45.1 flowClust_3.23.0
[25] ggcyto_1.13.2 desc_1.2.0 flowStats_3.43.8 svGUI_1.0.0 scales_1.0.0 DEoptimR_1.0-8
[31] hexbin_1.27.3 mvtnorm_1.0-11 robustbase_0.93-5 callr_3.3.2 RBGL_1.61.0 stringr_1.4.0
[37] digest_0.6.21 R.utils_2.9.0 rrcov_1.4-7 base64enc_0.1-3 pkgconfig_2.0.3 sessioninfo_1.1.1
[43] rlang_0.4.0 rstudioapi_0.10 jsonlite_1.6 mclust_5.4.5 gtools_3.8.1 dplyr_0.8.3
[49] R.oo_1.22.0 magrittr_1.5 Matrix_1.2-17 Rcpp_1.0.2 munsell_0.5.0 lifecycle_0.1.0
[55] R.methodsS3_1.7.1 stringi_1.4.3 yaml_2.2.0 MASS_7.3-51.4 zlibbioc_1.31.0 pkgbuild_1.0.5
[61] plyr_1.8.4 grid_3.6.1 parallel_3.6.1 crayon_1.3.4 lattice_0.20-38 splines_3.6.1
[67] zeallot_0.1.0 ps_1.3.0 pillar_1.4.2 fda_2.4.8 corpcor_1.6.9 stats4_3.6.1
[73] pkgload_1.0.2 XML_3.98-1.20 glue_1.3.1 latticeExtra_0.6-28 remotes_2.1.0 data.table_1.12.2
[79] RcppParallel_4.4.4 vctrs_0.2.0 testthat_2.2.1 gtable_0.3.0 purrr_0.3.2 clue_0.3-57
[85] assertthat_0.2.1 ks_1.11.5 ggplot2_3.2.1 IDPmisc_1.1.19 pcaPP_1.9-73 tibble_2.1.3
[91] memoise_1.1.0 flowViz_1.49.2 ellipse_0.4.1 cluster_2.1.0 ellipsis_0.3.0

Apologies if that doesn't paste well - I'm fairly new to both GitHub and R.

PS - two additional comments / questions:

1) When I load CytoML, I get this warning (not sure if this is relevant):

Loading required package: CytoML Registered S3 method overwritten by 'R.oo': method from
throw.default R.methodsS3

2) If some packages are screwing things up, how come the gate import works perfectly for the "original" dataset (even when I add more gates to it), but not this file?

Thanks for your help!

jacobpwagner commented 5 years ago

Alright. It looks like the transformations for each of the channels for that specimen differ between Cell_Sort.xml and Experiment_151035.xml. Using BB414-A as an example:

Cell_Sort.xml

<tube name="Mixed Cells">
...
<instrument_settings name="Cytometer Settings" template="false">
...
                    <parameter name="BB515-A" type="30">
                        <is_log>true</is_log>
                        <is_quantitated>false</is_quantitated>
                        <min>0.0</min>
                        <max>5.4185380935668945</max>
                        <raw_index>8</raw_index>
                        <linear_index>12</linear_index>
                        <log_index>20</log_index>
                        <final_index>20</final_index>
                        <fl>BB515</fl>
                        <voltage>365</voltage>
                        <target>3</target>
                        <brightness>1.0</brightness>
                        <threshold>5000</threshold>
                        <trigger>false</trigger>
                        <labels_only>false</labels_only>
                        <biexp_scale>0</biexp_scale>
                        <comp_biexp_scale>0</comp_biexp_scale>
                        <manual_biexp_scale>0</manual_biexp_scale>
                        <can_be_compensated>true</can_be_compensated>
                        <compensation>
                            <compensation_coefficient>0.0</compensation_coefficient>
                            <compensation_coefficient>1.00003068</compensation_coefficient>
                            <compensation_coefficient>0.0</compensation_coefficient>
                            <compensation_coefficient>-0.00166979</compensation_coefficient>
                        </compensation>
                    </parameter>

Experiment_151035.xml

<tube name="Mixed Cells">
...
<instrument_settings name="Cytometer Settings" template="false">
...
                    <parameter name="BB515-A" type="30">
                        <is_log>true</is_log>
                        <is_quantitated>false</is_quantitated>
                        <min>0.0</min>
                        <max>5.4185380935668945</max>
                        <raw_index>8</raw_index>
                        <linear_index>12</linear_index>
                        <log_index>20</log_index>
                        <final_index>20</final_index>
                        <fl>BB515</fl>
                        <voltage>365</voltage>
                        <target>3</target>
                        <brightness>1.0</brightness>
                        <threshold>5000</threshold>
                        <trigger>false</trigger>
                        <labels_only>false</labels_only>
                        <biexp_scale>43</biexp_scale>
                        <comp_biexp_scale>44</comp_biexp_scale>
                        <manual_biexp_scale>0</manual_biexp_scale>
                        <can_be_compensated>true</can_be_compensated>
                        <compensation>
                            <compensation_coefficient>0.0</compensation_coefficient>
                            <compensation_coefficient>1.0000306827316177</compensation_coefficient>
                            <compensation_coefficient>0.0</compensation_coefficient>
                            <compensation_coefficient>-0.0016697859380151786</compensation_coefficient>
                        </compensation>
                    </parameter>

I'm still looking in to it to see if we should be changing any part of the logic of diva_to_gatingset to account for this case or if this is an issue on the export side, but making the transformations on that specimen in Experiment_151035.xml agree with Cell_Sort.xml seems to fix the problem. This is the same archive with the patched xml (Experiment_151035_fixed.xml) included: Experiment_151035_patched.zip

The stats after diva_to_gatingset using Experiment_151035_fixed.xml (minor differences but within reasonable tolerance of the numbers you quoted):

                       pop count
1:                    root  2723
2:                /T Cells  2043
3:            /T Cells/CD4  1218
4:  /T Cells/CD4/Naive CD4   589
5: /T Cells/CD4/Memory CD4   587
6:            /T Cells/CD8   534
7: /T Cells/CD8/Memory CD8   123
8:  /T Cells/CD8/Naive CD8   395

Plots of the gates. plot_zoom_png plot_zoom_png

Of course, the scales are still off even if they are now in agreement, so I'll keep investigating and update this based on what I find out.

JimboMahoney commented 5 years ago

Thanks for looking into this Jacob.

Almost feels like a Diva bug... It's certainly strange that both files are biexponetially displayed in Diva, but for some reason it's outputting different flags for each.

jacobpwagner commented 5 years ago

@JimboMahoney , see/pull 4308271. It looks like it may have been an issue with how the transformations were handled and may be related to https://github.com/RGLab/flowWorkspace/commit/d9a23c36c808358316a57a7f5b72878f07d1feb1 and https://github.com/RGLab/flowWorkspace/commit/63c9eb2168d8119f4f23d6e29bb1ddf6dc679e3b. But anyway, after that simple commit, both of your original experiments parse successfully and the stats seem to even more closely mirror the Diva stats.

                       pop count
1:                    root  2723
2:                /T Cells  2043
3:            /T Cells/CD4  1224
4:  /T Cells/CD4/Naive CD4   595
5: /T Cells/CD4/Memory CD4   587
6:            /T Cells/CD8   527
7: /T Cells/CD8/Memory CD8   116
8:  /T Cells/CD8/Naive CD8   395
jacobpwagner commented 5 years ago

Actually, I am going to reopen this while I work on a more robust solution.

jacobpwagner commented 5 years ago

Alright @JimboMahoney , I ran down the root of the issue and implemented a better solution in 310e95ed2ac8538a27e5b61870d7bacb8a05d558. It took some adjustments to flowWorkspace as well, so make sure to pull those changes and rebuild flowWorkspace before pulling the changes to CytoML and rebuilding it.

jacobpwagner commented 5 years ago

And after RGLab/ggcyto@127877024a2b66be8db3c22f297eb081383dc7ba, the scales of the axes should be good as well, so the plots will properly reflect the scales: plot_zoom_png

JimboMahoney commented 5 years ago

Thanks for all your help on this - as per the other thread, the gates now work as expected!

Actually, the gates are still wrong.

Trying a more complex experiment with more gates and the stats are really different between Diva and CytoML.

Whilst it's not bad in this original dataset, it's very bad for more complex gates.

I've no idea if it's possible to get an exact correlation between Diva gates and CytoML, but I suggest keeping this issue open?

I've added more gates to this small dataset if you'd like me to upload the Diva experiment and the stats shown in Diva to compare?

jacobpwagner commented 5 years ago

@JimboMahoney , if you could upload the new experiment, that would help a lot. I'll try to get this fully figured out. It's a bit of a reverse-engineering game.

JimboMahoney commented 5 years ago

CytoML Gate Test.zip

Hi Jacob,

OK, I've uploaded a single FCS file experiment from the more complex gating strategy, which shows the much larger discrepancies (e.g. the rMono population).

e.g.

openCyto.freq xml.freq openCyto.count xml.count node 1: 1.000000000 0 109925 0 root 2: 0.603702524 0 66362 0 All Cells 3: 0.695367831 0 46146 0 Singlets 4: 0.593615915 0 27393 0 CD45+ 5: 0.105793451 0 2898 0 Ly6G+ 6: 0.413733609 0 1199 0 Neutrophils 7: 0.854889935 0 23418 0 Ly6G- 8: 0.416517209 0 9754 0 CD11cCD11b+ 9: 0.189665778 0 1850 0 DC Populations 10: 0.137071970 0 1337 0 iMono 11: 0.003485749 0 34 0 rMono 12: 0.456742677 0 10696 0 CD11c-CD11b- 13: 0.495044877 0 5295 0 T Cells 14: 0.446896036 0 4780 0 B Cells 15: 0.100136647 0 2345 0 CD11c+CD11b- 16: 0.973987207 0 2284 0 Alveolar Macrophages

Diva shows:

Population #Events
All Events 109925
All Cells 66441
Singlets 46681
CD45+ 29426
Ly6G+ 2212
Neutrophils 1863
Ly6G- 27126
CD11cCD11b+ 9954
DC Populations 1686
iMono 1344
rMono 3703
CD11c-CD11b- 12146
T Cells 5992
B Cells 5858
CD11c+CD11b- 4460
Alveolar Macrophages 4332

Perhaps it's also interesting to see what happens if I disable compensation of this file in Diva (it becomes somewhat more similar to CytoML's):

Population #Events
All Events 109925
All Cells 66441
Singlets 46681
CD45+ 29545
Ly6G+ 3138
Neutrophils 1382
Ly6G- 25134
CD11cCD11b+ 9833
DC Populations 1892
iMono 1411
rMono 38
CD11c-CD11b- 10848
T Cells 5351
B Cells 4903
CD11c+CD11b- 3832
Alveolar Macrophages 3765
jacobpwagner commented 5 years ago

Thanks @JimboMahoney , I'm looking in to it.

jacobpwagner commented 5 years ago

As you highlighted, this is most likely due to the compensation matrix from the XML for that sample not being applied correctly. Working on a fix.

jacobpwagner commented 5 years ago

Alright @JimboMahoney , it should be good after 901a222c98588c2e5e592780dafff894657f3947. Try pulling that and making sure it works. This is what I'm getting for stats:

                                                                  pop  count
 1:                                                              root 109925
 2:                                                        /All Cells  66362
 3:                                               /All Cells/Singlets  46146
 4:                                         /All Cells/Singlets/CD45+  29091
 5:                                   /All Cells/Singlets/CD45+/Ly6G+   2183
 6:                       /All Cells/Singlets/CD45+/Ly6G+/Neutrophils   1850
 7:                                   /All Cells/Singlets/CD45+/Ly6G-  26801
 8:                       /All Cells/Singlets/CD45+/Ly6G-/CD11cCD11b+   9906
 9:        /All Cells/Singlets/CD45+/Ly6G-/CD11cCD11b+/DC Populations   1667
10:                 /All Cells/Singlets/CD45+/Ly6G-/CD11cCD11b+/iMono   1331
11:                 /All Cells/Singlets/CD45+/Ly6G-/CD11cCD11b+/rMono   3661
12:                      /All Cells/Singlets/CD45+/Ly6G-/CD11c-CD11b-  12110
13:              /All Cells/Singlets/CD45+/Ly6G-/CD11c-CD11b-/T Cells   5959
14:              /All Cells/Singlets/CD45+/Ly6G-/CD11c-CD11b-/B Cells   5849
15:                      /All Cells/Singlets/CD45+/Ly6G-/CD11c+CD11b-   4198
16: /All Cells/Singlets/CD45+/Ly6G-/CD11c+CD11b-/Alveolar Macrophages   4069

which is much closer to the Diva values. I'm going to close this, but feel free to comment/open it/start another issue if you come across anything else.

JimboMahoney commented 5 years ago

Thanks Jacob - but I'm afraid I don't know how to "pull" that!

I'm a novice Github and R user and all I know is that the following works on my system:

BiocManager::install("CytoML", version = "devel")

But the following doesn't:

install_github("RGLab/CytoML", ref = "trunk")

From what I can tell, getting the second one to work on a Windows system is non-trivial (I had hoped that installing the xml2 library would work, but it doesn't). i.e. see #65

I guess I'll have to wait until this gets pushed over to bioconductor.

jacobpwagner commented 5 years ago

Sorry @JimboMahoney , my fault. It's a bad habit from talking to people who are installing from the github repo. I moved it over to Bioconductor with a version bump, but it will take a day or so for that to be available via BiocManager::install("CytoML", version = "devel"). Additionally, the Bioconductor 3.10 release is in two days, so this will be in 3.10 as well.

JimboMahoney commented 5 years ago

Awesome - just updated from bioconductor and it's much better now, thanks!