JinmiaoChenLab / cytofkit

cytofkit: an integrated flow/mass cytometry data analysis pipeline
http://jinmiaochenlab.github.io/cytofkit/
57 stars 25 forks source link

Possible bug when using Phenograph with merge method = FIXED and number of events lower than fixedNum #12

Closed lconde-ucl closed 6 years ago

lconde-ucl commented 6 years ago

Hi,

I have been using cytofkit for a while and I have noticed that when I try to analyze data with merge method = FIXED and one or more of the FCS files has a lower number of events than the specified fixedNum (i.e., sampling with replacemenet), phenograph gets stuck for a long time in the "Finding nearest neighbors..." step until it crashes.

For example, a run on a single FCS file with 5004 events, merge = FIXED and FixedNum = 10000 ends like this:

[...]
> cluster_res <- lapply(clusterMethods, cytof_cluster, 
                   ydata = allDimReducedList[[dimReductionMethod]], 
                   xdata = exprs_data)
 Running PhenoGraph...Run Rphenograph starts:
 -Input data of 10000 rows and 17 columns
 -k is set to 30
 Finding nearest neighbors...

Whereas if I run the same dataset with FixNum = 5000, it runs no problem:

[...]
 > cluster_res <- lapply(clusterMethods, cytof_cluster, 
                           ydata = allDimReducedList[[dimReductionMethod]], 
                           xdata = exprs_data)
  Running PhenoGraph...Run Rphenograph starts:
  -Input data of 5000 rows and 17 columns
  -k is set to 30
  Finding nearest neighbors...DONE ~ 1.957 s
  Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 1.328 s
  Build undirected graph from the weighted links...DONE ~ 0.687 s
  Run louvain clustering on the graph ...DONE ~ 0.603 s
Run Rphenograph DONE, took a total of 4.575s.
  Return a community class
  -Modularity value: 0.7892829 
  -Number of clusters: 13 DONE!
[...]

Looks like the merge with replacement is done correctly ("Input data of 10000|5000 rows...'), but for some reason Phenograph does not like it.

If I use other merge methods (ceil, all), it runs fine, it only crashes when merge = fixed and the number of events is lower than FixNum.

Also, I saw that there was an unrelated bug in version 1.8.3 (issue #11), so just in case this bug was also specific for version v1.8.3, I tried with versions 1.6.5 and 1.9.4 and in both versions I find the same issue, i.e. phenograph crashing when merging with replacement.

Happy to send example files if you want to try to replicate the problem.

Thanks in advance for your time and for developing such a great package, Best regards, Lucia

MattMyint commented 6 years ago

Hi Lucia,

Thanks for identifying this! It seems to be an issue with the "Finding neighbours stage", where the function nn2 from the package RANN is used.

The treetype argument here is "bd", which according to this issue page, gives the error you reported when there are duplicates in the data (such as those created when sampling with replacement).

I'll look into whether its feasible to use "kd" rather than "bd", or add an internal test for duplicates to decide which is best.

SamGG commented 6 years ago

If kd fails, adding some noise may do the job: I think rnorm(..., sd = 1e-7) should avoid data duplication.

MattMyint commented 6 years ago

I realised there may be downstream problems if doing sampling with replacement.

I've come up with two possible solutions if cytofkit detects: number of events in any FCS file < fixedNum, and mergeMethod = "fixed"

Solution 1: Force mergeMethod = "ceil"

Solution 2: Force fixedNum to the lowest number of events across the FCS files (e.g. if I have 5 FCS files with 4000, 5000, 5500, 6000 and 4500 events each, fixedNum will be set to 4000.

Currently, I've decided to go with solution 2, as I'm of the opinion that it would provide more balanced analysis results if the same amount of data was extracted from each FCS.

However, if you feel that solution 1 would be more appropriate for analysis, do let me know!

lconde-ucl commented 6 years ago

Sounds good to me. Thanks so much for looking into this! Lucia

traumenCapote commented 5 years ago

Solution 2: Force fixedNum to the lowest number of events across the FCS files (e.g. if I have 5 FCS files with 4000, 5000, 5500, 6000 and 4500 events each, fixedNum will be set to 4000.

Currently, I've decided to go with solution 2, as I'm of the opinion that it would provide more balanced analysis results if the same amount of data was extracted from each FCS.

So then is there no way to sample with replacement as the documentation suggests? I'd like to set the number to the maximum, not the minimum, number of events across fcs files and upsample.

SamGG commented 5 years ago

Hi, How many cells per sample have you got? The sample function could do with replacement, but t is not implemented in cytofkit. One reason is that tSNE dislikes identical cells and there is usually a warning about that unless the corresponding flag has been set to FALSE. So I think there is something bad about identical cells in tSNE and there may be something wrong with the overall cytofkit pipeline as well. Be also warned that computation will be longer as well. If you really want to try it it's not a big deal to replace the code of fixedNum to sample with replacement. Best.