Closed lconde-ucl closed 6 years ago
Hi Lucia,
Thanks for identifying this! It seems to be an issue with the "Finding neighbours stage", where the function nn2
from the package RANN is used.
The treetype argument here is "bd", which according to this issue page, gives the error you reported when there are duplicates in the data (such as those created when sampling with replacement).
I'll look into whether its feasible to use "kd" rather than "bd", or add an internal test for duplicates to decide which is best.
If kd fails, adding some noise may do the job: I think rnorm(..., sd = 1e-7) should avoid data duplication.
I realised there may be downstream problems if doing sampling with replacement.
I've come up with two possible solutions if cytofkit detects: number of events in any FCS file < fixedNum
, and mergeMethod = "fixed"
Solution 1: Force mergeMethod = "ceil"
Solution 2: Force fixedNum
to the lowest number of events across the FCS files (e.g. if I have 5 FCS files with 4000, 5000, 5500, 6000 and 4500 events each, fixedNum
will be set to 4000.
Currently, I've decided to go with solution 2, as I'm of the opinion that it would provide more balanced analysis results if the same amount of data was extracted from each FCS.
However, if you feel that solution 1 would be more appropriate for analysis, do let me know!
Sounds good to me. Thanks so much for looking into this! Lucia
Solution 2: Force
fixedNum
to the lowest number of events across the FCS files (e.g. if I have 5 FCS files with 4000, 5000, 5500, 6000 and 4500 events each,fixedNum
will be set to 4000.Currently, I've decided to go with solution 2, as I'm of the opinion that it would provide more balanced analysis results if the same amount of data was extracted from each FCS.
So then is there no way to sample with replacement as the documentation suggests? I'd like to set the number to the maximum, not the minimum, number of events across fcs files and upsample.
Hi, How many cells per sample have you got? The sample function could do with replacement, but t is not implemented in cytofkit. One reason is that tSNE dislikes identical cells and there is usually a warning about that unless the corresponding flag has been set to FALSE. So I think there is something bad about identical cells in tSNE and there may be something wrong with the overall cytofkit pipeline as well. Be also warned that computation will be longer as well. If you really want to try it it's not a big deal to replace the code of fixedNum to sample with replacement. Best.
Hi,
I have been using cytofkit for a while and I have noticed that when I try to analyze data with merge method = FIXED and one or more of the FCS files has a lower number of events than the specified fixedNum (i.e., sampling with replacemenet), phenograph gets stuck for a long time in the "Finding nearest neighbors..." step until it crashes.
For example, a run on a single FCS file with 5004 events, merge = FIXED and FixedNum = 10000 ends like this:
Whereas if I run the same dataset with FixNum = 5000, it runs no problem:
Looks like the merge with replacement is done correctly ("Input data of 10000|5000 rows...'), but for some reason Phenograph does not like it.
If I use other merge methods (ceil, all), it runs fine, it only crashes when merge = fixed and the number of events is lower than FixNum.
Also, I saw that there was an unrelated bug in version 1.8.3 (issue #11), so just in case this bug was also specific for version v1.8.3, I tried with versions 1.6.5 and 1.9.4 and in both versions I find the same issue, i.e. phenograph crashing when merging with replacement.
Happy to send example files if you want to try to replicate the problem.
Thanks in advance for your time and for developing such a great package, Best regards, Lucia