RGLab / FAUST

Full annotation using shape-constrained trees
GNU General Public License v3.0
Can lower bound in FAUST step five be negative? #15

Open zonglunli7515 opened 4 years ago

zonglunli7515 commented 4 years ago

Hi,

In step five of the worked example, you set the lower bound to 0. I want zeros to be considered, so I changed 0 to -0.01. However, the code chunk in step seven now keeps running without any warning or error message, and has been running for more than 5 hours. Previously, with the lower bound equal to 0, the chunk finished within 10 minutes. Is this because considering the 0s involves too much computation, or simply because the algorithm doesn't accept negative values?

P.S. I noted a tiny contradiction in step five. At the very beginning, you said

"For example, we would expect any cell populations with a median fluorescence intensity (MFI) below 0 in a channel to be annotated as “Low” for that channel."

However, later you stated

"Expression values in a channel less than or equal to the value in the “Low” row are treated as low, by default, and not actively considered when FAUST processes the data."

So are 0s actively considered by your algorithm or not?

Thanks in advance for your help.

Allen

gfinak commented 4 years ago

Hi,

Negative values are allowed, and values are considered if they are admitted by the channel bounds. The increased computation time is caused by the large number of zeros, which introduce many ties in the data set. @evangreene can provide more insight into this. What is your goal in including the zeros? Knowing this might help us find a solution.

Greg Finak
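
To illustrate the point about zeros and ties, here is a minimal Python sketch on synthetic data. It is not FAUST code: the `admitted` and `tie_count` helpers, the 60% zero fraction, and the upper bound of 10 are all assumptions made for illustration. The only rule taken from the thread is that values less than or equal to the lower bound are treated as "Low".

```python
import random
from collections import Counter

def admitted(values, low, high):
    """Keep values strictly inside (low, high).

    Values <= low are treated as "Low" (as quoted from step five);
    the strict upper comparison is an assumption for this sketch.
    """
    return [v for v in values if low < v < high]

def tie_count(values):
    """Number of observations that share their value with at least one other."""
    counts = Counter(values)
    return sum(c for c in counts.values() if c > 1)

rng = random.Random(0)
# Hypothetical zero-inflated channel: about 60% exact zeros, the rest positive.
channel = [0.0 if rng.random() < 0.6 else rng.uniform(0.5, 8.0)
           for _ in range(10_000)]

for low in (0.0, -0.01):
    kept = admitted(channel, low, high=10.0)
    print(f"low={low}: admitted={len(kept)}, tied={tie_count(kept)}")
```

With the lower bound at 0 the exact zeros are excluded; with -0.01 every zero is admitted and all of them are tied with each other, which is the situation described above.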

zonglunli7515 commented 4 years ago

Hi Greg,

Thanks for your continuing assistance.

It seems that the program kept running until the server was down.

Best,

Allen

evangreene commented 4 years ago

Hi @zl7515,

Can you provide some details about the dataset you are analyzing? In particular, how many samples are in the gating set, and how many cells (on average) are in the starting node of your analysis across experimental units?

When processing CyTOF datasets, you can set the lower bound in the channel bounds matrix to a negative value in order to explicitly model the spike at zero, but this will increase processing time. The increase in processing time can be ameliorated by increasing the setting of the threadNum parameter to a value supported by your server.

Thanks, Evan

zonglunli7515 commented 4 years ago

Hi Evan,

Thanks for your speedy response. I'm now testing FAUST on a single sample with roughly 500,000 cells and 19 markers. I will change the thread number to maximum.

Best,

Allen

zonglunli7515 commented 4 years ago

Hi Evan,

I have changed the thread number to the maximum, but the program again seems to be running with no end in sight. I suspect that the algorithm is very slow when 0s are considered and that the job will eventually be kicked off the server. Such long jobs are unfair to other researchers, so the server took some mandatory action. I will try it on a virtual machine later.

Have you ever run FAUST with 0s on any server?

Thanks.

Allen

evangreene commented 4 years ago

Hi Allen,

Yes, I have processed CyTOF datasets with FAUST where I modified the channel bounds so that the zeros could be modeled. It can be computationally demanding: how many threads are you able to allocate for the computation? Beyond the zeros, increased computational time may be a result of the modal structure of your dataset. However, it's difficult to say whether that is occurring here without actually seeing the dataset itself.

Thanks, Evan

SamGG commented 4 years ago

Hi,

@evangreene I thought Raphael told me that FAUST is a non-parametric approach. Am I wrong? To avoid ties, you could randomize the zeros using a uniform distribution between -1 and 0. Such an approach is advocated by E. Newell. Alternatively, I have a pre-processing method (article in preparation) that reshapes the zeros more softly.

@zl7515 Allen, if you agree to share a data file with me, I could return a processed file to you so that you can run a new trial with FAUST. If the trial succeeds, I will give you the software so that you can work on your files confidentially.

Best, Samuel
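
The zero-randomization idea mentioned above (replace each exact zero with a draw from a uniform distribution between -1 and 0) can be sketched in Python as follows. This is only an illustration of that suggestion under stated assumptions: `jitter_zeros` is a hypothetical helper, and it is neither the pre-processing method in preparation nor a description of what FAUST does internally.

```python
import random

def jitter_zeros(values, rng, low=-1.0, high=0.0):
    """Replace each exact zero with a draw from Uniform(low, high).

    Non-zero values are returned unchanged, so only the spike at
    zero is spread out and its ties are broken.
    """
    return [rng.uniform(low, high) if v == 0 else v for v in values]

rng = random.Random(1)  # fixed seed so the example is reproducible
raw = [0.0, 0.0, 0.0, 1.5, 3.2]
jittered = jitter_zeros(raw, rng)
print(jittered)
```

After jittering, the former zeros take distinct negative values, so a lower bound at or below -1 would admit all of them without ties.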

gfinak commented 4 years ago

@SamGG FAUST already randomizes ties internally.

evangreene commented 4 years ago

Hi @SamGG,

FAUST is non-parametric. The channelBounds parameter controls the range of expression over which testing and density estimates are computed for a given marker. Values outside this range are treated as default "lowest" and "highest" expression categories (where the number of expression categories is determined by FAUST at the end of the procedure). Cells with values outside this range are not dropped from the dataset; rather, they are ignored when mode testing is performed and density estimates are computed for the affected marker. In flow cytometry datasets, this parameter can be used to incorporate information from controls (when they are available).
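
As a rough illustration of the behaviour described above, the following Python sketch labels each value relative to a pair of bounds without dropping any cell. It is a sketch under assumptions, not FAUST's implementation: the `categorize` helper and the label names are hypothetical, and treating values greater than or equal to the upper bound as "highest" is assumed by symmetry with the documented "less than or equal to" rule for the lower bound.

```python
def categorize(values, low, high):
    """Label every value relative to the bounds; no cell is dropped."""
    labels = []
    for v in values:
        if v <= low:
            labels.append("lowest")    # default low category, not tested
        elif v >= high:
            labels.append("highest")   # assumed symmetric rule for the top
        else:
            labels.append("in_range")  # used for mode tests and densities
    return labels

values = [0.0, 0.0, 0.3, 2.5, 7.1, 12.0]

# Lower bound at 0: the exact zeros fall in the default "lowest" category.
print(categorize(values, low=0.0, high=10.0))

# Lower bound at -0.01: the zeros are now inside the range.
print(categorize(values, low=-0.01, high=10.0))
```

Every input value receives a label in both cases; only the set of values that would take part in testing changes with the bounds.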

For CyTOF datasets where randomization has not been performed (by the instrument or by the analyst), FAUST will treat the exact 0's as the known "lowest" expression category by default. By modifying the channelBounds parameter to include negative values for each marker, FAUST will then randomize those values and include them when it tests for multimodality and computes density estimates (this happens multiple times when it computes the annotation bounds for the markers, as well as during phenotype discovery). The randomization performed by FAUST is needed since FAUST repeatedly uses the dip test of Hartigan & Hartigan to test the null of unimodality along a marker.

Hope this helps, Evan

SamGG commented 4 years ago

Thanks for your answers. I think you did your best. My questions are just out of curiosity, because I feel there is a point I'm missing.

a) Would the computing time be the same if FAUST ran on a dataset of 500k flow cytometry points and on a dataset of 500k mass cytometry points, when in both cases all 500k points fall within the bounds?

b) In your experience or opinion, what is the benefit of including the zeros rather than setting the lower bound at 0?

c) Is the randomization done once or many times? I mean the randomization itself.

Best, Samuel

zonglunli7515 commented 4 years ago

@SamGG @evangreene @gfinak Many thanks for your input and your offer for the data examination. I need to check with my boss.

Best,

Allen