Create a new check for negative fluorescence values

paulmueller commented 3 years ago

Old versions of Shape-In apparently sometimes stored negative values for fluorescence (fl1_max, fl2_max, fl3_max features). This is a problem, since fluorescence is usually plotted on a logarithmic scale and then the negative values are omitted.

Questions:

@phidahl How does the new version of Shape-In handle <=0 values for fluorescence? Are they just set to 1, or do you add a global offset to the entire fluorescence information (including the trace)?
Not sure whether this should be rated as an alert or as a violation. I guess it depends on the severity of the issue.
Not sure whether this should be an optional parameter in DCKit. Maybe the default, but the user should be able to deselect it?
Any comments from @ZELLMECHANIK-DRESDEN/collaborators are welcome.

Task:

create a new check function in https://github.com/ZELLMECHANIK-DRESDEN/dclab/blob/master/dclab/rtdc_dataset/check.py that checks for negative values in the fl?_max features
create another check function that checks the fl?_max_ctc features
create test functions

What you will learn:

how dclab handles fluorescence data
how dataset check functions work

Notes:

let me know if you need test data
This is a prerequisite of https://github.com/ZELLMECHANIK-DRESDEN/DCKit/issues/11

phidahl commented 3 years ago

Currently ShapeIn handles Flmax values that are <1 by writing 0.1 (its a float) instead of the actual value. This makes sure that log plotting works.

In maintenance of the device an offset can be adjusted per sensor to compensate a baseline deviation from 0.

As far as I know in Flow Cytometry there can be values below 0 for example if there is no or very weak fluorescence the values can also scatter below 0. Modern plotting tools handle this by scaling Biexponentally: https://www.flowjo.com/learn/flowjo-university/flowjo/before-flowjo/59 This scale behaves logarithmic for >1 and linear below. This scalling is very useful since otherwise populations <1 are not visible in log-plots or all points are just plotted on top of each other which withholds information.

Although on the device side we try to avoid values <0, I wouldn't handle fluorescence values < 1 or <0 as errors since these are most likely just belonging to the population around 0 with a certain spread.

MartaUrb commented 3 years ago

I confirm what Philipp said: ShapeIn assigns 0.1 to negative FLmax values.

This is not optimal, since you get a huge artificial peak at 0.1. In the data where you want to look also at the negative/low fluorescence cells, for example to define a threshold for FL+ gate or simply present it, this generates an issue.

After learning that, I was adjusting the offset in ShapeIn if I saw negative signal on any detector for baseline, but for already taken measurements this was an issue.

Maybe the negative values should be written as recorded and corrected by an offset in ShapeOut, rather than assigning a value 0.1 to all already in ShapeIn?

paulmueller commented 3 years ago

I see, so the real solution would actually be to implement #56, because it really only affects plotting.

phidahl commented 3 years ago

Well, cutting off at 0 makes sense from the point of view how the sensor works. There cannot be any negative signal. There is at least 100 ADC noise so putting a threshold below that is questionable in general.

Then the peak finding in ShapIn assumes the baseline at 0. So any values below 0 can in the best case be interpreted as "negative". As the there is no quantitative information in the value, it can be replaced by 0.1.

paulmueller commented 3 years ago

@phidahl I obviously don't have as much insight, but if there is on average 100 ADC noise and Shape-In subtracts 100 to get a baselinie around 0, then there might be events where the e.g. the ADC noise is 50 and the signal is 30, which becomes -20. I think this -20 is what @MartaUrb would still like to see on scatter plots.

phidahl commented 3 years ago

Firstly the noise from the sensor (single photon peaks) are always positive (relative to the base line). See the attached screenshot of raw sensor voltage. Not that here the voltage is inverted. The peaks a caused by single photon events. and have a fixed area. (The are also integer multiples) So also if it would be nice to have a Gaussian distribution around zero, there should not be one with a properly working setup.

With baseline I mean the baseline of the sensor raw voltage in the image above. This is adjusted to match 0 ADC or even a bit more to be sure. not the average over time.

Secondly the fl_max is the result of the peak finding algorithm which will (if nothing goes wrong) always deliver positive values. fl_max is as it names says an operation that looks for an extremal value, so it is natural that the distribution will look different than the distribution of sampled values in the raw signal. I think this is similar to deformation values which by definition cannot be 0, which causes special behavior in scatter plots.

In future releases there will be an automatic determination of the base line, but anyway I would suggest to live with skewed distributions at 0. I'm not sure what one wants to learn from the distribution of values in the negative population. For proper fluorescence detection we recommend aiming for (positive) fluorescence signals > 1000 which makes a possible shift of the negative population mean by +/- 20 small and negligible.

MartaUrb commented 3 years ago

Just to throw it in, I had times when the baseline FLmax value (showed in the ShapeIn/Out) was around -40. I don't know what are the instrumental reasons for such shift of the baseline, but if in the future the baseline is found automatically this will solve the problem.

NegativeFL_Baseline_exp1

paulmueller commented 3 years ago

@phidahl I can see two issues with getting a "huge artificial peak at 0.1" (as Marta named it):

supervised machine learning does probably have issues dealing with such a distribution
visualization: KDE computation is very difficult, distribution is difficult to understand, plots don't look nice

MartaUrb commented 3 years ago

I agree with what Paul just said. Another example: in my transient over-expression experiments, where only a few cells were successfully transfected and FL-positive, I was using the distribution of negative population to orient my cutoff (top row are negative control cells). This is especially useful strategy in case of experiment-to-experiemnt variation of FL values (because of say optics misalignment or also settings), as it feels weird to set a cutoff arbitrary every time. Compare the two experiments below, where I have set the FL offset almost an order of magnitude higher in the 2nd experiment.

experiment 1 20200304_TGBC_FL2histos

experiment 2 20200314_TGBC_FL2histos

MartaUrb commented 3 years ago

And for the record, this is how the histograms look if there is the "most negative cells are assigned to 0.1 value" issue. There is just this line at 0.1 ..... (it is probably cut even here and doesn't show it's full height) 20191219_TGBC_FL2histos

phidahl commented 3 years ago

Hi Marta, the screenshot shows the trace from the sensor. The baseline of the sensor signal is between -40 and -50 (estimated) here. The offset should have been set be set to a value +45 higher than it has been in the Maintenance settings. Otherwise small peaks won't be detected properly. And peak width shouldn't work at all values in that range. For large peaks there would still be a relatively smaller error.

To your second post: I understand that a nicer looking distribution would cause less questions in the interpretation. I guess setting a correct offset in the future automatically would be the best solution. I'm going to implement this for the next release. So that this won't be an issue for future measurement.

How would you determine the cut-off? By saying 99% or x% of the events of the neg. ctl. need to be below it? - In this case the ugly bar at 0.1 wouldn't harm at all. I think it's always best to assume as little as possible, such as the underlying distribution. (The fit was log-normal, right?)

The large differences between the two experiments are (in my opinion) most likely from autofluorescence of the medium or very different laser settings. Were these two experiments done with the same device?

Hi Paul, if the device settings were not ideal, there might be ugly data, yes. The only way to repair the data would be to reevaluate the peaks with an appropriate offset.

Maybe it would be best, to find a way that these algorithms could deal with this, or to re-evaluate the peaks.
For plotting I understand the point why one could tag these data as invalid or unplottable. But ugly plots seem an odd reason to change the way data is determined. I still think that in these cases something went wrong (wrong offset) and this is the reason it looks ugly.
The set to 0.1 is a fallback for wrong input (baseline of detector not correct) It would be best to either throw a warning during measurement, or to determine the setting automagically, which I'm going to do.

I hope that everybody is fine with that.

phidahl commented 3 years ago

And as suggested by @paulmueller here, I would now vote for an "alert" if fl_max < 1 is encountered, it indicates bad settings concerning the fl offset. This might not mean that the data cannot be used at all, but it might need inspection by the user.

paulmueller commented 3 years ago

OK, this is what I take away from this discussion:

In the future we won't have this problem, because Shape-In will correctly set the offset.
For old data, we should issue an "alert" for negative fl_max values (this is the original issue here)
I guess it would also make sense to warn the user for 0.1-values

Now, if we want to "repair" old data (which would be done in https://github.com/ZELLMECHANIK-DRESDEN/DCKit/issues/11), then we should:

Determine the global offset by fitting a Poisson distribution or taking a median or whatever from all traces
Only correct offsets that are <0 (as per @chrherold's comment below).
Add this offset to all the traces
Recompute fl_max for all events (we can just extract them, because we have the fl_pos feature)

@phidahl Do you see any issues with this approach?

chrherold commented 3 years ago

Couple things to add:

the "repairing" data for essentially "0" values will be a bit weird. because there will not be a "true" peak to find, It will be picking some noise value somewhere. But it could be done the way you suggest. Just keep in mind that the "baseline" may shift depending on the amount of auto-fluorescence or other background light collection specific to a sample (It may not be the "dark" baseline). If you constrict "repair" to <0 baselines this will not be an issue in practice.

dealing with <0 fl_max values will always become an issue after fl cross-talk corrections. So plotting and evaluation routines should anyway be able to deal with it.

maxschloegel commented 3 years ago

I just realized that by introducing new dataset-checks, such as checks for negative fl?_max values, one of the test-functions got broken, namely test_exact(). In test_exact() the produced warnings are compared to a set of hard coded warnings and differences between the two lead to failure of the test. These differences are introduced by the new dataset_checks. @paulmueller what is the best approach to this? I could add the check-alert to the hard coded alerts, but this does not seem efficient and might lead to bloated code when repeated more often. Any other idea I have (e.g. removing the new check-alert by hand in the test_exact-function) lead to the same bloating. I could create a new dataset on which the test_exact()-function runs, but that would take me some time to generate a good one.

paulmueller commented 3 years ago

I would update the hard-coded warnings in the test_exact() function. It's also a nice way of keeping track how data quality improves over time :smile:.

maxschloegel commented 3 years ago

Ok! Nice way to look at it :)

DC-analysis / dclab

Create a new check for negative fluorescence values #101