ECOS deprecation and increasing frequency of demix errors

ktmeaton commented 1 month ago

As we update our Freyja dataset to the barcodes released in April and May, we are observing an increasing frequency of errors during demix, even when using the --depthcutoff parameter. These errors don't appear when running the same samples against earlier barcodes (e.g. March). I'm wondering whether the new JN sublineages are taking us into strange edge-case territory for certain samples/parameters. Perhaps this is also related to issue https://github.com/andersen-lab/Freyja/issues/237?

The errors are of the "Reached NAN dead end" and "RAN OUT OF ITERATIONS" variety:

It     pcost       dcost      gap   pres   dres    k/t    mu     step   sigma     IR    |   BT
 0  +1.645e-15  +1.810e-13  +3e+04  9e-01  7e-01  1e+00  2e+00    ---    ---    1  1  - |  -  -
 1  +2.203e+00  +3.867e+00  +2e+04  5e-01  2e-01  2e+00  1e+00  0.6023  3e-01   2  1  2 |  0  0
 2  +2.669e+00  +3.368e+00  +4e+03  2e-01  3e-02  7e-01  3e-01  0.8662  9e-02   2  2  1 |  0  0
...
34  +6.429e+00  +6.429e+00  +4e-03  5e-07  3e-08  5e-08  3e-07  0.9890  2e-01   3  2  2 |  0  0
35  +6.429e+00  +6.429e+00  +3e-04  4e-08  3e-09  5e-09  2e-08  0.9224  1e-02   5  3  3 |  0  0
36   -nan   +nan  -nan  -nan  -nan  -nan  -nan  0.9890  1e-04   1  0  0 |  0  0
Reached NAN dead end, recovering best iterate (35) and stopping.

RAN OUT OF ITERATIONS (reached feastol=4.0e-08, reltol=5.1e-05, abstol=3.3e-04).
Runtime: 47.111861 seconds.

demix: Solver error encountered, most likely due to insufficient sequencing depth. Try increasing the --depthcutoff parameter.
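For anyone triaging many samples, a small log scanner may help separate the two failure modes quoted above. This is only a sketch: the marker strings and the tolerance line format are copied from the output in this comment, while `summarize_solver_log` and the constant names are hypothetical helpers that are not part of Freyja or ECOS.

```python
import re

# Failure markers copied from the ECOS output quoted in this issue.
FAILURE_MARKERS = (
    "Reached NAN dead end",
    "RAN OUT OF ITERATIONS",
)

# Matches the tolerances ECOS prints when it runs out of iterations,
# e.g. "(reached feastol=4.0e-08, reltol=5.1e-05, abstol=3.3e-04)".
TOLERANCE_RE = re.compile(
    r"feastol=(?P<feastol>[\d.eE+-]+), "
    r"reltol=(?P<reltol>[\d.eE+-]+), "
    r"abstol=(?P<abstol>[\d.eE+-]+)"
)


def summarize_solver_log(log_text):
    """Return the failure markers found in log_text plus any reported tolerances."""
    failures = [marker for marker in FAILURE_MARKERS if marker in log_text]
    match = TOLERANCE_RE.search(log_text)
    tolerances = {}
    if match:
        # Convert the captured strings (e.g. "4.0e-08") to floats.
        tolerances = {key: float(val) for key, val in match.groupdict().items()}
    return {"failures": failures, "tolerances": tolerances}
```

Feeding it the captured stdout of a demix run gives a dictionary that can be tabulated per sample and per barcode release.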

I've found an assortment of related ECOS issues that seem to concern numerical instability:

Odd numerical issue: https://github.com/embotech/ecos/issues/81
ECOS solver fails to converge on well-conditioned problem: https://github.com/embotech/ecos/issues/187
CVXPY sending inf down to solvers: https://github.com/cvxpy/cvxpy/issues/1470

And when re-installing freyja for this issue, I saw the new CVXPY warning of ECOS's deprecation:

You specified your problem should be solved by ECOS. Starting in
CXVPY 1.6.0, ECOS will no longer be installed by default with CVXPY.
Please either add an explicit dependency on ECOS or switch to our new
default solver, Clarabel, by either not specifying a solver argument
or specifying ``solver=cp.CLARABEL``.

The CVXPY maintainers also cite numerical instability in their ECOS deprecation post, and recommend Clarabel as ECOS's replacement. In my initial testing, it seems that Clarabel does not raise these errors and is a pretty simple swap:

# sample_deconv.py: L170
prob.solve(verbose=True, solver=cp.CLARABEL)
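If a hard switch feels risky, one option is to try several solvers in order and keep the first that succeeds. The sketch below is not Freyja's code: `solve_with_fallback` is a hypothetical helper, and it only assumes that `prob` behaves like a `cvxpy.Problem`, meaning `prob.solve(solver=...)` returns the objective value and sets `prob.status`. The solver names are the strings CVXPY uses for Clarabel, OSQP, and ECOS.

```python
def solve_with_fallback(prob, solvers=("CLARABEL", "OSQP", "ECOS"),
                        ok_statuses=("optimal",), **solve_kwargs):
    """Try each solver in turn; return (solver_name, objective_value) for the first success.

    Assumes `prob` follows the cvxpy.Problem interface (hypothetical helper,
    not part of Freyja or CVXPY).
    """
    errors = {}
    for name in solvers:
        try:
            value = prob.solve(solver=name, **solve_kwargs)
        except Exception as exc:  # e.g. cvxpy.error.SolverError, or solver not installed
            errors[name] = repr(exc)
            continue
        if prob.status in ok_statuses:
            return name, value
        # The solver returned, but with a status we do not accept.
        errors[name] = f"status={prob.status}"
    raise RuntimeError(f"all solvers failed: {errors}")
```

With a real problem this would be called as `solve_with_fallback(prob, verbose=True)`. Passing `ok_statuses=("optimal", "optimal_inaccurate")` would also accept near-solutions, at the cost of hiding cases where a solver stopped short of its tolerances.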

Questions

Have you observed errors of this nature before?
Do you have any recommendations about using Clarabel instead of ECOS?

To Reproduce

I'm on the hunt for a publicly available sample that has the same error to share. If I can find one in the CDC's wastewater dataset, I'll upload the variants and depths.

I'm using freyja=1.5.0 from conda, which now has cvxpy=1.5.1. For comparative barcodes, I'm using 04_02_2024-00-49 (works) and 04_24_2024-00-49 (errors).

I've tested a range of --depthcutoff values (0, 1, 10, 30, 100, 200, 300, 400, 500, 1000). Which exact cutoff triggers the error with ECOS seems to be sample-specific. Interestingly, with Clarabel all cutoffs work (no errors so far).

freyja demix \
  --depthcutoff 30 --covcut 30 \
  --output sample..demix \
  --barcodes usher_barcodes.csv \
  --meta curated_lineages.json \
  --lineageyml lineages.yml \
  sample.variants.tsv \
  sample.depth.tsv
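To repeat the cutoff sweep described above, the commands can be generated rather than typed by hand. This is a minimal sketch that only uses flags appearing in the command above; `build_demix_commands`, the output file naming, and the example paths are hypothetical, and the remaining flags (`--covcut`, `--meta`, `--lineageyml`) can be appended the same way.

```python
from pathlib import Path

# The --depthcutoff values tested in this issue.
DEPTH_CUTOFFS = (0, 1, 10, 30, 100, 200, 300, 400, 500, 1000)


def build_demix_commands(variants, depths, barcodes, outdir, cutoffs=DEPTH_CUTOFFS):
    """Return one `freyja demix` argument list per --depthcutoff value."""
    outdir = Path(outdir)
    commands = []
    for cutoff in cutoffs:
        # Hypothetical naming scheme: one output file per cutoff.
        output = outdir / f"sample.depthcutoff{cutoff}.demix"
        commands.append([
            "freyja", "demix",
            "--depthcutoff", str(cutoff),
            "--output", str(output),
            "--barcodes", str(barcodes),
            str(variants),
            str(depths),
        ])
    return commands
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`, and the captured output checked for the solver error message, once per barcode file being compared.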

joshuailevy commented 1 month ago

Hey @ktmeaton!

Thanks so much for bringing this up, and sorry for the delay in getting back to you. You caught us in a period of lots of travel and multiple papers wrapping up, and I haven't been able to carve out the time to respond properly.

We have definitely noticed errors like the ones you've mentioned (which is partly why we added the error message suggesting users try the --depthcutoff option), but we have stuck with ECOS primarily for the sake of continuity and interpretability: for the most part, the answers it returns are in line with expectations. It's very interesting that Clarabel is able to solve the problem under all of the tested conditions. I'll test it over the next few days with some example files on our end (and will send an example test file as well!). I'm curious how it breaks "ties" when multiple lineages have the same effective barcodes given the available coverage. If it functions more or less the same but suffers less numerical instability and converges more reliably, I'm fully in favor of a like-for-like swap.

Will follow up soon! Josh

joshuailevy commented 3 weeks ago

Thanks again for bringing this up @ktmeaton. I went ahead and did a bunch of testing, including using the data from #237. As you mention, in many of the cases where ECOS fails, Clarabel manages to converge or provide a near-solution (i.e., below the tolerance). The results are identical in most cases, and they behave in the same way when they are unable to distinguish between lineages (without using --depthcutoff), providing equivalent estimated lineage prevalences for these lineages. I also did some testing of other solvers, including OSQP, which also seemed to outperform ECOS but was a bit slower.

I've just made a new Freyja release that includes Clarabel as the default solver, and includes an option to try the other two. :)

ktmeaton commented 1 week ago

Thank you for the update! I just finished testing the v1.5.1 release, and it seems to fix all these errors. Thanks so much!

andersen-lab / Freyja

ECOS deprecation and increasing frequency of demix errors #238
