bsmith89 / StrainFacts

Factorize metagenotypes to infer strains and their abundances
MIT License
11 stars 1 forks source link

Running model evaluations for non simulation data #11

Open elximo opened 1 year ago

elximo commented 1 year ago

Hello Byron,

Thanks for your help with the loading and fitting the data. I need some help getting the evaluation metrics generated. Since I have no ground truth, what should be the correct way to obtain them? I am using the same dataset I shared with you in #7

When I run the following command sfacts evaluate_fit --outpath SampleM.eval_all_fits.tsv SampleM.filt.mgen.nc SampleM.filt.ss-0.fit.world.nc SampleM.filt.ss-0.fit2.world.nc SampleM.filt.fit3.world.nc

I got the following error

Traceback (most recent call last):
  File "/Users/username/mambaforge/envs/sfacts-dev/bin/sfacts", line 33, in <module>
    sys.exit(load_entry_point('StrainFacts', 'console_scripts', 'sfacts')())
  File "/Users/username/software/StrainFacts/sfacts/app/__init__.py", line 1116, in main
    args._subcommand(args)
  File "/Users/username/software/StrainFacts/sfacts/app/components.py", line 254, in __init__
    self.run(args)
  File "/Users/username/software/StrainFacts/sfacts/app/__init__.py", line 1049, in run
    metrics = sf.workflow.evaluate_fit_against_metagenotype(ref, fit)
  File "/Users/username/software/StrainFacts/sfacts/workflow.py", line 203, in evaluate_fit_against_metagenotype
    ref = ref.sel(position=fit.position.astype(int), sample=fit.sample.astype(int))
  File "/Users/username/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/common.py", line 1331, in astype
    return apply_ufunc(
  File "/Users/username/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/computation.py", line 1204, in apply_ufunc
    return apply_dataarray_vfunc(
  File "/Users/username/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/computation.py", line 315, in apply_dataarray_vfunc
    result_var = func(*data_vars)
  File "/Users/username/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/computation.py", line 771, in apply_variable_ufunc
    result_data = func(*input_data)
  File "/Users/username/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/duck_array_ops.py", line 187, in astype
    return data.astype(dtype, **kwargs)
ValueError: invalid literal for int() with base 10: 'sample3'

When I ran the same commands as in the evaluation notebook

sim = sf.World.load('SampleM.filt.mgen.nc')
fit = sf.World.load('SampleM.filt.ss-0.fit.world.nc')
fit = sf.World(fit.data.assign_coords(position=fit.position.astype(int)))

sim = sim.sel(position=fit.position.astype(int), sample=fit.sample.astype(int))

I get exactly the same exception with a more elaborate trace


sim = sim.sel(position=fit.position.astype(int), sample=fit.sample.astype(int))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 sim = sim.sel(position=fit.position.astype(int), sample=fit.sample.astype(int))

File ~/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/common.py:1331, in DataWithCoords.astype(self, dtype, order, casting, subok, copy, keep_attrs)
   1328 kwargs = dict(order=order, casting=casting, subok=subok, copy=copy)
   1329 kwargs = {k: v for k, v in kwargs.items() if v is not None}
-> 1331 return apply_ufunc(
   1332     duck_array_ops.astype,
   1333     self,
   1334     dtype,
   1335     kwargs=kwargs,
   1336     keep_attrs=keep_attrs,
   1337     dask="allowed",
   1338 )

File ~/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/computation.py:1204, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, *args)
   1202 # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
   1203 elif any(isinstance(a, DataArray) for a in args):
-> 1204     return apply_dataarray_vfunc(
   1205         variables_vfunc,
   1206         *args,
   1207         signature=signature,
   1208         join=join,
   1209         exclude_dims=exclude_dims,
   1210         keep_attrs=keep_attrs,
   1211     )
   1212 # feed Variables directly through apply_variable_ufunc
   1213 elif any(isinstance(a, Variable) for a in args):

File ~/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/computation.py:315, in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    310 result_coords, result_indexes = build_output_coords_and_indexes(
    311     args, signature, exclude_dims, combine_attrs=keep_attrs
    312 )
    314 data_vars = [getattr(a, "variable", a) for a in args]
--> 315 result_var = func(*data_vars)
    317 out: tuple[DataArray, ...] | DataArray
    318 if signature.num_outputs > 1:

File ~/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/computation.py:771, in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args)
    766     if vectorize:
    767         func = _vectorize(
    768             func, signature, output_dtypes=output_dtypes, exclude_dims=exclude_dims
    769         )
--> 771 result_data = func(*input_data)
    773 if signature.num_outputs == 1:
    774     result_data = (result_data,)

File ~/mambaforge/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/duck_array_ops.py:187, in astype(data, dtype, **kwargs)
    185     xp = get_array_namespace(data)
    186     return xp.astype(data, dtype, **kwargs)
--> 187 return data.astype(dtype, **kwargs)

ValueError: invalid literal for int() with base 10: 'sample3'

Since the exception is not triggered from your code then it is likely I am doing something wrong with the choice of the files or I should have used a different function other than sim.sel(). Could you please point me to the right function to use (or the right sfacts evaluate_fit command to run) for my case? This is greatly appreciated.

bsmith89 commented 1 year ago

Hey @elximo, thanks for your patience and useful bug report.

I'm taking a look at your error now. evaluate_fit isn't the most carefully designed sub-command, so I may have assumed that samples were indexed by integers, as in the simulation. If that's it, the fix should be pretty straight-forward, and I'll push a patch.

Stay tuned.

bsmith89 commented 1 year ago

Okay, I think I may have fixed your bug. Please try running your example again with the most recent commit, and let me know if it works. If not, I'll be happy to do some more digging with your minimal example in hand.

You also asked:

I need some help getting the evaluation metrics generated. Since I have no ground truth, what should be the correct way to obtain them?

Evaluate fit has only one accuracy index that can be run without ground-truth. It's what I've been calling the "metagenotype error", and is effectively the absolute difference between the observed number of counts for each allele and the expected number. You can find the code for that function, here. Some types of fitting problems will show up as large metagenotype errors (both overall and within individual samples).

The second approach is less formal, and involves visual inspection of the metagenotypes and inferences. Take a look at the evaluation example notebook for some ideas about how you might do that. Unfortunately I haven't yet found any one-size-fits-all way to assess model performance or tuning, and manual inspection is still my standard approach.

While I have yet to update the default settings, I have recently been getting the most consistent performance with the following fitting parameters:

python3 -m sfacts fit \
    --model-structure model4
    --num-strains <S> \
    --hyperparameters gamma_hyper=1e-15 pi_hyper=0.01 pi_hyper2=0.01 rho_hyper=1.0 rho_hyper2=1.0
    --anneal-hyperparameters gamma_hyper=0.999
    --anneal-steps 120000
    --optimizer-learning-rate 0.05
    --min-optimizer-learning-rate 1e-2
    -- <INPATH> <OUTPATH>

With --num-strains equal to about double the number of strains you expect.

If you are fitting on a CPU, you might need to reduce the number of annealing steps. (Try 10,000 to start, and increase the number of steps if it is fast enough.)

Hope this helps!