Failed positive_counts constraint when loading metagenotype data from TSV with missing values.

elximo commented 1 year ago

Hi there, I have been trying to run Strain Facts. The simulation data in the example run fine (with some Runtime warnings). However, when I try to run on data from GT-Pro sample example, I have errors on the first fitting as follows. I made sure I formatted the table and loaded it as mentioned in #5

This is the command I run

sfacts fit \
    --verbose \
    --num-strains 3 \
    --random-seed 1 \
    firstSample.$species.filt.ss-0.mgen.nc firstSample.$species.filt.ss-0.fit.world.nc

and after about few seconds from running, I get the following exception.

2023-04-27 11:25:47,001 START: Fitting 3 strains with data shape Frozen({'sample': 1, 'position': 11972, 'allele': 2}).
  0%|| 0/1000000 [00:00<?, ?it/s]/server/home/username/software/StrainFacts/sfacts/model_zoo/components.py:310: UserWarning: Using LogTriangle as an approximation for random sampling. This is probably a bad idea.
  warnings.warn(
  0%|| 3481/1000000 [00:13<1:05:43, 252.69it/s, NLP=+9.96e+04, delta=+0.00e+00, lag100=+0.00e+00, lr=+7.63e-07]
2023-04-27 11:26:00,791 Converged: NLP=9.96485e+04
/server/home/username/software/StrainFacts/sfacts/model_zoo/components.py:310: UserWarning: Using LogTriangle as an approximation for random sampling. This is probably a bad idea.
  warnings.warn(
Traceback (most recent call last):
  File "/server/home/username/miniconda3/envs/sfacts-dev/bin/sfacts", line 33, in <module>
    sys.exit(load_entry_point('StrainFacts', 'console_scripts', 'sfacts')())
  File "/server/home/username/software/StrainFacts/sfacts/app/_init_.py", line 1129, in main
    args._subcommand(args)
  File "/server/home/username/software/StrainFacts/sfacts/app/components.py", line 254, in _init_
    self.run(args)
  File "/server/home/username/software/StrainFacts/sfacts/app/_init_.py", line 667, in run
    est, history = sf.workflow.fit_metagenotype_complex(
  File "/server/home/username/software/StrainFacts/sfacts/workflow.py", line 99, in fit_metagenotype_complex
    est, history = sf.estimation.estimate_parameters(
  File "/server/home/username/software/StrainFacts/sfacts/estimation.py", line 294, in estimate_parameters
    model.with_hyperparameters(
  File "/server/home/username/software/StrainFacts/sfacts/model.py", line 306, in format_world
    out[k] = xr.DataArray(
  File "/server/home/username/miniconda3/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/dataarray.py", line 418, in _init_
    coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
  File "/server/home/username/miniconda3/envs/sfacts-dev/lib/python3.9/site-packages/xarray/core/dataarray.py", line 132, in _infer_coords_and_dims
    raise ValueError(
ValueError: different number of dimensions on data and dims: 1 vs 2

Let me know if you need any details. Thanks.

bsmith89 commented 1 year ago

Hi! Thanks for submitting a report.

You say:

When I try to run on data from GT-Pro sample example [...] I made sure I formatted the table and loaded it as mentioned in https://github.com/bsmith89/StrainFacts/issues/5

Do you mean you're working with data from the GT-Pro tutorial or with your own data that you've obtained from GT-Pro? Either way, what are its dimensions after filtering?

Can you either send me the file firstSample.$species.filt.ss-0.mgen.nc that fails, or tell me the output of sfacts data_info firstSample.$species.filt.ss-0.mgen.nc?

The name of your metagenotype file suggests that it may be from only one sample rather than a concatenation of multiple. The error also suggests that the estimated parameters have an unexpected dimensionality. I'm thinking that maybe you either passed the metagenotype of one sample or filtering removed all but one...?

Let me know if that's it. :)

Thanks!

elximo commented 1 year ago

Thanks a lot for your quick response!

The name of your metagenotype file suggests that it may be from only one sample rather than a concatenation of multiple.

Yep. It is just one sample. This is also the same thing I was doing with my own data.

Either way, what are its dimensions after filtering?

For with the GT-Pro example, all species got heavily filtered (leaving like 60 alleles), so I resorted to using the full set (replacing the filtering command with cp to keep consistency of names for pipeline flow).

So now I am back to using my own data. When I load each sample separately (using the straightforward metagenotype 4-column format), it works fine. I merged 2 samples, keeping just one header. When I load the merged file I now get the following error (Also using different machine; MacOS M1 running conda with rosetta)

Traceback (most recent call last):
  File "/Users/username/mambaforge/envs/sfacts-dev/bin/sfacts", line 33, in <module>
    sys.exit(load_entry_point('StrainFacts', 'console_scripts', 'sfacts')())
  File "/Users/username/software/StrainFacts/sfacts/app/__init__.py", line 1116, in main
    args._subcommand(args)
  File "/Users/username/software/StrainFacts/sfacts/app/components.py", line 254, in __init__
    self.run(args)
  File "/Users/username/software/StrainFacts/sfacts/app/__init__.py", line 92, in run
    values.append(sf.data.Metagenotype.load_from_tsv(args.metagenotype))
  File "/Users/username/software/StrainFacts/sfacts/data.py", line 110, in load_from_tsv
    return cls._post_load(data)
  File "/Users/username/software/StrainFacts/sfacts/data.py", line 102, in _post_load
    result.validate_constraints()
  File "/Users/username/software/StrainFacts/sfacts/data.py", line 191, in validate_constraints
    constraint.raise_error(self.data)
  File "/Users/username/software/StrainFacts/sfacts/data.py", line 42, in raise_error
    raise DataConstraintError(f"Failed constraint: {self.name}")
sfacts.data.DataConstraintError: Failed constraint: positive_counts

Could this be an issue with how I generated the counts (metagenotype) column? I am using ratios and rounding them to 10 like the example from simulation. I appreciate your help and guidance.

bsmith89 commented 1 year ago

Yep. It is just one sample. This is also the same thing I was doing with my own data.

I believe your original error (ValueError: different number of dimensions on data and dims: 1 vs 2) was because you only included one sample in your data. However, I will know better if you can send me the data file or at least the output of sfacts data_info firstSample.$species.filt.ss-0.mgen.nc.

StrainFacts requires data from multiple samples to operate, not just one. Apologies for the cryptic error message. I'll add a check for data dimensionality before filtering. #8

So now I am back to using my own data. When I load each sample separately (using the straightforward metagenotype 4-column format), it works fine. I merged 2 samples, keeping just one header. When I load the merged file I now get the following error

This is a different issue from before, and the error sfacts.data.DataConstraintError: Failed constraint: positive_counts suggests that it's a problem with the values you're using. The model is intended to fit "raw" metagenotypes, so not ratios but the counts of reads where each SNV is observed. If your data file includes negative values, missing values, NaNs, or infs, that would explain the error you're getting.

I will be able to diagnose both of these issues more confidently if you can send me your example files.

elximo commented 1 year ago

I believe your original error (ValueError: different number of dimensions on data and dims: 1 vs 2) was because you only included one sample in your data. However, I will know better if you can send me the data file or at least the output of sfacts data_info firstSample.$species.filt.ss-0.mgen.nc.

The output is firstSample.100196.filt.ss-0.mgen.nc 1 11972 2

I will be able to diagnose both of these issues more confidently if you can send me your example files.

Here I merged the two samples together. I also fixed the metagenotype to use allele read count. But I still get the same positive_counts error while trying to load.

SampleMerge_core_150bp-DP10_strainFacts.txt.gz

bsmith89 commented 1 year ago

Okay, found your problem. There are positions in the metagenotype for sample3 that aren't in sample4 and vice-versa. As a result, when the data gets loaded these missing values are NaNs, which don't pass the positive_counts check on the data.

I believe the correct/intuitive behavior would have been to fill those NaNs with 0s on loading. I'll push a patch fixing this bug momentarily.

bsmith89 / StrainFacts

Failed positive_counts constraint when loading metagenotype data from TSV with missing values. #7