bsmith89 / StrainFacts

Factorize metagenotypes to infer strains and their abundances

Use Genotypes generated in a discovery set to deconvolute new metagenomes #4

Open sergioSEa opened 2 years ago

sergioSEa commented 2 years ago

Dear Byron,

First off, I wanted to thank you for developing this great piece of software. We are really excited to use it coupled with the metagenotypes generated by GT-Pro.

We are trying to deconvolute strains in a large population of metagenomes (over 10,000). After processing and merging the GT-Pro results, I found it really challenging to fit a model on this many samples. Even splitting into SNP chunks still requires a large amount of memory just to read the metagenotypes in the first place.

I was wondering whether it is possible to use StrainFacts to define genotypes in a "discovery" subset of the samples and then use those genotypes to quantify strains in the rest. I have been digging around but have not been able to find such a feature. Is this possible? Or alternatively, is this a feature that would be interesting to implement in the future?

Thanks for your help!

Sergio.

bsmith89 commented 2 years ago

Hi Sergio!

Yes, fitting genotypes in one dataset and then their abundances in a second is possible and it's something I've experimented with. I'll try to describe how I would go about it, but first, a couple thoughts based on my experience with this same challenge:

We are trying to deconvolute strains in a large population of metagenomes (over 10,000). After processing and merging the GT-Pro results, I found it really challenging to fit a model on this many samples.

While batch fitting >10,000 metagenomes takes quite a bit of memory for a personal computer, I've found the bigger challenge at that scale to be runtime. It'll definitely be much more feasible on a larger server with a modern GPU (10x the speed or more); in that setting I've been able to fit similarly sized problems (~10,000 metagenomes, ~5,000 SNP positions, ~1,000 strains) in less than 7 GB of GPU memory and a few hours of total runtime.
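
(For a rough sense of scale, here's an illustrative back-of-envelope, assuming float32 tensors; the dimension sizes are just the ones quoted above, and most of the ~7 GB presumably goes to gradients, optimizer state, and intermediates rather than the raw arrays themselves.)

# Illustrative only: raw tensor sizes at roughly the scale described above.
n_samples, n_positions, n_strains, n_alleles = 10_000, 5_000, 1_000, 2
bytes_per_float = 4  # float32

data = n_samples * n_positions * n_alleles * bytes_per_float  # metagenotype counts
community = n_samples * n_strains * bytes_per_float           # per-sample strain fractions
genotype = n_strains * n_positions * bytes_per_float          # per-strain allele fractions

print(f"data      ~{data / 1e9:.2f} GB")       # ~0.40 GB
print(f"community ~{community / 1e9:.3f} GB")  # ~0.040 GB
print(f"genotype  ~{genotype / 1e9:.3f} GB")   # ~0.020 GB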

I'm still actively developing StrainFacts for my own uses. Assuming you're using v0.3 or the GitHub HEAD, I would definitely suggest you use model4, since this is a simplified model that requires less memory and runs much faster. ...Unfortunately, it's not the model benchmarked in the paper and the performance / hyper-parameters are different.

Even splitting into SNP chunks still requires a large amount of memory just to read the metagenotypes in the first place.

I take it you were eventually able to overcome this limitation? If not, there may be some ways to play around with xarray.open_dataset versus xarray.load_dataset to avoid loading the entire matrix into memory. See the loading code in the repository. Let me know if this is still a major issue and I can try to help you find a solution.
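
For example, something along these lines (a minimal, untested sketch; the file path is a placeholder, and the position dimension name is assumed to match the metagenotype layout used in the code further down):

import xarray as xr

# Lazily open the metagenotype file; the data stay on disk until accessed.
ds = xr.open_dataset("metagenotype.nc")

# Read only one chunk of SNP positions into memory at a time.
chunk = ds.isel(position=slice(0, 5000)).load()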

Ignoring computational limitations, getting the model itself to fit this scale of data well is also very challenging. I've definitely struggled to get realistic relative abundance inferences for e.g. Faecalibacterium prausnitzii in the 10k metagenome dataset described in the paper. I'm slowly coming to the conclusion that the strain diversity for F.p. in this dataset is absolutely enormous. Like >5,000 distinguishable strains across 10,000 samples 🤯. So when I try to fit <1,000 strains, many samples simply aren't represented. The model seems to compensate by inferring very heterogeneous samples with multiple strains in order to try and explain the observed alleles.

A few ideas about how to improve fits under these circumstances:

I was wondering whether it is possible to use StrainFacts to define genotypes in a "discovery" subset of the samples

So, yes, but ultimately—if you've got a similarly mind-blowing amount of diversity—you'll miss many of the relevant strains in the discovery subset and then the refitting will be lacking. Same problem with looking for strains that match reference genomes. I think our databases are simply incomplete.

On the other hand, if you've got lots of repeated sampling of the same subjects (like in the HMP2 IBD dataset, for instance), you may be able to get around this problem and your inferences will be much better.

I have been digging around but I have not been able to find such feature. Is this possible to be done?

Okay, so the answer to your actual question. Here's what I would suggest:

If you're willing to do a bit of hacking, the Python API is flexible but poorly documented 😬. The command-line interface is a fairly thin wrapper around workflow functions, and these demonstrate how to build a ParameterizedModel and call estimate_parameters.

Here's an (un-tested) sketch of what that might look like:

import sfacts as sf  # the StrainFacts Python package

# Load the discovery subset and the full metagenotype tables.
mgen_discovery = sf.Metagenotype.load(DISCOVERY_MGEN_PATH)
mgen_full = sf.Metagenotype.load(FULL_MGEN_PATH)

pmodel0 = sf.model.ParameterizedModel(
    sf.model_zoo.NAMED_STRUCTURES['model4'],
    coords=dict(
        sample=mgen_discovery.sample.values,
        position=mgen_discovery.position.values,
        allele=mgen_discovery.allele.values,
        strain=range(1000),
    ),
    hyperparameters=dict(
        gamma_hyper=1e-10,
        rho_hyper=10.,
        rho_hyper2=10.,
        pi_hyper=0.1,
        pi_hyper2=0.1,
    ),
    device='cuda',
    dtype='float32',
).condition(**mgen_discovery.to_counts_and_totals())

est0, history0 = sf.estimation.estimate_parameters(
    pmodel0,
    device='cuda',
    dtype='float32',
    # # Should probably also include:
    # anneal_hyperparameters=anneal_hyperparameters,
    # annealiter=50000,
    # **estimation_kwargs,
)

# Re-parameterize for the full sample set, holding genotypes fixed at the
# discovery-set estimates, and condition on the full metagenotype data.
pmodel1 = pmodel0.with_amended_coords(
    sample=mgen_full.sample.values,      # Update sample coords.
).condition(
    gamma=est0.genotype.values,          # Condition on estimated genotypes.
    **mgen_full.to_counts_and_totals()   # Introduce full data.
)

est1, history1 = sf.estimation.estimate_parameters(
    pmodel1,
    device='cuda',
    dtype='float32',
    # # Should probably also include:
    # anneal_hyperparameters=anneal_hyperparameters,
    # annealiter=50000,
    # **estimation_kwargs,
)
est0.dump(DISCOVERY_OUTPATH)
est1.dump(FULL_FIT_OUTPATH)
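
If what you ultimately want from the full fit is the strain relative-abundance table, something like the following should work (again untested; the community accessor name is my assumption by analogy with the genotype accessor used above, so double-check it against your installed version):

import pandas as pd

# Assumed by analogy with `est0.genotype` above: `est1.community` holding the
# samples-by-strains relative abundance matrix from the full fit.
abundances = pd.DataFrame(
    est1.community.values,
    index=mgen_full.sample.values,
)
abundances.to_csv("full_fit_community.tsv", sep="\t")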

If this approach proves useful to you, I would definitely consider adding it as a CLI subcommand. As of yet, however, I have not had much success with workflows like this. Instead, I've been more focused on scaling the traditional fitting up to datasets of this size.

I'd also be very interested in exploring the dataset you're working on (particularly if it includes repeated sampling of the same subjects). If you would like to consider a more formal collaboration, do reach out. My email is byron.smith@gladstone.ucsf.edu.