bsmith89 / StrainFacts

Factorize metagenotypes to infer strains and their abundances

Use Genotypes generated in a discovery set to deconvolute new metagenomes #4

Open sergioSEa opened 2 years ago

sergioSEa commented 2 years ago

Dear Byron,

First off, I wanted to thank you for developing this great piece of software. We are really excited to use it coupled with the metagenotypes generated by GT-Pro.

We are trying to deconvolute strains in a large population of metagenomes (over 10,000). After processing and merging the GT-Pro results, I found it really challenging to fit a model on this many samples. Even splitting into SNP chunks still requires a large amount of memory just to read the metagenotypes in the first place.

I was wondering whether it is possible to use StrainFacts to define genotypes in a "discovery" subset of the samples and then use those genotypes to quantify strains in the rest. I have been digging around but have not been able to find such a feature. Is this possible? Or alternatively, is this a feature that would be interesting to implement in the future?

Thanks for your help!

Sergio.

bsmith89 commented 2 years ago

Hi Sergio!

Yes, fitting genotypes in one dataset and then their abundances in a second is possible and it's something I've experimented with. I'll try to describe how I would go about it, but first, a couple thoughts based on my experience with this same challenge:

We are trying to deconvolute strains in a large population of metagenomes (over 10,000). After processing and merging the GT-Pro results, I found it really challenging to fit a model on this many samples.

While batch fitting >10,000 metagenomes takes quite a bit of memory for a personal computer, I've found the bigger challenge at that scale to be runtime. It'll definitely be much more feasible on a larger server with a modern GPU (10x the speed or more); in that setting I've been able to fit similarly sized problems (~10,000 metagenomes, ~5,000 SNP positions, ~1,000 strains) in less than 7 GB of GPU memory and a few hours of total runtime.
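
(For a rough sense of scale, here's an illustrative back-of-envelope, assuming float32 tensors; the dimension sizes are just the ones quoted above, and most of the ~7 GB presumably goes to gradients, optimizer state, and intermediates rather than the raw arrays themselves.)

# Illustrative only: raw tensor sizes at roughly the scale described above.
n_samples, n_positions, n_strains, n_alleles = 10_000, 5_000, 1_000, 2
bytes_per_float = 4  # float32

data = n_samples * n_positions * n_alleles * bytes_per_float  # metagenotype counts
community = n_samples * n_strains * bytes_per_float           # per-sample strain fractions
genotype = n_strains * n_positions * bytes_per_float          # per-strain allele fractions

print(f"data      ~{data / 1e9:.2f} GB")       # ~0.40 GB
print(f"community ~{community / 1e9:.3f} GB")  # ~0.040 GB
print(f"genotype  ~{genotype / 1e9:.3f} GB")   # ~0.020 GB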

I'm still actively developing StrainFacts for my own uses. Assuming you're using v0.3 or the GitHub HEAD, I would definitely suggest you use model4, since this is a simplified model that requires less memory and runs much faster. ...Unfortunately, it's not the model benchmarked in the paper and the performance / hyper-parameters are different.

Even splitting into SNP chunks still requires a large amount of memory just to read the metagenotypes in the first place.

I take it you were eventually able to overcome this limitation? If not, there may be some ways to play around with xarray.open_dataset versus xarray.load_dataset to avoid loading the entire matrix into memory. See the loading code in the repository. Let me know if this is still a major issue and I can try to help you find a solution.
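
For example, something along these lines (a minimal, untested sketch; the file path is a placeholder, and the position dimension name is assumed to match the metagenotype layout used in the code further down):

import xarray as xr

# Lazily open the metagenotype file; the data stay on disk until accessed.
ds = xr.open_dataset("metagenotype.nc")

# Read only one chunk of SNP positions into memory at a time.
chunk = ds.isel(position=slice(0, 5000)).load()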

Ignoring computational limitations, getting the model itself to fit this scale of data well is also very challenging. I've definitely struggled to get realistic relative abundance inferences for e.g. Faecalibacterium prausnitzii in the 10k metagenome dataset described in the paper. I'm slowly coming to the conclusion that the strain diversity for F.p. in this dataset is absolutely enormous. Like >5,000 distinguishable strains across 10,000 samples 🤯. So when I try to fit <1,000 strains, many samples simply aren't represented. The model seems to compensate by inferring very heterogeneous samples with multiple strains in order to try and explain the observed alleles.

A few ideas about how to improve fits under these circumstances:

I was wondering whether it is possible to use StrainFacts to define genotypes in a "discovery" subset of the samples

So, yes, but ultimately—if you've got a similarly mind-blowing amount of diversity—you'll miss many of the relevant strains in the discovery subset and then the refitting will be lacking. Same problem with looking for strains that match reference genomes. I think our databases are simply incomplete.

On the other hand, if you've got lots of repeated sampling of the same subjects (like in the HMP2 IBD dataset, for instance), you may be able to get around this problem and your inferences will be much better.

I have been digging around but I have not been able to find such feature. Is this possible to be done?

Okay, so the answer to your actual question. Here's what I would suggest:

If you're willing to do a bit of hacking, the Python API is flexible but poorly documented 😬. The command-line interface is a fairly thin wrapper around workflow functions, and these demonstrate how to build a ParameterizedModel and call estimate_parameters.

Here's an (un-tested) sketch of what that might look like:

import sfacts as sf  # the StrainFacts Python package

# Load the discovery subset and the full metagenotype tables.
mgen_discovery = sf.Metagenotype.load(DISCOVERY_MGEN_PATH)
mgen_full = sf.Metagenotype.load(FULL_MGEN_PATH)

pmodel0 = sf.model.ParameterizedModel(
    sf.model_zoo.NAMED_STRUCTURES['model4'],
    coords=dict(
        sample=mgen_discovery.sample.values,
        position=mgen_discovery.position.values,
        allele=mgen_discovery.allele.values,
        strain=range(1000),
    ),
    hyperparameters=dict(
        gamma_hyper=1e-10,
        rho_hyper=10.,
        rho_hyper2=10.,
        pi_hyper=0.1,
        pi_hyper2=0.1,
    ),
    device='cuda',
    dtype='float32',
).condition(**mgen_discovery.to_counts_and_totals())

est0, history0 = sf.estimation.estimate_parameters(
    pmodel0,
    device='cuda',
    dtype='float32',
    # # Should probably also include:
    # anneal_hyperparameters=anneal_hyperparameters,
    # annealiter=50000,
    # **estimation_kwargs,
)

# Re-parameterize for the full sample set, holding genotypes fixed at the
# discovery-set estimates, and condition on the full metagenotype data.
pmodel1 = pmodel0.with_amended_coords(
    sample=mgen_full.sample.values,      # Update sample coords.
).condition(
    gamma=est0.genotype.values,          # Condition on estimated genotypes.
    **mgen_full.to_counts_and_totals()   # Introduce full data.
)

est1, history1 = sf.estimation.estimate_parameters(
    pmodel1,
    device='cuda',
    dtype='float32',
    # # Should probably also include:
    # anneal_hyperparameters=anneal_hyperparameters,
    # annealiter=50000,
    # **estimation_kwargs,
)
est0.dump(DISCOVERY_OUTPATH)
est1.dump(FULL_FIT_OUTPATH)
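
If what you ultimately want from the full fit is the strain relative-abundance table, something like the following should work (again untested; the community accessor name is my assumption by analogy with the genotype accessor used above, so double-check it against your installed version):

import pandas as pd

# Assumed by analogy with `est0.genotype` above: `est1.community` holding the
# samples-by-strains relative abundance matrix from the full fit.
abundances = pd.DataFrame(
    est1.community.values,
    index=mgen_full.sample.values,
)
abundances.to_csv("full_fit_community.tsv", sep="\t")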

If this approach proves useful to you, I would definitely consider adding it as a CLI subcommand. As of yet, however, I have not had much success with workflows like this. Instead, I've been more focused on scaling the traditional fitting up to datasets of this size.

I'd also be very interested in exploring the dataset you're working on (particularly if it includes repeated sampling of the same subjects). If you would like to consider a more formal collaboration, do reach out. My email is byron.smith@gladstone.ucsf.edu.