flatironinstitute / catvae

Categorical Variational Autoencoders

advice for using catvae in a new context #74

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago

Hello! I saw your recent preprint, "Scalable estimation of microbial co-occurrence networks with Variational Autoencoders", and I'm hopeful your method may solve my issues, but I wanted to touch base to see if you think this method is appropriate for my use cases and the scale of the problems I'm hoping to address.

Use cases

  1. Detecting contaminant pairs that co-occur in microbial genomes: We built a tool to detect and remove contamination from genomes and metagenome-assembled genomes. We're running this tool on the ~350k genomes in GTDB rs202. We detect contamination at the order level in about 15% of genomes. We want to use the order-level lineages detected in each sample to determine if any contaminants co-occur more than would be expected by chance.

  2. Detecting contaminant pairs that co-occur in "isolate" RNA-seq data sets: We're creating a compendium of bacterial and archaeal isolate RNA-seq data. There are ~60k isolate data sets on the SRA. As part of this, we're looking for contamination in these isolates, and we generally find that there is some (usually 2-10 species detected per sample). We want to know if any species co-occur (e.g. is Faecalibacterium prausnitzii likely to be contaminated with its friend Roseburia inulinivorans?).

Questions

  1. Is this method appropriate for these use cases? If not, have you encountered something else that might work?
  2. What would be the training data? In both cases, I'm identifying lineages present in a sample using GTDB rs202 as the reference.
  3. Can this approach scale to 60k and 350k samples?

I'd appreciate any insights/feedback you'd be willing to give!

mortonjt commented 2 years ago

Hi @taylorreiter, thank you for reaching out - it sounds like a cool problem.

  1. Could you clarify the input data type? Are you dealing with a count table of genomes x genes, genomes x kmers, samples x microbes? If samples x microbes, how are you doing the taxonomic profiling?
  2. This is an unsupervised method (it's basically PCA tailored to count data). So the training data is really a table of counts (with biom primarily being used to efficiently store sparse representations of these counts).
  3. It largely depends on the column dimensionality. If the column dimensionality is big (i.e. ~100k), then we'd need bigger guns, like multi-node GPU support. If the column dimensionality is <10k, then this should be doable with the code base as is. There are also minor tweaks we can introduce to speed things up (like disabling the grassmannian flag).

taylorreiter commented 2 years ago

  1. Could you clarify the input data type? Are you dealing with a count table of genomes x genes, genomes x kmers, samples x microbes? If samples x microbes, how are you doing the taxonomic profiling?

The count table would be samples x microbes, where each sample will be either a contaminated genome or a contaminated RNA-seq run. I'm doing the taxonomic profiling using sourmash gather (preprint here) against the GTDB rs202 database. Below, I've summarized up to the order lineage (from genome or species) and turned abundances into presence-absence information.

This is flexible to a certain extent... I could keep the abundances, summarize to a higher lineage, or not summarize lineages at all.

                                                                       sample1  sample2  sample3
d__Bacteria;p__Actinobacteriota;c__Coriobacteriia;o__Coriobacteriales  1        1        0
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales            1        0        1
d__Bacteria;p__Firmicutes_A;c__Clostridia_A;o__Christensenellales      1        1        0
d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Eubacteriales             1        0        0
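
For reference, this is roughly how I'm producing that table -- a minimal pandas sketch, where the long-format input and its column names (sample, lineage, abundance) are placeholders standing in for my summarized sourmash gather results, not the real gather output columns:

```python
import pandas as pd

# Placeholder long-format taxonomy hits: one row per (sample, lineage) match,
# with a GTDB-style lineage string and an abundance estimate.
hits = pd.DataFrame({
    "sample": ["sample1", "sample1", "sample2", "sample3"],
    "lineage": [
        "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae",
        "d__Bacteria;p__Firmicutes_A;c__Clostridia_A;o__Christensenellales;f__Christensenellaceae",
        "d__Bacteria;p__Firmicutes_A;c__Clostridia_A;o__Christensenellales;f__Christensenellaceae",
        "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae",
    ],
    "abundance": [0.12, 0.03, 0.05, 0.40],
})

# Truncate each lineage at the order rank (the first four ';'-separated fields).
hits["order"] = hits["lineage"].str.split(";").str[:4].str.join(";")

# Pivot to an order x sample table, then binarize to presence/absence.
counts = hits.pivot_table(index="order", columns="sample",
                          values="abundance", aggfunc="sum", fill_value=0)
presence = (counts > 0).astype(int)
print(presence)
```
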
  2. This is an unsupervised method (it's basically PCA tailored to count data). So the training data is really a table of counts (with biom primarily being used to efficiently store sparse representations of these counts).

Ok great, thank you! I haven't worked with the biom format before, but I'm sure I could convert my data into that format pretty easily.
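
Something like this sketch is what I had in mind for the conversion -- assuming the biom-format Python package, with the matrix, IDs, and file name all placeholders:

```python
import numpy as np
from biom import Table
from biom.util import biom_open

# Toy presence/absence matrix: observations (orders) as rows, samples as columns,
# which is the orientation biom expects.
data = np.array([[1, 1, 0],
                 [1, 0, 1]])
obs_ids = ["o__Coriobacteriales", "o__Bacteroidales"]
samp_ids = ["sample1", "sample2", "sample3"]

table = Table(data, observation_ids=obs_ids, sample_ids=samp_ids)

# Write an HDF5 .biom file that biom-aware tools can load.
with biom_open("orders.biom", "w") as f:
    table.to_hdf5(f, generated_by="sourmash presence/absence example")
```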

What would be the best data to train on? Sorry, I know that's a naive question. But I could have counts from all of the GTDB rs202 genomes in e.g. 10k SRA metagenomes -- would that be reasonable to use? It's not really the same application (metagenome composition vs. contamination of samples), but I would expect many of the same species-species relationships to exist across all of these data types.

  3. It largely depends on the column dimensionality. If the column dimensionality is big (i.e. ~100k), then we'd need bigger guns, like multi-node GPU support. If the column dimensionality is <10k, then this should be doable with the code base as is. There are also minor tweaks we can introduce to speed things up (like disabling the grassmannian flag).

Ok thanks. By dimensionality, does that mean rows x columns? If I stick with order-level contamination detection in genomes, I would have 1,312 orders (rows) x 258,406 genomes (columns). For the RNA-seq problem, I'm guessing I'll have 3,000 species (rows) x 50,000 RNA samples (columns).

That would be running everything all at once, which I think would be preferable, but I could also break it down so that I'm only looking within one order or one species at a time for a given set of samples. In that case, the number of columns would shrink substantially and would max out around 10k or 15k. That's less desirable, as it carves the data up in a way that I think could mask some correlations, but it's preferable to not being able to look for these correlations at all :)

mortonjt commented 2 years ago

Right. When I refer to dimensionality, I typically refer to samples (rows) x microbes (columns) -- where the goal is to infer microbe-microbe correlations (aka column-column correlations).

In order for the VAEs to work, there need to be more samples (rows) than microbes (columns) -- otherwise we need fancy regularization (aka stronger biological priors), which requires redesigning the core algorithm. Our approach to avoid this was just to bump up the number of samples. Otherwise all bets are off: you'd need big machines to fit your giant covariance matrix in memory, on top of strong priors; your estimates will almost certainly be biased and it won't be clear what exactly could be learned.
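
Just to put rough numbers on the memory point (a back-of-the-envelope sketch, not catvae code): a dense D x D covariance in float64 takes about D^2 * 8 bytes.

```python
# Approximate memory for a dense D x D float64 covariance matrix.
def cov_gib(d):
    return d * d * 8 / 2**30  # bytes -> GiB

for d in (1_312, 3_000, 258_406):
    print(f"D = {d:>7,}: ~{cov_gib(d):,.2f} GiB")
# D =   1,312: ~0.01 GiB
# D =   3,000: ~0.07 GiB
# D = 258,406: ~497.50 GiB
```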

If your first case is to learn pairwise interactions between 258k x 258k microbes, that will probably not be tractable. But if you are trying to estimate 1.3k x 1.3k order interactions, that's easy and will take a few days to run on a GPU.

Similarly, if your second case is to learn pairwise interactions between 3k x 3k microbes, that's even easier.