KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io

PCA behavior #367

Open chasemc opened 2 weeks ago

chasemc commented 2 weeks ago

Would it be okay to change

https://github.com/KwanLab/Autometa/blob/0d9028cf7bad20d6e28667aaba9d3889a15ace09/autometa/common/kmers.py#L601-L607

so that it adapts to a lower PCA dimension when there aren't enough contigs/k-mers? Something like:

    if n_components > pca_dimensions and pca_dimensions != 0:
        if n_samples < pca_dimensions:
            # Too few contigs for the requested dimensions: lower the target.
            lowered_dimensions = min(n_samples, n_components)
            logger.warning(
                f"n_samples ({n_samples}) is less than pca_dimensions ({pca_dimensions}), "
                f"lowering pca_dimensions to {lowered_dimensions}."
            )
            pca_dimensions = lowered_dimensions
        logger.debug(
            f"Performing decomposition with PCA (seed {seed}): {n_components} to {pca_dimensions} dims"
        )
        X = PCA(n_components=pca_dimensions, random_state=random_state).fit_transform(X)
        n_samples, n_components = X.shape

chasemc commented 2 weeks ago

To be clear: as written, this would only happen when there are fewer "samples" (contigs) than PCA dimensions.
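
For illustration, the condition can be isolated in a small standalone function. This is only a sketch: `clamp_pca_dimensions` and its parameter names are hypothetical and not part of Autometa, and `n_features` stands in for what the snippet above calls `n_components` before the PCA step. It relies on scikit-learn's `PCA` accepting at most `min(n_samples, n_features)` components, and the example numbers are made up.

```python
def clamp_pca_dimensions(n_samples: int, n_features: int, pca_dimensions: int) -> int:
    """Return the number of PCA dimensions that can actually be computed.

    Hypothetical helper, not part of Autometa. A return value of 0 means
    "skip PCA", mirroring the `pca_dimensions != 0` and
    `n_components > pca_dimensions` checks in the proposed snippet.
    """
    # PCA is skipped when disabled or when the data already has no more
    # features than the requested number of dimensions.
    if pca_dimensions == 0 or n_features <= pca_dimensions:
        return 0
    # Otherwise never request more components than there are samples (contigs).
    return min(pca_dimensions, n_samples, n_features)


if __name__ == "__main__":
    # Plenty of contigs: the requested dimensions are kept.
    print(clamp_pca_dimensions(n_samples=1000, n_features=136, pca_dimensions=50))
    # Tiny test dataset: the dimensions are lowered to the number of contigs.
    print(clamp_pca_dimensions(n_samples=10, n_features=136, pca_dimensions=50))
```

With enough contigs the requested value passes through unchanged, so only very small inputs are affected.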

jason-c-kwan commented 2 weeks ago

What would be the point of doing PCA on a dataset with fewer than 50 contigs before some other dimension reduction technique? I think some data should be gathered on whether this is useful or makes a difference before making the change.

chasemc commented 2 weeks ago

The main reason is so that a minimal dataset, one small enough to run quickly, doesn't fail when testing the workflows.