biocore / gemelli

Gemelli is a tool box for running Robust Aitchison PCA (RPCA), Joint Robust Aitchison PCA (Joint-RPCA), TEMPoral TEnsor Decomposition (TEMPTED), and Compositional Tensor Factorization (CTF) on sparse compositional omics datasets.
BSD 3-Clause "New" or "Revised" License

ValueError: No more features left. Check to make sure that the sample names between `sample-metadata` and `table` are consistent #34

Open johannesbjork opened 4 years ago

johannesbjork commented 4 years ago

Running the stand-alone version of gemelli on the example data used in the tutorial, I get the error: ValueError: No more features left. Check to make sure that the sample names between `sample-metadata` and `table` are consistent

As I'm not a Python person, I filter the example data in R.

library(dplyr)     # provides %>% and filter()
library(phyloseq)  # provides phyloseq(), otu_table(), sample_data()

mdat <- read.table("IBD-2538/data/metadata.tsv", sep='\t', header=T) # nrow(mdat) 516
ftbl <- biomformat::read_biom("IBD-2538/data/table.biom")
ftbl <- as(biomformat::biom_data(ftbl), "matrix") # ncol(ftbl) 470

mdat <- mdat %>% filter(sample_name %in% colnames(ftbl))
rownames(mdat) <- mdat$sample_name

ps <- phyloseq(otu_table(ftbl, taxa_are_rows=T),
                   sample_data(mdat))
# here I skip adding the taxonomy

ps <- metagMisc::phyloseq_filter_prevalence(ps, prev.trh=0.2, abund.trh=10, abund.type="total", threshold_condition="AND")

> ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 236 taxa and 318 samples ]
sample_data() Sample Data:       [ 318 samples by 128 sample variables ]

# Do we need to filter to only keep subjects with >=t timepoints?

biomformat::write_biom(biomformat::make_biom(t(otu_table(ps))), "table_filt.biom")
write.table(sample_data(ps), "metadata_filt.txt", sep="\t", quote=F)

Having made sure that the samples match between the feature table and the metadata (and having filtered out the rare features), I run gemelli and get the following error:

gemelli \
--in-biom table_filt.biom \
--sample-metadata-file metadata_filt.txt \
--individual-id-column 'host_subject_id' \
--state-column-1 'timepoint' \
--output-dir results      

Traceback (most recent call last):
  File "/Users/johannesbjork/python/miniconda3/bin/gemelli", line 8, in <module>
    sys.exit(standalone_ctf())
  File "/Users/johannesbjork/python/miniconda3/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/johannesbjork/python/miniconda3/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/johannesbjork/python/miniconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/johannesbjork/python/miniconda3/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/johannesbjork/python/miniconda3/lib/python3.7/site-packages/gemelli/scripts/_standalone_ctf.py", line 131, in standalone_ctf
    feature_metadata)
  File "/Users/johannesbjork/python/miniconda3/lib/python3.7/site-packages/gemelli/ctf.py", line 97, in ctf_helper
    raise ValueError(("No more features left.  Check to make sure that "
ValueError: No more features left.  Check to make sure that the sample names between `sample-metadata` and `table` are consistent

cameronmartino commented 4 years ago

Hi @johannesbjork,

Thank you for reporting this! The standalone CLI is the only tutorial I did not make and it seems that was an oversight on my part.

The error occurs because the sample IDs look like floating-point numbers. pandas therefore loads them from the metadata as floats, while biom loads them from the table as strings. As a result, no sample IDs match between the table and the metadata, which triggers the gemelli error shown above.
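The mismatch can be reproduced with pandas alone. The snippet below is a minimal sketch, not gemelli's own loading code: the sample IDs (`10.10`, `10.20`) and the in-memory TSV are made up for illustration, and `table_ids` stands in for the string IDs that biom would report.

```python
import io

import pandas as pd

# Hypothetical metadata with float-looking sample IDs (illustrative values only).
tsv = "sample_name\ttimepoint\n10.10\t1\n10.20\t2\n"

# Stand-in for the sample IDs biom reads from the table: always strings.
table_ids = {"10.10", "10.20"}

# Default parsing: pandas infers float64, so "10.10" becomes 10.1.
naive = pd.read_csv(io.StringIO(tsv), sep="\t").set_index("sample_name")
naive_ids = {str(i) for i in naive.index}
shared_naive = naive_ids & table_ids  # nothing overlaps

# Forcing the ID column to be read as text keeps the IDs intact.
fixed = pd.read_csv(
    io.StringIO(tsv), sep="\t", dtype={"sample_name": str}
).set_index("sample_name")
shared_fixed = set(fixed.index) & table_ids  # every ID overlaps

print(sorted(naive_ids), sorted(shared_naive))
print(sorted(fixed.index), sorted(shared_fixed))
```

Checking the overlap of the two ID sets like this before running gemelli is a quick way to tell whether the error comes from ID parsing rather than from genuinely missing samples.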

I just fixed this in the tables here (fixed-IBD-example.zip) by adding a string ('s') to the sample names, so that they are no longer parsed as numbers.
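A similar fix could be applied to your own files along these lines. This is only a sketch under stated assumptions: the post does not say where the 's' goes, so a prefix is assumed; `add_prefix_to_ids` is a hypothetical helper, not part of gemelli; and a small pandas DataFrame stands in for the feature table. For a real BIOM file the same renaming would have to be applied to the table's sample IDs (the biom-format package's `Table.update_ids` should be usable for that).

```python
import io

import pandas as pd


def add_prefix_to_ids(ids, prefix="s"):
    """Return the IDs as strings with a non-numeric prefix (hypothetical helper)."""
    return [prefix + str(i) for i in ids]


# Hypothetical metadata, read with the ID column forced to text.
raw = "sample_name\thost_subject_id\ttimepoint\n10.10\tA\t1\n10.20\tA\t2\n"
metadata = pd.read_csv(
    io.StringIO(raw), sep="\t", dtype={"sample_name": str}
).set_index("sample_name")

# Stand-in for the feature table: features as rows, samples as columns.
table = pd.DataFrame(
    {"10.10": [5, 0], "10.20": [3, 7]}, index=["feat1", "feat2"]
)

# Rename the samples identically on both sides.
metadata.index = add_prefix_to_ids(metadata.index)
table.columns = add_prefix_to_ids(table.columns)

# Round trip: the prefixed IDs survive a default pandas read as strings.
written = metadata.to_csv(sep="\t", index_label="sample_name")
reloaded = pd.read_csv(io.StringIO(written), sep="\t").set_index("sample_name")
print(list(reloaded.index))
```

Because the IDs now contain a letter, any reader that infers column types will keep them as text, so the table and the metadata agree without special parsing options.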

I will put in a PR for this fix and a standalone tutorial (issue #35).

The following command runs fine:

mkdir standalone-results
gemelli \
    --in-biom fixed-IBD-example/table.biom \
    --sample-metadata-file fixed-IBD-example/metadata.tsv \
    --individual-id-column 'host_subject_id' \
    --state-column-1 'timepoint' \
    --output-dir standalone-results

To save runtime (since this is only an example), you could also remove singletons with the --min-feature-count flag:

gemelli \
    --in-biom fixed-IBD-example/table.biom \
    --sample-metadata-file fixed-IBD-example/metadata.tsv \
    --individual-id-column 'host_subject_id' \
    --state-column-1 'timepoint' \
    --min-feature-count 1 \
    --output-dir standalone-results
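To make the effect of removing singletons concrete, here is a rough pandas illustration. It assumes the threshold is compared against each feature's total count across all samples and that features must exceed it to be kept; the exact comparison gemelli performs is an assumption here, and the table values are invented.

```python
import pandas as pd

# Invented count table: features as rows, samples as columns.
counts = pd.DataFrame(
    {"s1": [1, 4, 0], "s2": [0, 6, 2], "s3": [0, 1, 0]},
    index=["feat_singleton", "feat_common", "feat_rare"],
)

min_feature_count = 1  # mirrors the value passed on the command line above

# Total count of each feature across all samples.
totals = counts.sum(axis=1)

# Keep features whose total exceeds the threshold (assumed strict comparison),
# so a feature observed exactly once in the whole table is dropped.
filtered = counts.loc[totals > min_feature_count]

print(filtered)
```

Fewer features means a smaller tensor to factorize, which is where the runtime saving comes from.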

This also brings up a good point that a tutorial with R integration would be nice. I have added that to issue #35.

Thank you again for letting me know! Please let me know if this does not solve the problem for you.