metadata_to_dataframe order matters

guillaume-gricourt commented 3 years ago

Hi, with this file, good.txt, everything works. When the order of the metadata is different, as in bad.txt, you get: ValueError: 2 columns passed, passed data had 6 columns. Maybe the parser could take the maximum number of values into account before parsing them? biom-format v2.1.10
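
The error message can be reproduced with pandas alone. The sketch below is not biom-format's code; it assumes (the thread does not confirm this) that the column names are derived from the first metadata entry, which would explain why the order of the rows matters. The lineages and the helper names `columns_from_first` and `columns_from_longest` are made up for illustration.

```python
import pandas as pd

# Hypothetical lineages: the first row is shallow, the second is deeper.
rows = [
    ["k__Bacteria", "p__Firmicutes"],
    ["k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Lactobacillales",
     "f__Streptococcaceae", "g__Streptococcus"],
]


def columns_from_first(rows, key="taxonomy"):
    """Name columns after the first row only (the assumed failing behaviour)."""
    return [f"{key}_{i}" for i in range(len(rows[0]))]


def columns_from_longest(rows, key="taxonomy"):
    """Name columns after the longest row (the suggestion in this comment)."""
    width = max(len(row) for row in rows)
    return [f"{key}_{i}" for i in range(width)]


# Shallow row first: two column names for data that is six values wide.
try:
    pd.DataFrame(rows, columns=columns_from_first(rows))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Deep row first: the same data, reordered, builds without error.
reordered = rows[::-1]
ok_df = pd.DataFrame(reordered, columns=columns_from_first(reordered))

# Using the maximum width works regardless of order; short rows get missing values.
df = pd.DataFrame(rows, columns=columns_from_longest(rows))
```

With the shallow row first, `error_message` holds the same text as the traceback above; with the maximum width, the short lineage is simply filled with missing values.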

wasade commented 3 years ago

Hi @guillaume-gricourt, that parser was designed to support classic OTU tables from QIIME1, where the lineages were assured to be balanced, with placeholders for unidentified names. TSVs are not BIOM-Format and are unstructured, which creates a wide range of edge cases.

As a workaround, could you parse the counts without metadata, parse the taxonomy separately, and add it in with biom.Table.add_metadata?

guillaume-gricourt commented 3 years ago

Yeah, it's a good workaround. I create biom files from TSV to load data into the phyloseq package. This file is also my entry point for other analyses. From now on, when I create this biom file, I'll check the order of the metadata in my TSV file. Since this kind of biom file can be created, it seems to me that this would be a feature worth implementing?

wasade commented 3 years ago

I'd greatly welcome a pull request to resolve this feature request; otherwise, I'm not sure when I'll be able to get to it. A possible workaround is below.

$ biom convert -i bad.txt -o bad.biom --to-hdf5
$ python
Python 3.6.11 | packaged by conda-forge | (default, Aug  5 2020, 20:19:23)
[GCC Clang 10.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
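>>> import pandas as pd
>>> import biom
>>> # Read the TSV with pandas so the taxonomy column stays a plain string
>>> df = pd.read_csv('bad.txt', sep='\t')
>>> df.set_index('#OTU ID', inplace=True)
>>> # Load the table that biom convert produced above
>>> t = biom.load_table('bad.biom')
>>> # Map each OTU ID to its lineage, split on ';' into a list of levels
>>> formatted = {k: {'taxonomy': v.split(';')} for k, v in df['taxonomy'].items()}
>>> t.add_metadata(formatted, axis='observation')
>>> # Write the table with the attached taxonomy to a new HDF5 BIOM file
>>> with biom.util.biom_open('okay.biom', 'w') as fp:
...   t.to_hdf5(fp, 'converted')
...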
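
If balanced lineages are wanted, as in the QIIME1 tables mentioned earlier, the taxonomy dict can be padded before it is attached. This is only a sketch: `pad_lineages` is a hypothetical helper, not part of biom-format, and the `"Unassigned"` placeholder and the example OTU IDs are assumptions chosen for illustration.

```python
def pad_lineages(taxonomy_by_otu, sep=";", placeholder="Unassigned"):
    """Split lineage strings and pad every lineage to the deepest one."""
    # Split each lineage and drop surrounding whitespace and empty levels.
    split = {
        otu: [level.strip() for level in lineage.split(sep) if level.strip()]
        for otu, lineage in taxonomy_by_otu.items()
    }
    # The deepest lineage sets the common depth (0 if there are no OTUs).
    depth = max((len(levels) for levels in split.values()), default=0)
    # Shape the result like the dict passed to add_metadata in the session above.
    return {
        otu: {"taxonomy": levels + [placeholder] * (depth - len(levels))}
        for otu, levels in split.items()
    }


# Hypothetical input: OTU ID -> lineage string, with uneven depth.
taxonomy_by_otu = {
    "OTU_1": "k__Bacteria; p__Firmicutes",
    "OTU_2": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales",
}
formatted = pad_lineages(taxonomy_by_otu)
```

The resulting dict has the same shape as `formatted` in the session above, so it could be passed to `t.add_metadata(formatted, axis='observation')` in its place.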

biocore / biom-format

metadata_to_dataframe order matters #855