caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License

centralize metadata parsing #60

Closed gregcaporaso closed 8 years ago

gregcaporaso commented 8 years ago

It would be good to have a parse_sample_metadata function, and then use that instead of calls like sample_metadata = pd.read_table(mapping_fp, index_col=0). That call is problematic in some cases we've run into (here's how it should be done), and we want to be able to fix it in one place rather than have the file potentially parsed differently in different parts of the code base.
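A minimal sketch of what such a helper could look like (the function name comes from this issue; the parsing details follow the two-step approach described later in the thread, and are an assumption about what the final implementation would do):

```python
import io
import pandas as pd


def parse_sample_metadata(f):
    """Parse tab-separated sample metadata into a pandas DataFrame.

    All columns are read as dtype=object first and the index is set
    afterwards, so sample IDs stay strings (e.g. leading zeros survive).
    """
    df = pd.read_csv(f, sep='\t', dtype=object)
    return df.set_index(df.columns[0], drop=True, append=False)


# Example usage with an in-memory mapping file:
metadata = parse_sample_metadata(
    io.StringIO("#SampleID\tSomething\n0123\ta\n0001\tb"))
print(list(metadata.index))  # ['0123', '0001']
```

Centralizing the call this way means a future fix to the parsing logic only has to happen in one function.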

wasade commented 8 years ago

Just curious, what's the need for drop and append in that example?

gregcaporaso commented 8 years ago

Just so it's all in one place (@wdwvt1 was also asking about this). The way we're doing it in QIIME 2 is:

>>> import io
>>> import pandas as pd
>>> map_f = io.StringIO("""#SampleID\tSomething
0123\ta
0001\tb""")
>>> df = pd.read_csv(map_f, sep='\t', dtype=object)
>>> df = df.set_index(df.columns[0], drop=True, append=False)
>>> df
          Something
#SampleID
0123              a
0001              b

If you don't do the two-step setting of the index, the sample ids won't be interpreted as strings (dtype=object isn't applied to an index column parsed during the read), so in this example the leading zeros would be removed.
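To make the failure mode concrete, here's a small comparison (the names `naive` and `careful` are mine) of parsing the index during the read versus setting it afterwards:

```python
import io
import pandas as pd

data = "#SampleID\tSomething\n0123\ta\n0001\tb"

# One step: index_col=0 parses the index during the read, so the
# sample IDs are inferred as integers and the leading zeros are lost.
naive = pd.read_csv(io.StringIO(data), sep='\t', index_col=0)
print(list(naive.index))    # [123, 1]

# Two steps: read every column as object, then set the index, so the
# sample IDs stay strings and the leading zeros survive.
careful = pd.read_csv(io.StringIO(data), sep='\t', dtype=object)
careful = careful.set_index(careful.columns[0], drop=True, append=False)
print(list(careful.index))  # ['0123', '0001']
```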

The drop=True and append=False are not strictly necessary here since they're the defaults, but pandas has changed its defaults in backward-incompatible ways before, so @jairideout added them to protect us against that in the future (thanks for the explanation of these two parameters, @jairideout!).