biocore / emp

Code repository of the Earth Microbiome Project.
http://www.earthmicrobiome.org
BSD 3-Clause "New" or "Revised" License
154 stars 68 forks source link

Moving emp20k_metadata_refine notebook from my repo to biocore #64

Closed cuttlefishh closed 7 years ago

cuttlefishh commented 8 years ago

@antgonza @josenavas @sjanssen2 @ekopylova @tkosciol -- Would y'all like to have a look at this PR that I presented a couple weeks ago? It would be great to get it into biocore, along with the other PRs in this repo.

antgonza commented 8 years ago

Some comments:

cuttlefishh commented 8 years ago

Thanks for the comments! All should be easy fixes.

On Aug 16, 2016, at 6:52 AM, Antonio Gonzalez notifications@github.com wrote:

Some comments:

datetime.datetime.today() twice, is that really necessary? Replace for only 1 call could you replace math.isnan for numpy.isnan? This will remove the import of math cold you merge cell 2 and 3? Both of them are setting options. in cell 7, how certain are we that values can't have multiple zeros, I mean: 10.0, 10.00, or 10.0000000 can cell 12 and 13 be merged? can 14 and 15 be merged and remove first line of 15? cell 16, can you explain the magic number 4? can 22-25 be a single cell? what about (single cell) 102 and 103? can we reset the count of the cells (remove the jump 20s -> 100s)? all the metadata fixes (geography, env biom, etc) should be done in the source files, qiita(?) remove the unused cells: Additional metadata columns (from Fauzi), Hierarchical taxonomy lookup (from Jon), or at least add that they are placeholders. Also move to the end, not the middle of the doc fix plot labels so they are actually readable. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/biocore/emp/pull/64#issuecomment-240108048, or mute the thread https://github.com/notifications/unsubscribe-auth/AFa_ZteUhyJMjHI2vfyDJwAg5q5Og87Fks5qgcCYgaJpZM4JgfT5.

cuttlefishh commented 8 years ago

@antgonza Only thing I haven't integrated is pushing the metadata fixes back to the source files. For major issues, a quick fix is to send Gail the corrected data and she can manually fix those sample info files. Longer term solution is to integrate directly with Qiita. For this we need to define which are true errors that need to be fixed with no maintenance of the old metadata, versus suggested changes where we might want both the old column and new column.

Another factor is, eventually we want a metadata input solution that checks/prevents many of these issues before they make it into Qiita. So now these are one-time fixes, and going forward we will catch such mistakes when the study is uploaded.

For the purposes of this notebook and the EMP paper, we are working with the mapping files from Qiita as they are, and merging to a single quality-controlled ("refined") mapping file. The code does generate per-study mapping files, but they contain just the columns from the merged file, not all the columns in the original mapping files.

antgonza commented 8 years ago

Thanks! Note that you didn't push the changes.

Anyway, I think that if you generate the fixed files per study we can really quickly update Qiita, re-download and integrate them with this script. Then later we can think of how to integrate this code with Qiita. This will mean that specific per-study fixes will not be necessary anymore and we can only leave the general ones. What do you think?

cuttlefishh commented 8 years ago

I still need to do the painstaking work of updating the taxids to be correct. Right now I can look up the taxonomy but it's incorrect in some cases due to incorrect taxids. Once I integrate that I'll push those commits.

That sounds like a plan. Happy to use my code however it helps get the mapping files more accurate.

cuttlefishh commented 8 years ago

OK, the updates are pushed now, including a bunch of tedious host_taxid updates. Should be in good shape now.

cuttlefishh commented 7 years ago

I'm closing this PR because notebook is now merged to master as emp/ipynb/analysis/01-metadata-processing/metadata_refine.ipynb. Thanks!