MFEh2o / db

**Contains the main issue tracker for the MFE DB!** Functions for interacting with the MFE database, in script format. (See also MFEUtilities, which is an R package that includes many/most of the same functions).

BENTHIC_INVERTS need controlled vocabulary on taxonomic names #31

Closed ctsolomon closed 3 years ago

ctsolomon commented 5 years ago

There are a bunch of unintentional near-duplicates in the taxonomic names columns (everything from subclass through taxa). For instance, the subclass column has "bivalvia" and "bivalvia " (the latter with a trailing space); orderSample has "trichoptera" and "tricoptera"; etc.

In addition to fixing these errors in existing data, we should make sure that the data entry tool for benthic inverts (once it is written) helps prevent more errors like this.
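Both kinds of near-duplicate described above (stray whitespace and one-letter misspellings) can be flagged automatically. Here is a minimal Python sketch of that check. It is not the project's actual tooling (which is in R); the function names `normalize_name` and `near_duplicates` and the 0.85 similarity cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize_name(name):
    """Lowercase a taxonomic name and strip leading/trailing whitespace."""
    return name.strip().lower()


def near_duplicates(names, cutoff=0.85):
    """Return pairs of distinct normalized names with similarity >= cutoff."""
    unique = sorted({normalize_name(n) for n in names})
    pairs = []
    for a, b in combinations(unique, 2):
        if SequenceMatcher(None, a, b).ratio() >= cutoff:
            pairs.append((a, b))
    return pairs


# Hypothetical column values, modelled on the examples in this issue.
order_sample = ["trichoptera", "tricoptera", "bivalvia", "bivalvia ", "diptera"]
print(near_duplicates(order_sample))
# [('trichoptera', 'tricoptera')]
```

Normalizing first collapses "bivalvia" and "bivalvia " into one name, so only the genuine misspelling is left for a person to review. The cutoff would probably need tuning on the real columns, since short names can look similar by chance.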

ctsolomon commented 5 years ago

One error in these names that would be easy to miss: ephemeroptera_pupae (Ephemeroptera don't pupate, so not sure what is going on with the couple of samples in EL, WL with this name in the taxa column). These rows indicate that the bug was a caenid.

ctsolomon commented 5 years ago

In subfamily column, a few rows have "chironomidae" listed but this is not a subfamily. Maybe a typo for "chironominae"? Figure out how to resolve.

kaijagahm commented 4 years ago

So far: have gone through and corrected things that were obvious spelling errors (e.g. changed "tricoptera" to "trichoptera"; removed extraneous white spaces, etc.)

Remaining issues:

  1. ephemeroptera_pupae (as described by Chris above, Ephemeroptera don't pupate). There are two inverts: WL_NW_20120510_0000_sediment_1_zoobenthos.Sample.20110714_276 and WL_NW_20120510_0000_sediment_1_zoobenthos.Sample.20110714_277. The project is 3: Long Lake survey. Genus is listed as caenidae_pupa and taxa is listed as ephemeroptera_pupae. No picker initials are given.

  2. We need to standardize name endings, like pupa/pupae, larva/larvae. Why would we ever want plural? Shouldn't they all be singular?

  3. Some names are repeated in multiple levels: e.g. bivalvia and hirudinea are listed in both subclass and orderSample. Should I be googling to figure out which level these belong at, and correcting them accordingly, or is there a different way? [Note that both of these are weird in the OTU table: bivalvia is a class, but the OTU table doesn't have a column for class, so "bivalvia" is the value in the otu column, with nothing for the other levels. Is this right? Should we have a class column? Hirudinea is a subclass, but again, there is no subclass column. But instead of leaving the fields blank, OTU has "hirudinea_subclass" for orderTax. This seems inconsistent: we shouldn't have taxon level info in both the column names and in the values of the columns.]

  4. As Chris noted, "chironomidae" is not a subfamily, though it is listed as such in BENTHIC_INVERTS. Maybe should be "chironominae"? Note that for all of these samples, the family is indeed listed as "chironomidae." But maybe we need to check the samples to confirm the ID before correcting the subfamily?

  5. "corduliidae" and "phryganeidae" are listed in the genus column, but they are not genera. For "corduliidae", seems like just an accidental spillover into the genus column. For "phryganeidae", note that the taxa column says "phryganea", which IS a genus. Maybe that was also the value that should be in the genus column? Check to make sure there's not anything more complicated happening here before correcting these.

  6. Should probably get someone who knows the spellings to do a general spell check on all of these, or should pinpoint a definitive source (the OTU table?), get that cleaned up, and then compare against it.

  7. Taxon name sometimes contains life stage (e.g. pupa) and sometimes doesn't. Why don't we have a separate column for life stage? If none is listed, is there a default (e.g. adult), or is it unknown? [this comes up as well in FISH_DIETS].

  8. Several taxa are listed as "[x]or[y]". Is this a desirable way of identifying? Or would it be better to revert to the lowest taxonomic level at which we can have a sure ID?

  9. Someone who knows bugs should look through these names to check for other errors like 1., which I would not have picked up on because I don't know the bugs.

  10. BENTHIC_INVERTS$subclass has both "nematode" and "nematoda", but neither of these is a subclass. Should probably have phylum == "nematoda", but there is no phylum column. What to do?

  11. See also #34 for a specific problem with some contradictory names.

  12. See also relevant problems in #30, which I'm linking to because they also concern standardization of taxonomic names.
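Item 6 above suggests cleaning up one definitive source and then comparing the table against it. A minimal Python sketch of that comparison follows. The function name `unknown_names` and the toy vocabulary are assumptions for illustration only; the real vocabulary would come from whichever source is chosen as definitive.

```python
def unknown_names(rows, vocabulary):
    """Return {column: sorted values that are not in that column's vocabulary}.

    rows is a list of dicts (one per invert); None means "not identified".
    vocabulary maps a column name to the set of names allowed in it.
    """
    problems = {}
    for column, allowed in vocabulary.items():
        seen = {row[column] for row in rows if row.get(column) is not None}
        bad = sorted(seen - allowed)
        if bad:
            problems[column] = bad
    return problems


# Toy vocabulary and rows, modelled on items 4 and 5 above (not real data).
vocabulary = {
    "family": {"chironomidae", "corduliidae", "phryganeidae"},
    "subfamily": {"chironominae"},
    "genus": {"phryganea"},
}
rows = [
    {"family": "chironomidae", "subfamily": "chironomidae", "genus": None},
    {"family": "phryganeidae", "subfamily": None, "genus": "phryganeidae"},
]
print(unknown_names(rows, vocabulary))
# {'subfamily': ['chironomidae'], 'genus': ['phryganeidae']}
```

Checking each column against its own set catches names that are spelled correctly but sit at the wrong level, which a plain spell check would miss.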

ctsolomon commented 4 years ago

1: Let's change the ID on these bugs to NA (i.e. NA in all of the taxonomic columns)

2: I put together a controlled vocabulary for bug taxonomic names. Posting to Slack, still a couple things for me and Kaija to work through. This controlled vocabulary doesn't use larva/larvae or pupa/pupae at the end of any names; instead, we add a "pupa" column (takes values 0, 1).

3: Fixed by my new controlled vocabulary. In brief: I propose that we rename the "taxa" column "otu". This column gives the name of the organism at the lowest level to which we identified it. That same name will also show up in a higher-level taxonomic column, like genus or family. Highest taxonomic levels are order (orderSample) and supergroup (a catch-all for things like class, sub-class, anything higher than order).

4: I corrected subfamilies in my controlled vocabulary, assuming that the bug was correctly identified to genus and filling in subfamily and tribe from there.

5: Controlled vocabulary fixes this; I've made notes of where I think we might still need to check on things.

6: I think spellings are now all correct in controlled vocabulary document.

7: This is now fixed by the controlled vocabulary (for BENTHIC_INVERTS, not yet for fish).

8: Yeah, the "x_or_y" identification is a little ugly, obviously. I have retained it in a couple places where I think the genera are hard to distinguish, figuring that knowing that it's one of those two is information and we might as well keep it - we can always drop back to family level for analysis if we want.

9: I have now looked through all the unique combinations of bug names, and have dealt with any issues in the controlled vocabulary document. That is: unique(d[,c("subclass","orderSample","family","subfamily","tribe","genus","taxa")]) where d is the complete BENTHIC_INVERTS table from the MFEdb_20190612 version of the database. If more recent versions of the database have additional bugs we should check those bugs too.

10: Fixed in the controlled vocabulary document.

11 and 12: I'll address those in the separate GitHub issues that they relate to.
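The "otu" rule proposed in point 3 (the name at the lowest level to which the organism was identified) can be sketched in Python as below. This is only an illustration, not code from the database scripts: the function name `derive_otu` is hypothetical, and the rank order is assumed from the column list quoted in point 9, with supergroup placed above orderSample.

```python
# Ranks from highest to lowest, as assumed from the discussion above.
RANKS = ["supergroup", "orderSample", "family", "subfamily", "tribe", "genus"]


def derive_otu(row):
    """Return the name at the lowest rank that has a value, or None."""
    for rank in reversed(RANKS):
        value = row.get(rank)
        if value:  # skips None and empty strings
            return value
    return None


# Hypothetical rows for illustration.
print(derive_otu({"orderSample": "trichoptera", "family": "phryganeidae",
                  "genus": "phryganea"}))   # phryganea
print(derive_otu({"supergroup": "nematoda"}))  # nematoda
print(derive_otu({}))                          # None
```

Deriving otu from the rank columns, rather than typing it separately, would guarantee that the same name also appears in a higher-level column, as point 3 intends.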

kaijagahm commented 4 years ago

  1. In your corrected names sheet, you kept these as is (left ephemeroptera and caenidae) and indicated a 1 in the pupa column. Why do you think it's better to make these NA rather than just remove the 1 from the pupa column, since these bugs don't pupate? [Edit 9/24: @ctsolomon looks like I forgot to tag you in this. Thoughts? This is in reference to initial question 1 above.]

  2. Do you think we should have a "larva" column as well as a "pupa" column, OR just have a "stage" column that can take many different values? [Edit 9/24: Chris and I decided that we don't need to specify larva because they're almost always larvae].

The rest look good. Currently working on resolving any remaining problems with the taxon names.

kaijagahm commented 4 years ago

Working on fixing the remaining problems in @ctsolomon's controlled vocab: here are sheets with the invert ID, the comment for what is wrong, and my comments in attempts to resolve them. Was able to fix many by looking back at the photos. For some, I can't find photos. Some have photos but need someone with bug experience to help.

https://docs.google.com/spreadsheets/d/1s4k2LEedvI2lC_FE6ALVEBo5GPS8DC7agUQXOofB1Wk/edit?usp=sharing

kaijagahm commented 4 years ago

Finished looking for/looking at photos. Called in Amaryllis to help figure out the remaining ID's. For the ones where I couldn't find a photo, we may leave them as unidentified and make a note about it.

May try harder to figure out individual bugs from the group photos, if I hear back from Randi about file naming.

kaijagahm commented 4 years ago

@ctsolomon I've finished almost all of the updates to the benthic inverts taxonomy, just waiting on a few final ID's. Now working on reconciling the OTU table with BENTHIC_INVERTS. Here are some questions that have come up in doing that:

a. I noticed that in BENTHIC_INVERTS, the new "supergroup" column is almost always empty, unless we need to use it. But of course, all lower taxonomic classes should have a corresponding supergroup. Do you want me to fill those in even when we have information at lower ranks, or not? If not, then should I remove supergroup information where it does exist when there are lower ranks?

b. OTU has no supergroup, subfamily, or tribe columns. Since there are some taxa in BENTHIC_INVERTS where our most specific ID for them is a supergroup, subfamily or tribe, should I add those columns to OTU to allow for that?

c. How do we want to handle life stages in OTU? This is more complicated than adding a pupa column because OTU also contains fish and other things, so there are YOY, larvae, pupae, etc.

Life stages are currently tacked onto the end of the taxa names in the otu column. That's a problem, though, because sometimes there's no entry for just the plain taxon name, and names that come from BENTHIC_INVERTS will never have a life stage on them, so it could make it hard to join the tables. For example, there's no OTU entry for "coleoptera", although we have "coleoptera_adult", "coleoptera_aquatic", "coleoptera_larvae", and "coleoptera_terrestrial". Similarly, we have "gerridae_adult" but no entry for "gerridae", and we have "leptoceridae_pupa" but no entry for "leptoceridae".

I think the most logical thing would be to add a life stage column to OTU. Would that mess anything up workflow-wise, as far as we can tell? What about for fish? The alternative, I guess, would be to leave those life stage rows but add non-life-stage rows as well, so we'd still have a plain "coleoptera" row on top of the existing ones.

d. Question raised by the coleoptera example above: do we also need to distinguish between habitat types, e.g. aquatic vs. terrestrial coleoptera? I assume this was relevant to fish diets, but it seems a little clunky to store that information in OTU.
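To make the join problem in (c) concrete, here is a small Python sketch that splits a life stage off the end of an otu name and lists the taxa that have stage-specific rows but no plain row. The names `split_stage` and `missing_plain_entries` are hypothetical, and the `STAGES` set is an assumption: it covers only the invert stages mentioned above, not fish stages like YOY, and it deliberately treats habitat suffixes like "aquatic" as part of the name.

```python
# Assumed list of life-stage suffixes; habitat suffixes are not included.
STAGES = {"adult", "larva", "larvae", "pupa", "pupae"}


def split_stage(otu):
    """Split 'taxon_stage' into (taxon, stage); stage is None if absent."""
    base, sep, suffix = otu.rpartition("_")
    if sep and suffix in STAGES:
        return base, suffix
    return otu, None


def missing_plain_entries(otu_names):
    """Return taxa that appear only with a life stage attached."""
    names = set(otu_names)
    bases = set()
    for name in names:
        base, stage = split_stage(name)
        if stage is not None:
            bases.add(base)
    return sorted(bases - names)


# Example names taken from the comment above, plus a hypothetical diptera pair.
otu = ["coleoptera_adult", "coleoptera_aquatic", "coleoptera_larvae",
       "coleoptera_terrestrial", "gerridae_adult", "leptoceridae_pupa",
       "diptera", "diptera_pupa"]
print(missing_plain_entries(otu))
# ['coleoptera', 'gerridae', 'leptoceridae']
```

The same split would be the first step of either option in (c): filling a new life stage column, or generating the plain rows that are currently missing.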

ctsolomon commented 4 years ago

Re indicating a 1 in pupa column for those two Ephemeroptera - mistake on my part, you are correct that we should not indicate 1 there.

(a): According to my notes above (10 July), supergroup is intended to be a catch-all column for taxonomic rankings higher than order. I think we should use this column only when an organism isn't identified at some lower level - so in cases like "Nematoda". If supergroup column is NA but there are lower level identifications, I would leave supergroup as NA. If there are lower identifications AND something in supergroup...yes maybe remove supergroup, but can you give me a sense of the cases in which that occurs?

(b): Yes

(c): The tricky bit here I think is the interface with the FISH_DIETS table. Taxa can appear in multiple life stages in fish diets - so for example fish can eat larval/aquatic or adult/terrestrial dragonflies. Whatever we do to the OTU table needs to also work for merging with the FISH_DIETS table. That could involve adding a life stage column to FISH_DIETS. Let's talk with everyone at Tues meeting on this one.

(d) Agree this is a little clunky. On the other hand, the aquatic and terrestrial forms are, in many ways, like totally different organisms in the context of fish diets. Again, let's talk Tuesday.

kaijagahm commented 4 years ago

@ctsolomon a. Makes sense, that's what I thought. I just checked, and it looks like the only place where this actually occurs is with bivalves: we have "bivalvia" in supergroup even when there are lower classifications available like "sphaeriidae". It's easy enough to just remove "bivalvia" there.
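The rule agreed on in (a), to keep supergroup only when nothing lower is identified, could be applied along these lines. This is a hedged Python sketch, not the actual update script; the function name `clean_supergroup` and the list of lower-rank columns are assumptions based on the column names used in this thread.

```python
# Columns assumed to rank below supergroup.
LOWER_RANKS = ["orderSample", "family", "subfamily", "tribe", "genus"]


def clean_supergroup(row):
    """Return a copy of row with supergroup cleared if a lower rank is filled."""
    cleaned = dict(row)
    has_lower = any(cleaned.get(rank) for rank in LOWER_RANKS)
    if cleaned.get("supergroup") and has_lower:
        cleaned["supergroup"] = None
    return cleaned


# Hypothetical rows for illustration.
print(clean_supergroup({"supergroup": "bivalvia", "family": "sphaeriidae"}))
# {'supergroup': None, 'family': 'sphaeriidae'}
print(clean_supergroup({"supergroup": "nematoda"}))
# {'supergroup': 'nematoda'}
```

Returning a copy leaves the original row untouched, so the before and after values can be compared before anything is written back.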

kaijagahm commented 4 years ago

Fixed the ephemeroptera pupae mistake.

a) Fixed bivalve supergroups.

b) Added columns to OTU.

c) Not making any other changes to OTU now, e.g. adding a life stage column. Leaving as is. We can look into making this change later if we decide to.

d) Same as c; leaving as is.

Finished with this except for the few ID's I'm still waiting on Amaryllis for.

kaijagahm commented 4 years ago

Checked in with her again, but went ahead and just left those ID's as unknown with comments. Fixed up a few more mistakes, added a few more rows to OTU. Wrote out BENTHIC_INVERTS and OTU as working csv files so that Chris and Amaryllis can test them before I update the database.

kaijagahm commented 3 years ago

Going to do this as a new update, 3.5.3, since it involves the BENTHIC_INVERTS table, which is already touched in 3.5.2.

kaijagahm commented 3 years ago

Clearing out my google drive, so here's an excel version of the interactive google sheets document that Chris and I used to create the new taxonomy: BI_taxonomy_questions.xlsx