OpenTreeOfLife / feedback

No code -- just an issue tracker for general feedback (sent here via GitHub's issues API)

Feedback desired for the current TNRS bulk-mapping tool #440

Open jimallman opened 5 years ago

jimallman commented 5 years ago

I'm looking for feedback on the current version of the bulk TNRS mapping tool on devtree. In particular, I'm interested in these questions. Most deal with terminology -- you can see these terms used in the tool's Help tab.

  1. Is "nameset" a good name for the saved-session file? I also considered "session" and "project".

  2. I use "name" here to refer to each entry or listed name, i.e. its unmapped label, adjusted label (if any), and mapping details as described above. I chose "name" because we're not in the context of a particular tree. Would "label" or "OTU" be more appropriate?

  3. I assume that this is not an appropriate place to let the user add new taxa to OTT, esp. since they're probably working with unpublished data. Correct?

  4. Do we need more or different metadata? I'm planning to add the OTT version most recently used for mapping.

  5. Support for the local filesystem is limited by security considerations. Most importantly, we can't simply Save over an existing nameset file. Instead, most (all?) browsers will try to use the suggested name, but will increment or otherwise avoid overwriting an existing file. I'm hoping this is not a major point of confusion.

  6. Does this need more hand-holding, esp. when you first arrive to an empty nameset? I've tried to make the Help tab pretty chatty, and there are some prompts in the empty list area to get them started.

  7. What do you think of the link text and location on the main curation homepage? Does this make sense? Is there a better description that would draw you in?

  8. Output files inside the ZIP archive: I'm currently saving one TSV file with an initial header row, so a simple file looks like this:

    ORIGINAL LABEL    OTT TAXON NAME    OTT TAXON ID
    homo sapiens      Homo sapiens      770315
    hominids          Hominidae         770311
    bacteria          Bacteria          844192

    Un-mapped names are omitted. Does this make sense? Are there other formats you'd prefer?

  9. I had planned to save all input files (lists of names added using 'Add names...') in the nameset archive, but now it seems like useless clutter. What do you think? Is this important to establish provenance?
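A minimal sketch of how a script might consume the TSV described in item 8, assuming the columns are tab-separated and use exactly the three header names shown there. The function name read_mapped_names is hypothetical and not part of the tool.

```python
import csv
import io


def read_mapped_names(tsv_text):
    """Return {original label: (OTT taxon name, OTT taxon id)} from TSV text."""
    # DictReader takes the column names from the initial header row.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mapping = {}
    for row in reader:
        mapping[row["ORIGINAL LABEL"]] = (
            row["OTT TAXON NAME"],
            int(row["OTT TAXON ID"]),
        )
    return mapping


# The sample rows from item 8, with real tab characters between columns.
sample = (
    "ORIGINAL LABEL\tOTT TAXON NAME\tOTT TAXON ID\n"
    "homo sapiens\tHomo sapiens\t770315\n"
    "hominids\tHominidae\t770311\n"
    "bacteria\tBacteria\t844192\n"
)
names = read_mapped_names(sample)
```

Because un-mapped names are omitted from the file, a lookup for an un-mapped label would simply be missing from the resulting dictionary.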

snacktavish commented 5 years ago

So cool! It is looking great. I think it will be very popular.

Is "nameset" a good name for the saved-session file? I also considered "session" and "project".

What about "mappings"? Or "matches"?

I use "name" here to refer to each entry or listed name, i.e. its unmapped label, adjusted label (if any), and mapping details as described above. I chose "name" because we're not in the context of a particular tree. Would "label" or "OTU" be more appropriate?

I think "name" works, and "label" would be fine too.

I assume that this is not an appropriate place to let the user add new taxa to OTT, esp. since they're probably working with unpublished data. Correct?

I agree! To add a taxon you gotta give us your tree so we know what to do with it.

Do we need more or different metadata? I'm planning to add the OTT version most recently used for mapping.

OTT version sounds good; I can't think of anything else offhand.

Support for the local filesystem is limited by security considerations. Most importantly, we can't simply Save over an existing nameset file. Instead, most (all?) browsers will try to use the suggested name, but will increment or otherwise avoid overwriting an existing file. I'm hoping this is not a major point of confusion.

Seems fine to me.

Does this need more hand-holding, esp. when you first arrive to an empty nameset? I've tried to make the Help tab pretty chatty, and there are some prompts in the empty list area to get them started.

It seems like "add names" should be first, or bigger, because that is what we usually expect people to click, right? "Load nameset" will be rare (unless I'm forgetting something). I clicked "load nameset" first, to upload my names. Maybe the "add names" button should say "upload names"?

What do you think of the link text and location on the main curation homepage? Does this make sense? Is there a better description that would draw you in?

Maybe "name mapping" or "name matching" would be easier to understand? I'm not sure how much folks use the term TNRS. Maybe instead of being a tab at the top it could be a button next to "add studies" that says: "map names without adding a study"... ok, that's def too long. But maybe something like that? That way it is on the same level as adding a study rather than being a whole tab. It took me a moment to find it!

Un-mapped names are omitted. Does this make sense? Are there other formats you'd prefer?

I'd keep them in; I feel like it makes things more clear. They would still go into the JSON as "original labels" either way, right?

I had planned to save all input files (lists of names added using 'Add names...') in the nameset archive, but now it seems like useless clutter. What do you think? Is this important to establish provenance?

I'd keep all the names in the output TSV, and then it is all in one place.


Technical notes: On Firefox on Ubuntu, the popup windows don't close automatically. After the names have been uploaded, or after the zip has been downloaded, you need to close them manually with the little x.

When you unzip the zipped output files, the main.json file isn't in the folder, it just goes into the dir where you unzipped.

Does the file really have to have a txt or tsv extension? I tried to upload a plaintext file I had lazily saved without an extension.

Other ideas:

It might be fun to add a "download pruned subtree for these taxa" button. Or automatically show it, and the links to the studies that traverse it.

We could, if we felt like it, return gbif and ncbi ids from the taxonomy too. We could also make it easier to search on gbif or ncbi ids in the name mapping. I don't think this is urgent, but someone was asking about it on twitter today, and it has come up before. But maybe those folks can just use the taxon_info api call
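A rough sketch of the taxon_info route mentioned above, for someone who wants the NCBI or GBIF ids for a mapped OTT id. The endpoint URL and the "tax_sources" response field are assumptions here (neither is spelled out in this thread), and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for the taxon_info call; not given in this thread.
TAXON_INFO_URL = "https://api.opentreeoflife.org/v3/taxonomy/taxon_info"


def build_taxon_info_request(ott_id):
    """Build (but do not send) a JSON POST request for one OTT id."""
    body = json.dumps({"ott_id": ott_id}).encode("utf-8")
    return urllib.request.Request(
        TAXON_INFO_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def source_ids(taxon_info, wanted=("ncbi", "gbif")):
    """Pick the ids for the wanted prefixes out of an assumed 'tax_sources' list."""
    found = {}
    for entry in taxon_info.get("tax_sources", []):
        # Entries look like "ncbi:2"; split on the first colon only.
        prefix, _, identifier = entry.partition(":")
        if prefix in wanted:
            found[prefix] = identifier
    return found


def fetch_source_ids(ott_id):
    """Send the request and extract the ids (needs network access)."""
    with urllib.request.urlopen(build_taxon_info_request(ott_id)) as response:
        return source_ids(json.load(response))
```

The network call is kept in its own function so the request building and the id extraction can be checked offline.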

kcranston commented 5 years ago

Exciting! I can take a look at this tomorrow and leave more notes.

mtholder commented 5 years ago

I agree with @snacktavish's answers. On the saving issue: it might be nice to have a title-for-the-nameset and # of editing sessions as a string. So:

  1. Load the names
  2. Have an optional "title" edit box.
  3. When saving, if a title has been added, you can use title-editing-session-#.zip as the default name.
  4. Then, when loading a nameset, the session # could be incremented.

I'm just thinking that it might help folks to be able to see "3rd time this session was loaded" if they find that they have multiple .zip files on their machine. I hope that makes sense.
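The naming scheme in the steps above could be sketched like this. The function names, the fallback name "nameset.zip", and the rule for cleaning up the title are all assumptions for illustration, not how the tool behaves.

```python
import re


def default_archive_name(title, session_number):
    """Build '<title>-editing-session-<n>.zip', or a generic name if no title."""
    # Replace runs of characters that are awkward in filenames with a hyphen.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", (title or "").strip()).strip("-")
    if not slug:
        return "nameset.zip"  # hypothetical fallback when no usable title exists
    return f"{slug}-editing-session-{session_number}.zip"


def session_number_on_load(saved_session_number):
    """Loading a saved nameset starts the next editing session."""
    return saved_session_number + 1
```

For example, a nameset titled "Primate names" that was saved during session 2 and then reloaded would be offered the default name Primate-names-editing-session-3.zip on its next save.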

mtholder commented 5 years ago

I wonder if it might be easier to just have a "load" button, and then text that explains that if the file selected is a .zip, it is expected to be a saved session. If it is not a zip file, we expect a list of names (and your current instructions: "Your list should be a plain-text file where each line is a name to be mapped. Its text encoding should be utf-8."). The idea would be to just have one load button with an extension-based behavior (or ignore the extension, but use a cascade of: (1) see if the upload checks out as a saved session, otherwise (2) see if you can interpret it as a list of names).
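The extension-free cascade could look roughly like the sketch below. It assumes a saved session is a ZIP archive with main.json at its top level (as the unzip note earlier in the thread suggests) and that a name list is UTF-8 text with one name per line. The function name classify_upload is hypothetical.

```python
import io
import json
import zipfile


def classify_upload(data):
    """Return ('nameset', parsed main.json) or ('names', list of names)."""
    buffer = io.BytesIO(data)
    # Step 1: does the upload check out as a saved session?
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            if "main.json" in archive.namelist():
                session = json.loads(archive.read("main.json").decode("utf-8"))
                return "nameset", session
    # Step 2: otherwise try to read it as a plain-text list of names.
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    text = data.decode("utf-8")
    names = [line.strip() for line in text.splitlines() if line.strip()]
    return "names", names
```

Checking the content rather than the extension would also cover the case of a plain-text file saved without any extension.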

jimallman commented 5 years ago

Which of these formats would you prefer for listing multiple "upstream" taxonomic sources in TSV? There's a tension here between legibility of the raw file vs. ease of splitting+parsing the ids if we also need to trim whitespace.

ORIGINAL LABEL    TAXON NAME    OTT TAXON ID    TAXONOMIC SOURCES
bacteria          Bacteria      844192          silva:A16379/#1,ncbi:2,worms:6,gbif:3,irmng:13
bacteria          Bacteria      844192          silva:A16379/#1, ncbi:2, worms:6, gbif:3, irmng:13
bacteria          Bacteria      844192          silva:A16379/#1|ncbi:2|worms:6|gbif:3|irmng:13
bacteria          Bacteria      844192          silva:A16379/#1 | ncbi:2 | worms:6 | gbif:3 | irmng:13

Also, this format still privileges the OTT id over the others. Should we also (or only) combine this into the TAXONOMIC SOURCES column, something like ott:844192?

jar398 commented 5 years ago

I would like semicolons or vertical bars instead of commas, please (so that when the file gets converted to CSV no quote marks are needed). Semicolons are easier to read.

I guess this goes for the flags too, should those be included.

I think putting the primary source in its own column might be a good idea, to help people understand that the others are merely alignments. As in, one column for TAXONOMIC SOURCE and another for OTHER SOURCES (or ALIGNMENTS or something like that).
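The point about quote marks can be seen with Python's csv module, which by default quotes only the fields that contain the delimiter. This is just an illustration using a shortened sources value; the helper name to_csv_line is hypothetical.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one row as CSV with the writer's default (minimal) quoting."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(fields)
    return out.getvalue().rstrip("\n")


# Same row, with the sources joined by commas and then by semicolons.
comma_joined = to_csv_line(["bacteria", "Bacteria", "844192", "silva:A16379/#1,ncbi:2"])
semicolon_joined = to_csv_line(["bacteria", "Bacteria", "844192", "silva:A16379/#1;ncbi:2"])

print(comma_joined)      # the last field is wrapped in quote marks
print(semicolon_joined)  # no quote marks are needed
```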

(Sent by email, in reply to Jim Allman's comment of Apr 5, 2019, 1:19 PM, which appears in full above.)

jimallman commented 5 years ago

I would like semicolons or vertical bars instead of commas please. Semicolons are easier to read.

Can do! My main concern is of course that we choose a delimiter that will never appear in the identifiers, but legibility is a close second. Are you OK with a little whitespace? e.g. silva:A16379/#1; ncbi:2; worms:6; gbif:3; irmng:13

I think putting the primary source in its own column might be a good idea

I like this, but I don't think we currently distinguish the "primary" source vs. other alignments in the tnrs/match_names method. Unless it's always the first entry in the tax_sources list?

jar398 commented 5 years ago

Whitespace, hmm. As long as it's always there, I guess, so that value.split('; ') works. Another reason to avoid the vertical bar is that NCBI uses it as a column separator.

In the OTT taxonomy.tsv file, the primary or original source is always first. I don't know what tnrs/match_names does, but it seems likely it keeps the order. In the example you give, the order is the taxonomy.tsv order. It may be necessary to check the source code (or change it) to make sure.
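A small sketch of splitting a TAXONOMIC SOURCES value on '; ' and treating the first entry as the primary source. That ordering is only an assumption here: as noted above, it holds for taxonomy.tsv but has not been confirmed for tnrs/match_names. The function names are hypothetical.

```python
def split_sources(value, delimiter="; "):
    """Split a TAXONOMIC SOURCES cell into (primary source, other sources)."""
    entries = [entry for entry in value.split(delimiter) if entry]
    if not entries:
        return None, []
    # Assumption: the primary (original) source is listed first.
    return entries[0], entries[1:]


def as_pairs(entries):
    """Turn 'prefix:id' strings into (prefix, id) tuples, splitting on the first colon."""
    return [tuple(entry.split(":", 1)) for entry in entries]


primary, others = split_sources("silva:A16379/#1; ncbi:2; worms:6; gbif:3; irmng:13")
```

With a fixed '; ' delimiter there is no whitespace left to trim, but the split only works if the space is always present after each semicolon.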

jimallman commented 5 years ago

Thanks for your detailed response! I'm chasing this logic down now.

jar398 commented 5 years ago

Another thing (obviously I don't know the context, and you can ignore this): over the years I have come to prefer CSV to TSV. CSV is understood by more programs than TSV, the .csv extension has wider currency than .tsv, and so on. There's a SO question about this and I dislike most of the answers, but the answer I select is "I think that generally csv, are supported more often than the tsv format."

One reason I think is pretty important is that for the non-computer-literate, TSV might as well be binary, because that population has no clue what a 'tab character' is. This may be true even if they know a little perl or python. They will have a much better chance working with CSV in the tools they use (HTML form fields, github on-web page editing, etc) than TSV. This was our experience with the first OTT patch system. The open tree postdocs were totally clueless when it came to tabs.

The column alignment when displaying TSV is nice when it works but it rarely works; variation in field contents widths leads to weird ragged displays that can be pretty hard to read.

jimallman commented 3 years ago

Adding notes here as I resume work on bulk TNRS:

LunaSare commented 3 years ago

Hi @jimallman, The nameset tool is awesome! I have been trying CSV and TSV files to upload pre-mapped names after our meeting the other day. I tried different typos and namesets. This is my feedback: