Feedback desired for the current TNRS bulk-mapping tool

jimallman commented 5 years ago

I'm looking for feedback on the current version of the bulk TNRS mapping tool on devtree. In particular, I'm interested in these questions. Most deal with terminology -- you can see these terms used in the tool's Help tab.

- Is "nameset" a good name for the saved-session file? I also considered "session" and "project".
- I use "name" here to refer to each entry or listed name, i.e. its unmapped label, adjusted label (if any), and mapping details as described above. I chose "name" because we're not in the context of a particular tree. Would "label" or "OTU" be more appropriate?
- I assume that this is not an appropriate place to let the user add new taxa to OTT, esp. since they're probably working with unpublished data. Correct?
- Do we need more or different metadata? I'm planning to add the OTT version most recently used for mapping.
- Support for the local filesystem is limited by security considerations. Most importantly, we can't simply Save over an existing nameset file. Instead, most (all?) browsers will try to use the suggested name, but will increment or otherwise avoid overwriting an existing file. I'm hoping this is not a major point of confusion.
- Does this need more hand-holding, esp. when you first arrive to an empty nameset? I've tried to make the Help tab pretty chatty, and there are some prompts in the empty list area to get them started.
- What do you think of the link text and location on the main curation homepage? Does this make sense? Is there a better description that would draw you in?
- Output files inside the ZIP archive: I'm currently saving one TSV file with an initial header row, so a simple file looks like this:
  ```
  ORIGINAL LABEL    OTT TAXON NAME  OTT TAXON ID
  homo sapiens  Homo sapiens    770315
  hominids  Hominidae   770311
  bacteria  Bacteria    844192
  ```
  Un-mapped names are omitted. Does this make sense? Are there other formats you'd prefer?
- I had planned to save all input files (lists of names added using 'Add names...') in the nameset archive, but now it seems like useless clutter. What do you think? Is this important to establish provenance?
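
For anyone consuming the exported file, here is a minimal Python sketch of reading the TSV output described above. It assumes real tab characters between columns and the three header names from the example; the function name `read_mappings` and the inline sample are illustrative, not part of the tool.

```python
import csv
import io

# Sample export, assuming tab-separated columns as described in the thread.
SAMPLE = (
    "ORIGINAL LABEL\tOTT TAXON NAME\tOTT TAXON ID\n"
    "homo sapiens\tHomo sapiens\t770315\n"
    "hominids\tHominidae\t770311\n"
    "bacteria\tBacteria\t844192\n"
)

def read_mappings(text):
    """Return {original label: (OTT taxon name, OTT taxon id)} from exported TSV text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    mappings = {}
    for row in reader:
        mappings[row["ORIGINAL LABEL"]] = (
            row["OTT TAXON NAME"],
            int(row["OTT TAXON ID"]),
        )
    return mappings

mappings = read_mappings(SAMPLE)
print(mappings["hominids"])
```

If un-mapped names were kept in the file with empty name and id cells, the `int(...)` conversion would need a guard for empty strings.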

snacktavish commented 5 years ago

So cool! It is looking great. I think it will be very popular.

> Is "nameset" a good name for the saved-session file? I also considered "session" and "project".

What about 'mappings' or 'matches'?

> I use "name" here to refer to each entry or listed name, i.e. its unmapped label, adjusted label (if any), and mapping details as described above. I chose "name" because we're not in the context of a particular tree. Would "label" or "OTU" be more appropriate?

I think name works, and label would be fine too.

> I assume that this is not an appropriate place to let the user add new taxa to OTT, esp. since they're probably working with unpublished data. Correct?

I agree! To add a taxon you gotta give us your tree so we know what to do with it.

> Do we need more or different metadata? I'm planning to add the OTT version most recently used for mapping.

OTT version sounds good, I can't think of anything else offhand.

> Support for the local filesystem is limited by security considerations. Most importantly, we can't simply Save over an existing nameset file. Instead, most (all?) browsers will try to use the suggested name, but will increment or otherwise avoid overwriting an existing file. I'm hoping this is not a major point of confusion.

Seems fine to me.

> Does this need more hand-holding, esp. when you first arrive to an empty nameset? I've tried to make the Help tab pretty chatty, and there are some prompts in the empty list area to get them started.

It seems like add names should be first, or bigger, because that is what we usually expect people to click, right? Load nameset will be rare (unless I'm forgetting something). I clicked load nameset first, to upload my names. Maybe the add names button should say "upload names"?

> What do you think of the link text and location on the main curation homepage? Does this make sense? Is there a better description that would draw you in?

Maybe "name mapping" or "name matching" would be easier to understand? I'm not sure how much folks use the term TNRS. Maybe instead of being a tab at the top it could be a button next to "add studies" that says: "map names without adding a study"... ok that's def too long. But maybe something like that? So that it is on the same level as adding a study rather than being a whole tab. It took me a moment to find it!

> Un-mapped names are omitted. Does this make sense? Are there other formats you'd prefer?

I'd keep them in, I feel like it makes things more clear. They would still go into the json as 'original labels' either way, right?

> I had planned to save all input files (lists of names added using 'Add names...') in the nameset archive, but now it seems like useless clutter. What do you think? Is this important to establish provenance?

I'd keep all the names in the output tsv, and then it is all in one place.

Technical notes: On Firefox on Ubuntu the popup windows don't close automatically. For example, after the names have been uploaded, or after the zip has been downloaded, you need to close them manually with the little x.

When you unzip the zipped output files, the main.json file isn't in the folder, it just goes into the dir where you unzipped.

Does the file really have to have a txt or tsv extension? I tried to upload a plaintext file I had lazily saved without an extension.

Other ideas:

It might be fun to add a "download pruned subtree for these taxa" button. Or automatically show it, and the links to the studies that traverse it.

We could, if we felt like it, return gbif and ncbi ids from the taxonomy too. We could also make it easier to search on gbif or ncbi ids in the name mapping. I don't think this is urgent, but someone was asking about it on twitter today, and it has come up before. But maybe those folks can just use the taxon_info api call

kcranston commented 5 years ago

Exciting! I can take a look at this tomorrow and leave more notes.

mtholder commented 5 years ago

I agree with @snacktavish's answers. On the saving issue: it might be nice to have a title-for-the-nameset and # of editing sessions as a string. So:

- Load the names.
- Have an optional "title" edit box.
- When saving, if a title has been added, you can use title-editing-session-#.zip as the default name.
- Then, when loading a nameset, the session # could be incremented.

I'm just thinking that it might help folks to be able to see "3rd time this session was loaded" if they find that they have multiple .zip files on their machine. I hope that makes sense.
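
The naming scheme above could be sketched roughly as follows in Python. This is only an illustration: the function names, the slug rules, and the "nameset" fallback for an empty title are assumptions, not anything the tool currently does.

```python
import re

def default_archive_name(title, session_number):
    """Build a suggested download name like 'my-title-editing-session-3.zip'."""
    # Collapse runs of non-alphanumeric characters into single hyphens
    # so the suggested name is safe on most filesystems.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title.strip()).strip("-").lower()
    if not slug:
        slug = "nameset"  # assumed fallback when no title was entered
    return f"{slug}-editing-session-{session_number}.zip"

def next_session_number(loaded_session_number):
    """Increment the stored counter each time a saved nameset is loaded."""
    return loaded_session_number + 1

print(default_archive_name("Primate names", next_session_number(2)))
```

The counter would have to be stored in the nameset metadata so that it survives a save and reload cycle.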

mtholder commented 5 years ago

I wonder if it might be easier to just have a "load" button, plus text explaining that if the selected file is a .zip, it is expected to be a saved session. If it is not a zip file, we expect a list of names (and your current instructions apply: "Your list should be a plain-text file where each line is a name to be mapped. Its text encoding should be utf-8."). The idea would be to have just one load button with extension-based behavior (or ignore the extension and use a cascade: (1) see if the upload checks out as a saved session; otherwise (2) see if you can interpret it as a list of names).
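
A minimal Python sketch of that cascade, ignoring the file extension entirely. It assumes a saved session is a ZIP containing a `main.json` file (the file name mentioned earlier in this thread); the function name `classify_upload` and the return shape are hypothetical.

```python
import io
import json
import zipfile

def classify_upload(data):
    """Return ('nameset', parsed JSON) for a saved session, or ('names', list) for a plain list."""
    buffer = io.BytesIO(data)
    # Step 1: see whether the upload checks out as a saved session.
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            for entry in archive.namelist():
                # Accept main.json at the top level or inside a folder.
                if entry == "main.json" or entry.endswith("/main.json"):
                    return "nameset", json.loads(archive.read(entry).decode("utf-8"))
        raise ValueError("ZIP archive does not contain main.json")
    # Step 2: otherwise, try to interpret it as a UTF-8 list of names, one per line.
    text = data.decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8
    names = [line.strip() for line in text.splitlines() if line.strip()]
    if not names:
        raise ValueError("no names found in upload")
    return "names", names
```

Because the check looks at content and not the extension, a plain-text list saved without a .txt or .tsv extension would still load.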

jimallman commented 5 years ago

Which of these formats would you prefer for listing multiple "upstream" taxonomic sources in TSV? There's a tension here between legibility of the raw file vs. ease of splitting+parsing the ids if we also need to trim whitespace.

ORIGINAL LABEL     TAXON NAME       OTT TAXON ID        TAXONOMIC SOURCES
bacteria       Bacteria     844192              silva:A16379/#1,ncbi:2,worms:6,gbif:3,irmng:13
bacteria       Bacteria     844192              silva:A16379/#1, ncbi:2, worms:6, gbif:3, irmng:13
bacteria       Bacteria     844192              silva:A16379/#1|ncbi:2|worms:6|gbif:3|irmng:13
bacteria       Bacteria     844192              silva:A16379/#1 | ncbi:2 | worms:6 | gbif:3 | irmng:13

Also, this format still privileges the OTT id over the others. Should we also (or only) combine this into the TAXONOMIC SOURCES column, something like ott:844192?

jar398 commented 5 years ago

I would like semicolons or vertical bars instead of commas please. Semicolons are easier to read.

(so that when the file gets converted to CSV no quote marks are needed) I guess this goes for the flags too, should those be included.

I think putting the primary source in its own column might be a good idea, to help people understand that the others are merely alignments. As in, one column for TAXONOMIC SOURCE and another for OTHER SOURCES (or ALIGNMENTS or something like that).
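
The point about quote marks can be seen with Python's standard `csv` module, used here only as an illustration (the helper `to_csv_line` is hypothetical). With the default minimal quoting, a field containing commas must be quoted, while the same field with semicolons is written as-is.

```python
import csv
import io

def to_csv_line(fields):
    """Serialize one row with Python's default CSV dialect (minimal quoting)."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(fields)
    return out.getvalue().rstrip("\n")

with_commas = to_csv_line(["bacteria", "Bacteria", "844192", "ncbi:2,worms:6,gbif:3"])
with_semicolons = to_csv_line(["bacteria", "Bacteria", "844192", "ncbi:2;worms:6;gbif:3"])

print(with_commas)      # the last field is wrapped in double quotes
print(with_semicolons)  # no quote marks are needed
```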

(In reply to Jim Allman's message of Apr 5, 2019, 1:19 PM, quoted in full above.)

jimallman commented 5 years ago

> I would like semicolons or vertical bars instead of commas please. Semicolons are easier to read.

Can do! My main concern is of course that we choose a delimiter that will never appear in the identifiers, but legibility is a close second. Are you OK with a little whitespace? e.g. silva:A16379/#1; ncbi:2; worms:6; gbif:3; irmng:13

> I think putting the primary source in its own column might be a good idea

I like this, but I don't think we currently distinguish the "primary" source vs. other alignments in the tnrs/match_names method. Unless it's always the first entry in the tax_sources list?

jar398 commented 5 years ago

Whitespace, hmm. As long as it's always there, I guess: value.split('; '). Another reason to avoid the vertical bar is that NCBI uses it as a column separator.

In the OTT taxonomy.tsv file, the primary or original source is always first. I don't know what tnrs/match_names does, but it seems likely that it keeps the order. In the example you give, the order is the taxonomy.tsv order. It would be necessary to check the source code (or change it) to make sure.
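
A small Python sketch of parsing such a cell. Splitting on the bare semicolon and then stripping each piece sidesteps the question of whether the space is always present. Treating the first entry as the primary source is an assumption here: it holds for taxonomy.tsv as described above, but is unverified for tnrs/match_names. The function name `split_sources` is hypothetical.

```python
def split_sources(cell):
    """Split a TAXONOMIC SOURCES cell into (primary source, other alignments)."""
    # Split on ';' and strip each piece, so 'a; b' and 'a;b' parse the same way.
    sources = [part.strip() for part in cell.split(";") if part.strip()]
    if not sources:
        return None, []
    # Assumption: the first entry is the primary source (taxonomy.tsv order).
    return sources[0], sources[1:]

primary, others = split_sources("silva:A16379/#1; ncbi:2; worms:6; gbif:3; irmng:13")
print(primary)
print(others)
```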

jimallman commented 5 years ago

Thanks for your detailed response! I'm chasing this logic down now.

jar398 commented 5 years ago

Another thing, and obviously I don't know the context and you can ignore this, over the years I have come to prefer CSV to TSV. CSV is understood by more programs than TSV, the .csv extension has wider currency than .tsv, and so on. There's a SO question about this and I dislike most of the answers, but the answer I select is "I think that generally csv, are supported more often than the tsv format."

One reason I think is pretty important is that for the non-computer-literate, TSV might as well be binary, because that population has no clue what a 'tab character' is. This may be true even if they know a little perl or python. They will have a much better chance working with CSV in the tools they use (HTML form fields, github on-web page editing, etc) than TSV. This was our experience with the first OTT patch system. The open tree postdocs were totally clueless when it came to tabs.

The column alignment when displaying TSV is nice when it works but it rarely works; variation in field contents widths leads to weird ragged displays that can be pretty hard to read.

jimallman commented 3 years ago

Adding notes here as I resume work on bulk TNRS:

We've received reports of failing ZIP archives uploaded from MacOS Finder. These were created with the Mac's built-in Archive Utility (by ctrl-clicking a folder and choosing "Compress..."). Possible causes include
- a known issue when creating very large (>4GB) archives on Mac; this seems unlikely, but maybe conceivable for a very large nameset..?
- MacOS Finder (Archive Utility) creates archives with hidden __MACOSX/ directories that can confound other ZIP utilities. The best workaround is probably to generate ZIP archives using command-line tools as described here.
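
If the hidden directories do turn out to be the cause (not yet confirmed), one option on the reading side is to ignore them when listing archive members. Here is a minimal Python sketch using the standard `zipfile` module; the function name `real_entries` is hypothetical, and the tool itself may read archives quite differently.

```python
import io
import zipfile

def real_entries(archive):
    """List archive members, skipping macOS metadata and directory entries."""
    keep = []
    for name in archive.namelist():
        parts = name.split("/")
        if "__MACOSX" in parts:
            continue  # resource-fork folder added by Archive Utility
        if parts[-1] == ".DS_Store" or parts[-1].startswith("._"):
            continue  # Finder metadata files
        if name.endswith("/"):
            continue  # directory entry, not a file
        keep.append(name)
    return keep
```

Compressing a folder also nests main.json one level down (for example `nameset/main.json`), so a loader would need to look for it by file name and not only at the top level.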

LunaSare commented 3 years ago

Hi @jimallman, The nameset tool is awesome! I have been trying CSV and TSV files to upload pre-mapped names after our meeting the other day. I tried different typos and namesets. This is my feedback:

- Adjusted labels are not included in the nameset --> Could we include an adjusted label value in the pre-mapped nameset to populate the "Modified for mapping" column? Alternatively, no value should appear in that column if it was not provided by the curator.
- It does not like it when a CSV or TSV nameset has a single line; it gives a message saying that no matching labels were found --> Allow single-line nameset files?
- As you pointed out in our meeting, the OTT taxon name can be anything. It will still link to the taxon page associated with the given OTT id, but it is potentially confusing for users. Also, would downloading the tree with OTT taxon names show the name from the nameset file, or the one associated with the OTT id? --> Is it possible to infer the OTT name from the given OTT id to populate the "Mapped to taxon" column in curation?
- Names that are mapped on the curation site are automatically preferred over pre-mapped names --> Should we give users a chance to review them and choose the one they prefer? Add curator notes?
- When the nameset contains one or all names that have already been successfully matched with the pre-mapping tool, it says "only N were successfully matched" when there are N names already pre-mapped, or "no matches were found" when all names have already been successfully pre-mapped --> This gives the idea that mapping has failed. Change the message to "N have already been successfully matched"?
- When the OTT id is not correct, does not exist, or has a typo, the tool says that it was successfully mapped --> Is it possible to check that the OTT id exists and then show a fail message if it does not?
- TSV namesets do not seem to work well.
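
Two of the requests above (checking that an OTT id exists, and inferring the taxon name from the id) could be handled in one pass. The Python sketch below is only an illustration under stated assumptions: `check_premapped` and `lookup_name` are hypothetical names, and `lookup_name` stands in for whatever id-to-name service is used (possibly the taxon_info API call mentioned earlier in this thread). It is stubbed with a dictionary here so the example runs offline.

```python
def check_premapped(rows, lookup_name):
    """Split pre-mapped (label, ott_id) rows into (accepted, rejected)."""
    accepted, rejected = [], []
    for label, ott_id in rows:
        # Reject ids that are not positive integers (typos such as '77o315').
        if not str(ott_id).isdecimal() or int(ott_id) <= 0:
            rejected.append((label, ott_id, "not a valid integer id"))
            continue
        name = lookup_name(int(ott_id))
        if name is None:
            rejected.append((label, ott_id, "no such OTT id"))
        else:
            # Use the name inferred from the id, not whatever the file supplied.
            accepted.append((label, int(ott_id), name))
    return accepted, rejected

# Stub lookup for illustration only; a real one would query the taxonomy.
known = {770315: "Homo sapiens"}
accepted, rejected = check_premapped(
    [("homo sapiens", "770315"), ("typo", "77o315"), ("ghost", "999999999")],
    known.get,
)
print(accepted)
print(rejected)
```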

OpenTreeOfLife / feedback

Feedback desired for the current TNRS bulk-mapping tool #440