bmschmidt / pubmed-explorer

Scrollership through 20m pubmed abstracts.

Narration order in the beginning (until Covid) #17

Closed by dkobak 1 year ago

dkobak commented 1 year ago

I am not sure what the best narration order in the beginning is; let's discuss. Currently it looks like this:

1. Intro
  1.1. Intro intro
  1.2. What we did method-wise: what the data source is, dataset size, methods (PubMedBERT/tSNE)
  1.3. Table of contents with links to jump to specific section of the narration
2. Barplot with years
3. Embedding colored by years
4. List of labels with possibility to highlight one label
5. Toggle between BERT and TF-IDF, some comments on kNN accuracy
6. Extracted sample sizes
7. Abstract length
8. Title length
9. Zoom-in to virology to show temporal trends
10. Move to Covid.

I feel that this is not ideal, as it jumps from one thing to the next a bit too much. I would prefer that the year barplot (which is very cool) be followed by the embedding colored by years, and that we then zoom in to virology. So go to step 9 directly after step 3.

So what to do with the rest? We could show the list of labels BEFORE the years, because the label colors are anyway already there in the intro.

The TF-IDF toggle I would suggest moving all the way down, as some kind of Appendix. I think it may be overwhelming to show it in the beginning already, especially given that the rest of the narration won't use it. If we keep it at the front, then it should move before the Years.

Sample sizes -- here I am not sure; I would consider removing them altogether, or moving them to the Appendix too. If we keep them in the Intro, then I think they should be before the Years.

Abstract length and title length -- I would only keep one of them. Maybe title length? And also move them before the Years.

So here is my suggestion:

1. Intro
  1.1. Intro intro
  1.2. What we did method-wise: what the data source is, dataset size, methods (PubMedBERT/tSNE)
  1.3. Table of contents with links to jump to specific section of the narration
2. List of labels with possibility to highlight one label
3. Toggle between BERT and TF-IDF, some comments on kNN accuracy (could make sense after item 2) (? -- not sure)
4. Extracted sample sizes (? -- not sure)
5. Title length (? -- not sure)
6. Barplot with years
7. Embedding colored by years
8. Zoom-in to virology to show temporal trends
9. Move to Covid.

bmschmidt commented 1 year ago

I want to put in a histogram of abstract lengths as well, because the figure showing the peaks at 250, 300, etc. is one of my favorites in the paper. So that would argue for keeping abstract length.

bmschmidt commented 1 year ago

I think you're right that talking about time and then transitioning to COVID is the right order. It's possible that we should add a bit more macro-structure stuff early on--some (or all) of the nine-panel words that show up on different regions in the abstracts.

ritagonmar commented 1 year ago

I think I agree, and I would keep abstract length rather than title length. Including the histogram as Ben suggests would also be cool. Note, though, that right now the figure of the embedding colored by abstract length is based on the number of characters, while the histogram with the peaks is based on the number of words. I could change the figure, or provide you with colormap values based on the number of words instead of the number of characters. About the TF-IDF toggle, I think I would move it to the end because, as Dmitry said, it is not used in the rest of the narration and could be confusing. I think adding the nine-panel words that show up on different regions earlier is also a good idea.
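To make the characters-versus-words mismatch concrete, here is a minimal sketch of how both length measures and a binned histogram could be computed. The function names, the whitespace-based word split, and the toy abstracts are all assumptions for illustration, not what the paper or the repo actually uses.

```python
from collections import Counter


def abstract_lengths(abstracts):
    """Return (char_counts, word_counts) for a list of abstract strings.

    Words are split on whitespace, which is an assumption; the paper's
    tokenization may differ.
    """
    char_counts = [len(a) for a in abstracts]
    word_counts = [len(a.split()) for a in abstracts]
    return char_counts, word_counts


def histogram(values, bin_width=50):
    """Count values per bin; keys are the left edges of the bins."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Toy stand-ins for real abstracts (hypothetical data).
abstracts = ["word " * 250, "word " * 250, "word " * 300, "short abstract here"]
chars, words = abstract_lengths(abstracts)
word_hist = histogram(words, bin_width=50)
```

Coloring the embedding by `words` rather than `chars` would put the map and the histogram on the same scale, so peaks at round word limits would be visible in both.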

New suggestion:

1. Intro
  1.1. Intro intro
  1.2. What we did method-wise: what the data source is, dataset size, methods (PubMedBERT/tSNE)
  1.3. Table of contents with links to jump to specific section of the narration
2. List of labels with possibility to highlight one label
3. Nine-panel words that show up on different regions
4. Extracted sample sizes (? -- not sure)
5. Abstract length barplot
6. Abstract length
7. Barplot with years
8. Embedding colored by years
9. Zoom-in to virology to show temporal trends
10. Move to Covid.
...
N. Toggle between BERT and TF-IDF, some comments on kNN accuracy

dkobak commented 1 year ago

I made some edits now to the first several slides. But I haven't yet changed the order of the slides -- I can continue working on this tomorrow.

> It's possible that we should add a bit more macro-structure stuff early on--some (or all) of the nine-panel words that show up on different regions in the abstracts.

We could do that, but I am also fine with leaving it out, at least for now. I think it's important to explain labels, and then we can jump directly to coloring by years.

> I want to put in a histogram of abstract lengths as well, because the figure showing the peaks at 250, 300, etc. is one of my favorites in the paper. So that would argue for keeping abstract length.

That's funny :) I don't mind. Is it easy for you to insert this histogram? Could be a nice gimmick, but I think it's optional.

Title lengths and sample sizes I would remove.

bmschmidt commented 1 year ago

I just published your changes and made some minor edits (mostly to the table of contents portion) so things would still run. I'm going to avoid editing the big markdown file pending @dkobak's reorganization to avoid getting stuck in git merge resolution hell.

dkobak commented 1 year ago

This seems mostly done by now, so I'm closing it, but @bmschmidt could you upload the current version of the scrollership to the GitHub repo? You added the abstract length histogram to the website, but it's not in the repo.