bmschmidt / pubmed-explorer

Scrollership through 20m pubmed abstracts.

Problem with histogram of abstract length #42

Closed ritagonmar closed 1 year ago

ritagonmar commented 1 year ago

There seems to be something wrong with the histogram of abstract length. Papers with the shortest abstracts (black points) actually seem to have long abstracts. Also, papers belonging to the same bar have clearly different abstract lengths. I checked this by clicking on them and reading their abstracts on the PubMed website. I double-checked on my side with a couple of examples, and the abstracts in the data I parsed are exactly the same as the ones displayed on the PubMed website. Therefore, I don't think it is a problem with the data I have. Could there be something wrong in the computation of abstract length on your side?

bmschmidt commented 1 year ago

Definitely could be my problem. I'll check.

bmschmidt commented 1 year ago

All the bad points I've seen so far are like this one, where the misclassified abstract is not just long; it also has boldface or other formatting in it. Does that hold up with what you're seeing?

dkobak commented 1 year ago

Yes. It could be that the XML parsing did not work correctly whenever an abstract had HTML tags inside. Did you compute the abstract lengths based on the abstract texts that you parsed yourself? Or did you use abstracts parsed by Rita?
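A minimal sketch of how this kind of failure can arise, assuming the parser uses Python's `xml.etree.ElementTree` and reads an element's `.text` attribute (an assumption; the actual parsing code is not shown in this thread). In ElementTree, `.text` holds only the text before the first child element, so an abstract that opens with a bold label can come out as empty, while `itertext()` walks the text of all nested elements. The sample XML and the function names below are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical abstract element with inline formatting tags.
SAMPLE = (
    "<AbstractText><b>Background:</b> Long abstracts with "
    "<i>inline</i> markup are common.</AbstractText>"
)


def naive_text(elem):
    # .text is only the text that precedes the first child element,
    # so everything after an inline tag such as <b> or <i> is lost.
    return elem.text or ""


def full_text(elem):
    # itertext() yields the text and tail of every nested element in order.
    return "".join(elem.itertext())


elem = ET.fromstring(SAMPLE)
print("naive length:", len(naive_text(elem)))
print("full length: ", len(full_text(elem)))
```

If the lengths were computed from something like `naive_text`, long abstracts with formatting would land in the shortest bins, which would match the symptom described above.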

bmschmidt commented 1 year ago

Yeah, I computed them based on 2023 abstracts that I parsed myself. I will look at what the computed lengths are for these--it's possible it's an easy fix. Alternatively, if someone sends me a file that includes PMID and calculated length, I can join that in instead.
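The join described here could look roughly like the sketch below. It assumes a CSV file with columns named `pmid` and `abstract_length`; those column names, and the helper names `load_lengths` and `apply_lengths`, are hypothetical, since the layout of the shared file is not described in this thread.

```python
import csv
import io


def load_lengths(fileobj):
    # Build a PMID -> length lookup from a CSV with (assumed) columns
    # "pmid" and "abstract_length".
    reader = csv.DictReader(fileobj)
    return {row["pmid"]: int(row["abstract_length"]) for row in reader}


def apply_lengths(records, lengths):
    # Return copies of the records with abstract_length overwritten
    # wherever the lookup has an entry for that PMID; records without
    # a match keep their existing value.
    updated = []
    for rec in records:
        new_rec = dict(rec)
        if rec["pmid"] in lengths:
            new_rec["abstract_length"] = lengths[rec["pmid"]]
        updated.append(new_rec)
    return updated


# Hypothetical data for illustration only.
corrections = io.StringIO("pmid,abstract_length\n111,1520\n222,87\n")
records = [
    {"pmid": "111", "abstract_length": 0},
    {"pmid": "333", "abstract_length": 42},
]
print(apply_lengths(records, load_lengths(corrections)))
```

Keeping unmatched records unchanged is one possible choice; dropping them or flagging them would be equally reasonable, depending on how complete the corrections file is.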

ritagonmar commented 1 year ago

I can send you that. I also had a similar parsing problem in the past, where it failed every time there was some kind of formatting (italics, bold, etc.), so maybe the same happened on your side. The parsing code on the other repository (https://github.com/berenslab/pubmed-landscape) should work, in case you want to have a look at it.

dkobak commented 1 year ago

https://www.dropbox.com/s/tm2nu5tzb9pl9g7/pubmed_abstract_length.zip?dl=0

bmschmidt commented 1 year ago

Thanks for this file. I have updated and applied these changes in the new quadtiles being uploaded now.

dkobak commented 1 year ago

This seems to work fine now, but the displayed histogram extends further to the right than the domain: [0, 500] specified in the .md file. It definitely extends further to the right than in Rita's figure in the paper.

bmschmidt commented 1 year ago

Removed some data from the underlying file. Does this look right now?

dkobak commented 1 year ago

Yes, looks great! (I realized that domain:[0,500] was for the coloring only). Closing.