Closed by ritagonmar 1 year ago
Definitely could be my problem. I'll check.
All the bad points I've seen so far are like this one where the misclassified abstract is not just long; it also has boldface/other formatting in it. Does that hold up with what you're seeing?
Yes. It could be that the XML parsing did not work correctly whenever an abstract had HTML tags inside. Did you compute the abstract lengths based on the abstract texts that you parsed yourself? Or did you use abstracts parsed by Rita?
Yeah, I computed based on 2023 abstracts parsed myself. I will look at what they are for these--it's possible it's an easy fix. Alternatively, if someone sends me a file that includes PMID and calculated length I can join that in instead.
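For reference, the join-by-PMID option could look roughly like the sketch below. The column names `pmid` and `abstract_length`, the sample values, and the helper `join_lengths` are all assumptions made for illustration; the real file layout may differ.

```python
import csv
import io

# Hypothetical contents of a file with precomputed lengths (made-up values).
LENGTHS_CSV = "pmid,abstract_length\n101,1520\n102,340\n"

# Hypothetical existing records whose lengths may be wrong.
points = [
    {"pmid": "101", "abstract_length": 12},
    {"pmid": "102", "abstract_length": 340},
    {"pmid": "103", "abstract_length": 77},
]

def join_lengths(points, lengths_file):
    """Overwrite abstract_length with the value from lengths_file when the PMID matches."""
    lookup = {
        row["pmid"]: int(row["abstract_length"])
        for row in csv.DictReader(lengths_file)
    }
    updated = []
    for point in points:
        new_point = dict(point)  # copy so the input list is left untouched
        if point["pmid"] in lookup:
            new_point["abstract_length"] = lookup[point["pmid"]]
        updated.append(new_point)
    return updated

joined = join_lengths(points, io.StringIO(LENGTHS_CSV))
```

Records whose PMID is missing from the file keep their old value here; whether to keep, drop, or flag them is a separate choice.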
I can send you that. I also had a similar parsing problem in the past, where it failed every time there was some kind of formatting (italics, bold, etc.), so maybe the same happened on your side. The parsing code on the other repository (https://github.com/berenslab/pubmed-landscape) should work, in case you want to have a look at it.
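One common way this kind of failure shows up (not confirmed to be what happened here) is reading only an element's `.text` in Python's `xml.etree.ElementTree`, which stops at the first inline tag. A minimal sketch with a made-up record; `SAMPLE`, `naive_text`, and `full_text` are illustrative names, not code from either repository:

```python
import xml.etree.ElementTree as ET

# Made-up abstract with inline formatting, shaped like a PubMed AbstractText element.
SAMPLE = (
    "<AbstractText>"
    "<b>Background:</b> We measured <i>in vivo</i> activity in 12 samples."
    "</AbstractText>"
)

def naive_text(elem):
    # .text holds only the characters before the first child element,
    # so an abstract that opens with <b>...</b> comes back empty.
    return elem.text or ""

def full_text(elem):
    # itertext() yields the text of the element and all of its descendants,
    # including the tails after inline tags, in document order.
    return "".join(elem.itertext())

elem = ET.fromstring(SAMPLE)
naive = naive_text(elem)
full = full_text(elem)

print(len(naive), len(full))
```

With the naive version, a long abstract containing bold or italic markup gets a very short computed length, which would match long, formatted abstracts landing in the shortest bin.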
Thanks for this file. I have updated and applied these changes in the new quadtiles being uploaded now.
This seems to work fine now, but it seems that the displayed histogram extends more to the right than the domain: [0, 500] specified in the .md file. It definitely extends more to the right than in Rita's figure in the paper.
Removed some data from the underlying file. Does this look right now?
Yes, looks great! (I realized that domain: [0, 500] was for the coloring only.) Closing.
There seems to be something wrong with the histogram of abstract length. Papers with the shortest abstracts (black points) seem to actually have long abstracts. Also, papers belonging to the same bar have clearly different abstract lengths. I checked this by clicking on them and reading their abstracts on the PubMed website. I double-checked on my side with a couple of examples, and the abstracts in the data I parsed are exactly the same as the ones displayed on the PubMed website. Therefore, I don't think the problem is in my data. Could there be something wrong in the computation of abstract length on your side?