bmschmidt / pubmed-explorer

Scrollership through 20m pubmed abstracts.

Incorrect links #63

Closed bmschmidt closed 1 year ago

bmschmidt commented 1 year ago

Super interesting - however some of the dots seem to have the wrong hyperlink embedded - just for example the paper "Vision Is Adapted to the Natural Level of Blur Present in the Retinal Image" PLoS One (2011) twitter

I've noticed this a couple times myself... links are constructed directly from the pubmedids, so I'm a little confused what's going on. Will check on the files, but maybe @ritagonmar you know whether PMIDs ever change or anything?

dkobak commented 1 year ago

This is odd. We should check what happened with this specific paper.

dkobak commented 1 year ago

I found another example, also from PLoS ONE: A non-invasive method of quantifying pancreatic volume in mice using micro-MRI. PloS one (2014) links to https://pubmed.ncbi.nlm.nih.gov/24642612/ which is a different paper.

bmschmidt commented 1 year ago

click_function: | window.open(`https://pubmed.ncbi.nlm.nih.gov/${datum.pmid}/`, '_blank')

That’s the link construction from pubmed.md—not much room for error. Unless pmids can have a leading zero, which I’m pretty sure they can’t, error is probably somewhere in the data pipeline. I’ll check some places later today. Would be good to find what the correct pubmed link is to see if the numbers look similar in some way.

dkobak commented 1 year ago

Correct Pubmed ID of that paper is https://pubmed.ncbi.nlm.nih.gov/24642611/

So 24642611 instead of 24642612. Very odd.

bmschmidt commented 1 year ago

Interesting

bmschmidt commented 1 year ago

OK I have a working theory--it looks like PMIDs are getting cast to single-precision floating point somewhere, which above a certain number (16_000_000 IIRC) loses the ability to represent all integers. Instead they're getting rounded to the nearest even integer.

D SELECT * FROM parquet_scan("pubmed.parquet") WHERE pmid = 24642611  LIMIT 10;
┌──────────────────────┬──────────────────────┬────────────┬───────────┬───────────┬───┬────────────┬───────────────────┬─────────────────┬──────────────────┬─────────────────────┬─────────┐
│        title         │       journal        │     x      │     y     │  tfidf.x  │ … │    pmid    │ GenderFirstAuthor │ abstract_length │ date_granularity │        date         │   ix    │
│       varchar        │       varchar        │   float    │   float   │   float   │   │   float    │      varchar      │      float      │     varchar      │      timestamp      │ uint64  │
├──────────────────────┼──────────────────────┼────────────┼───────────┼───────────┼───┼────────────┼───────────────────┼─────────────────┼──────────────────┼─────────────────────┼─────────┤
│ A non-invasive met…  │ PloS one             │ -45.963207 │ -8.722849 │ 19.990572 │ … │ 24642612.0 │ male              │           228.0 │ year             │ 2014-11-15 00:00:00 │ 7873914 │
│ Disconnection of t…  │ Korean journal of …  │  176.43718 │ -54.35127 │  72.74314 │ … │ 24642612.0 │ female            │            47.0 │ month            │ 2014-03-13 00:00:00 │ 1676868 │
│ Tungsten distribut…  │ PloS one             │  -85.59946 │ 105.30419 │ -55.51529 │ … │ 24642612.0 │ unknown           │           208.0 │ year             │ 2014-04-28 00:00:00 │   55690 │
├──────────────────────┴──────────────────────┴────────────┴───────────┴───────────┴───┴────────────┴───────────────────┴─────────────────┴──────────────────┴─────────────────────┴─────────┤
│ 3 rows                                                                                                                                                               17 columns (11 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
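
To illustrate the theory above, here is a minimal sketch (not the project's pipeline code) that round-trips integers through single precision using Python's standard `struct` module. The helper name `to_float32` is made up for this example. Float32 has a 24-bit significand, so the exact cutoff is 2**24 = 16,777,216; between that and 2**25 only even integers are representable, and ties round to the neighbour with an even significand.

```python
import struct


def to_float32(n: int) -> float:
    """Round-trip an integer through IEEE 754 single precision (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", float(n)))[0]


# Every integer up to 2**24 survives the cast unchanged.
assert to_float32(2**24) == 2**24

# The PMID from this thread is odd and above 2**24, so it cannot be stored exactly.
print(to_float32(24642611))  # 24642612.0

# Its neighbours collapse onto the same float32 value.
for pmid in (24642611, 24642612, 24642613):
    print(pmid, "->", int(to_float32(pmid)))
```

All three neighbouring IDs map to 24642612, which would be consistent with the query above returning three different papers under one `pmid`, though the thread does not confirm the original IDs of the other two rows.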

ritagonmar commented 1 year ago

I checked in my data and in the versions of the data I shared with you (csv and parquet) the PMIDs seem to be fine. For that specific PMID from above there is only one entry -- the correct one. The PMIDs are strings in the parquet version and they get automatically transformed to int64 when saving as csv, but they are still correct. In the table you posted they seem to be floats. Maybe there is something weird happening when you read the data? Maybe they get converted into floats in a weird way?

bmschmidt commented 1 year ago

Yes sorry @ritagonmar I should have said--this is definitely happening on my end of the pipeline. I have rebuilt all the tiles locally typing PMID as a string--which is usually better for identifiers anyway--and will upload sometime today.
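
As a rough sketch of the string-typed approach (the function names here are hypothetical and not taken from the repository), a link builder can keep the PMID as text end to end, and a small audit can flag IDs that a float32 column would have altered:

```python
import struct

PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov"


def pubmed_url(pmid: str) -> str:
    """Build a PubMed link from a PMID kept as a string (hypothetical helper)."""
    if not (pmid.isascii() and pmid.isdigit()):
        raise ValueError(f"not a numeric PMID: {pmid!r}")
    return f"{PUBMED_BASE}/{pmid}/"


def altered_by_float32(pmid: str) -> bool:
    """Return True if casting this PMID to single precision would change it."""
    as_float32 = struct.unpack("<f", struct.pack("<f", float(int(pmid))))[0]
    return int(as_float32) != int(pmid)


# Example IDs: the one from this thread, its even neighbour, and 2**24.
pmids = ["24642611", "24642612", "16777216"]
for pmid in pmids:
    print(pubmed_url(pmid), altered_by_float32(pmid))
```

Because the ID never passes through a numeric type on the way to the URL, a value like `"24642612.0"` is rejected instead of silently producing a link to the wrong paper.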

dkobak commented 1 year ago

I wanted to check if this is fixed by now, but cannot find those exact papers again :-/

bmschmidt commented 1 year ago

Not fixed yet, sorry! Just a long upload command I keep forgetting to make time for. I will run this morning once I get to the office.

bmschmidt commented 1 year ago

OK, I've uploaded new files and invalidated the cache. I've since clicked on some points with high, odd-numbered PMIDs and everything seems OK--going to close until any further reports.