Closed bmschmidt closed 1 year ago
This is odd. We should check what happened with this specific paper.
I found another example, also from PLoS ONE: A non-invasive method of quantifying pancreatic volume in mice using micro-MRI. PloS one (2014)
links to https://pubmed.ncbi.nlm.nih.gov/24642612/ which is a different paper.
click_function: |
  window.open(`https://pubmed.ncbi.nlm.nih.gov/${datum.pmid}/`, '_blank')
That's the link construction from pubmed.md, so there is not much room for error. Unless PMIDs can have a leading zero, which I'm pretty sure they can't, the error is probably somewhere in the data pipeline. I'll check some places later today. It would be good to find the correct PubMed link to see whether the numbers look similar in some way.
I found another example, also from PLoS ONE: A non-invasive method of quantifying pancreatic volume in mice using micro-MRI. PloS one (2014) links to https://pubmed.ncbi.nlm.nih.gov/24642612/ which is a different paper.
The correct PubMed ID of that paper is https://pubmed.ncbi.nlm.nih.gov/24642611/
So 24642611 instead of 24642612. Very odd.
Interesting
OK, I have a working theory: it looks like PMIDs are getting cast to single-precision floating point somewhere. Above a certain number (16_000_000 IIRC; the exact cutoff is 2^24 = 16,777,216) single-precision floats lose the ability to represent all integers. Instead, in this range they're getting rounded to the nearest even integer.
D SELECT * FROM parquet_scan("pubmed.parquet") WHERE pmid = 24642611 LIMIT 10;
┌──────────────────────┬──────────────────────┬────────────┬───────────┬───────────┬───┬────────────┬───────────────────┬─────────────────┬──────────────────┬─────────────────────┬─────────┐
│ title │ journal │ x │ y │ tfidf.x │ … │ pmid │ GenderFirstAuthor │ abstract_length │ date_granularity │ date │ ix │
│ varchar │ varchar │ float │ float │ float │ │ float │ varchar │ float │ varchar │ timestamp │ uint64 │
├──────────────────────┼──────────────────────┼────────────┼───────────┼───────────┼───┼────────────┼───────────────────┼─────────────────┼──────────────────┼─────────────────────┼─────────┤
│ A non-invasive met… │ PloS one │ -45.963207 │ -8.722849 │ 19.990572 │ … │ 24642612.0 │ male │ 228.0 │ year │ 2014-11-15 00:00:00 │ 7873914 │
│ Disconnection of t… │ Korean journal of … │ 176.43718 │ -54.35127 │ 72.74314 │ … │ 24642612.0 │ female │ 47.0 │ month │ 2014-03-13 00:00:00 │ 1676868 │
│ Tungsten distribut… │ PloS one │ -85.59946 │ 105.30419 │ -55.51529 │ … │ 24642612.0 │ unknown │ 208.0 │ year │ 2014-04-28 00:00:00 │ 55690 │
├──────────────────────┴──────────────────────┴────────────┴───────────┴───────────┴───┴────────────┴───────────────────┴─────────────────┴──────────────────┴─────────────────────┴─────────┤
│ 3 rows 17 columns (11 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
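The rounding in the table above can be reproduced in a few lines. This is only a sketch, assuming the cast behaves like a standard IEEE 754 float32 conversion; `asFloat32` is a hypothetical helper name, and `Math.fround` is the standard JavaScript function that rounds a number to the nearest float32. It does not show where in the pipeline the cast happens.

```javascript
// Round a number the way a 32-bit float column would store it.
// `asFloat32` is a hypothetical helper, not part of the pipeline.
function asFloat32(n) {
  return Math.fround(n);
}

// float32 has a 24-bit significand, so every integer up to 2^24 is exact.
const cutoff = 2 ** 24; // 16777216
console.log(asFloat32(cutoff - 1)); // 16777215 (still exact)
console.log(asFloat32(cutoff + 1)); // 16777216 (odd value lost)

// Between 2^24 and 2^25 only even integers are representable,
// so the odd PMID from this thread lands on its even neighbour.
console.log(asFloat32(24642611)); // 24642612

// Neighbouring IDs collapse onto the same stored value, which would be
// consistent with several distinct rows showing pmid 24642612.0.
const collapsed = [24642611, 24642612, 24642613].map(asFloat32);
console.log(collapsed); // [ 24642612, 24642612, 24642612 ]
```

Note that the query literal `24642611` would be subject to the same rounding when compared against a float column, which may be why the filter matches rows displayed as 24642612.0.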
I checked in my data and in the versions of the data I shared with you (csv and parquet) the PMIDs seem to be fine. For that specific PMID from above there is only one entry -- the correct one. The PMIDs are strings in the parquet version and they get automatically transformed to int64 when saving as csv, but they are still correct. In the table you posted they seem to be floats. Maybe there is something weird happening when you read the data? Maybe they get converted into floats in a weird way?
Yes, sorry @ritagonmar, I should have said: this is definitely happening on my end of the pipeline. I have rebuilt all the tiles locally typing PMID as a string, which is usually better for identifiers anyway, and will upload sometime today.
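As a rough illustration of why a string-typed identifier avoids the problem, here is a sketch of the link construction with the PMID kept as a string end to end. The `pubmedUrl` helper and the shape of `datum` are assumptions for illustration; the actual tile-building code is not shown in this thread.

```javascript
// Hypothetical record; only the string-typed `pmid` field matters here.
const datum = {
  title: "A non-invasive method of quantifying pancreatic volume in mice using micro-MRI",
  pmid: "24642611",
};

// Build the PubMed link, refusing anything that is not a digit string,
// so a PMID that was turned into a number upstream fails loudly
// instead of silently producing a link to a different paper.
function pubmedUrl(pmid) {
  if (typeof pmid !== "string" || !/^[0-9]+$/.test(pmid)) {
    throw new TypeError(`expected PMID as a digit string, got ${JSON.stringify(pmid)}`);
  }
  return `https://pubmed.ncbi.nlm.nih.gov/${pmid}/`;
}

console.log(pubmedUrl(datum.pmid)); // https://pubmed.ncbi.nlm.nih.gov/24642611/
```

Because the value is never parsed as a number, no numeric rounding can apply to it, and the type check turns an accidental numeric cast into an immediate error.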
I wanted to check if this is fixed by now, but cannot find those exact papers again :-/
Not fixed yet, sorry! Just a long upload command I keep forgetting to make time for. I will run this morning once I get to the office.
OK, I've uploaded new files and invalidated the cache. I've clicked on some high-numbered, odd PMIDs since and everything seems OK, so I'm going to close this until there are any further reports.
I've noticed this a couple of times myself... links are constructed directly from the PMIDs, so I'm a little confused about what's going on. Will check on the files, but maybe @ritagonmar you know whether PMIDs ever change or anything?