kwchurch opened this issue 2 years ago
Hello,
The installation instructions in the README do have `pandas` and `jsonlines`:
```bash
git clone https://github.com/allenai/scidocs.git
cd scidocs
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install
```
I am sorry these are a bit involved. Did you go through these and find that `pandas` and `jsonlines` were still not installed afterwards?
I just tried on a fresh environment and it seems to install properly on a Linux machine. That is, after installation, I can do `import scidocs` without errors.
Hello @kwchurch,
Just checking in - did the above work for you?
I was trying to avoid conda for reasons that aren't worth going into.
It eventually worked for me, though it should be easy to distribute the data I wanted with fewer dependencies.
I am concerned about this benchmark. I think it will make an old version of specter look better than it is, and that improvements you make to that system will look bad no matter how good they are.
I'd be interested in hearing more details about your concerns. Why would an old version of SPECTER have the best performance, and why would improvements look bad no matter how good they are? This paper beats SPECTER on SciDocs by quite a bit: https://arxiv.org/abs/2202.06671

[image: results table] https://user-images.githubusercontent.com/1874668/188200798-d4da4f29-a229-4038-a3d8-a91c74b62995.png
I don't doubt that people will be able to find some improvements that will show up positively on this test, but here is an example of my concern. Consider this query from scidocs (under the recomm subdirectory):
```
awk 'NR == 10' /work/k.church/githubs/scidocs/data/recomm/train.csv
2c1a34a1cdb565b2a8d31cc623db619f48e7d333,0fba42da6fc228d29519627513ca5353bf12209b,"7160953bc6f5177dba390f594584002652643c2c,ea8f7f04bbde043ce53deb17cab45a34027c2261,b753469dab0b58d23c01999317cdb537358af17d,0fba42da6fc228d29519627513ca5353bf12209b,2170a0a1030079e1af958ab492e793a1fe5ce6ac,4f310091227450881028cf14fd79a74170dc6860,b525dc660a88224d97f4ee0a9b469bd4c71fdff6,67b70f6bd528f0468e610ca05d5aa62fdd3a2706,ec821df87e9b94f65ff0b0f70533054c8af69125,f1140aace7d9c6722f8961c670b23c9b845f0851"
```
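In case it helps, here is a minimal sketch (my own code, not from the repo) of how I read rows like that one. The column semantics are my assumption: query paper id, clicked paper id, and a quoted, comma-separated pool of candidate ids that includes the clicked one.

```python
# Minimal sketch for reading data/recomm/train.csv rows; column meaning is assumed.
import csv

def read_recomm_rows(path):
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 3:
                continue  # skip anything that isn't query,clicked,"candidate pool"
            query_id, clicked_id, candidates = row
            yield {"query": query_id, "clicked": clicked_id, "pool": candidates.split(",")}

for example in read_recomm_rows("data/recomm/train.csv"):
    print(example["query"], example["clicked"], len(example["pool"]))
    break  # just show the first row
```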
BTW, have you looked at the similar articles under https://pubmed.ncbi.nlm.nih.gov/12713677/? There is a tendency in Silicon Valley to move fast and break things, but the government (TREC/NIST/NLM) doesn't work that way, especially when it comes close to medicine. That system is pretty old (it predates deep nets), but it was done well by some pretty good people: https://www.semanticscholar.org/paper/Exploring-the-effectiveness-of-related-article-in-Lin-DiCuccio/56c51fdb49a79c2d6d470dc53f9595ec3382e906. I don't think NLM would release something like that without a bunch of testing.
I wanted to test a few systems, so I did the following.
For a bunch of examples like the one above, I generated queries with 10 candidates in random order. The candidates include the gold standard from above (clicked) as well as some of the other choices, plus some choices from other sources. Then I graded each of the 10 choices with a rating between perfect and poor (using a 5-point system that we used for grading Bing search).
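Each packet was assembled roughly like the hypothetical sketch below (the function name and the exact mixing policy are mine, just for illustration): the clicked paper, some other candidates from the training row, and some candidates from other systems, shuffled so the judge cannot tell where each one came from.

```python
# Hypothetical sketch of assembling one judging packet (not my exact script).
import random

def make_packet(query_id, clicked_id, training_pool, other_system_ids, n_total=10, seed=0):
    candidates = [clicked_id]                                    # the gold standard is always included
    candidates += [c for c in training_pool if c != clicked_id]  # other choices from train.csv
    candidates += other_system_ids                               # choices from other sources
    candidates = list(dict.fromkeys(candidates))[:n_total]       # dedupe and cap at 10
    random.Random(seed).shuffle(candidates)                      # random order hides each candidate's source
    return {"query": query_id, "candidates": candidates}

# ids here are taken from the decoder list further down
packet = make_packet("29207353", "35098069",
                     ["19487537", "206054027", "41073257"],
                     ["86738494", "20782969", "231575738"])
```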
I wanted to do a bunch more judging like this, but I got distracted and didn't do that much. However, I came away with the impression that I needed to find a way to address a few concerns, which I describe below.
I will get back to testing, but here is a small experiment that I did based on lines like the one above. The hash codes can be turned into web links such as https://www.semanticscholar.org/paper/Nicotine-gum-induced-atrial-fibrillation.-Choragudi-Aronow/aa2a376a6db2f4f5744712624399c412f3c8d04a. If you replace the hash code in this URL with any of the hash codes above, you will get a link to the paper in the benchmark.
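Concretely, a tiny helper like the one below (my own, not part of scidocs) does that substitution; the short URL form without the title slug, as in the link given to the judges below, points to the same paper.

```python
# Turn a scidocs paper hash into a Semantic Scholar link (short URL form).
def s2_url(paper_hash):
    return f"https://www.semanticscholar.org/paper/{paper_hash}"

# e.g. the query and clicked papers from the train.csv line above
print(s2_url("2c1a34a1cdb565b2a8d31cc623db619f48e7d333"))
print(s2_url("0fba42da6fc228d29519627513ca5353bf12209b"))
```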
Here is what I gave the judges:

```
010 Query: 29207353 19: Nicotine gum-induced atrial fibrillation. https://www.semanticscholar.org/paper/aa2a376a6db2f4f5744712624399c412f3c8d04a
```
These are my grades for the 10 candidates above. I'm using a system like the one we used at Bing to evaluate web search scrapes. A scrape is a set of query/URL pairs, and judges grade them on a 5-way scale from poor to perfect. After grading some of these, I am concerned about annotator qualifications; I don't know that people without medical training can do this task. But I think that a match to this query should talk about both nicotine gum and atrial fibrillation. Choice 8 does that. The clicked choice (9) does not.
```
010 2 poor
010 3 poor (missing abstract)
010 4 poor (missing abstract)
010 5 good (missing abstract)
010 6 poor
010 7 good
010 8 excellent (PubMed publishes lists of similar articles; https://pubmed.ncbi.nlm.nih.gov/3945010/)
010 9 poor
010 10 poor
```
Here is the magic key decoder. The first doc is always the query. The clicked choice is always there, but could be anywhere else in the list. 0 means that the candidate came from the training file. 1 means that the candidate came from a system that I call specter (it is basically a nearest neighbor search based on SPECTER embeddings from a bulk download; a rough sketch follows the list). 2 means that the candidate came from a proposed system that I am working on.
```
29207353 query
86738494 2
231575738 2
19487537 0
206054027 0
20782969 1
662029 2
60441507 2
5123230 1
35098069 clicked
41073257 0
```
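Here is roughly what I mean by the "1" system, sketched under some assumptions (the file names, shapes, and brute-force search are mine for illustration; the real pipeline may differ):

```python
# Rough sketch of the "1" system: brute-force nearest neighbors over SPECTER
# embeddings from a bulk download. File names and format are hypothetical.
import numpy as np

embeddings = np.load("specter_embeddings.npy")            # assumed shape: (n_papers, 768)
paper_ids = np.load("paper_ids.npy", allow_pickle=True)   # assumed parallel array of paper hashes
id_to_row = {pid: i for i, pid in enumerate(paper_ids)}

# normalize once so dot products are cosine similarities
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def nearest_neighbors(query_hash, k=10):
    q = unit[id_to_row[query_hash]]
    scores = unit @ q
    order = np.argsort(-scores)
    return [(paper_ids[i], float(scores[i])) for i in order[1:k + 1]]  # skip the query itself

for pid, score in nearest_neighbors("2c1a34a1cdb565b2a8d31cc623db619f48e7d333"):
    print(pid, round(score, 3))
```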
The larger concern is how to come up with scores for documents that have not been judged before. Bing was rich, so they could afford to pay judges to judge all candidates (more or less). So, if a system of interest (say one that is in the pipeline to be released) suggested a candidate that had not been judged, it would be sent to the judges and we would wait for them to do what they do. TREC is not that rich. They pay judges to judge candidates from systems during the competition, but after that, they assume that candidates that have not been judged are bad. However, they judge most candidates from many systems for a small number of queries, at least for a while, so it is pretty likely that they have seen most of the good matches. They judge much more than just one clicked answer for each query.
My concern is that most reasonable improvements over SPECTER will find candidates that are better than the ones you have judged. And I don't know that any judge can do this task, because you need to know a lot about a lot of things (such as medicine). This is very different from web queries; most queries that we saw when I was working at Bing were relatively easy to judge. You did not need to be a medical expert to know that the answer to "google" is www.google.com.
Thank you for the explanation @kwchurch.
I agree that our SciDocs positive candidates are positive in a very specific sense. To mitigate this we included 4 senses (coviewed, coread, cited, cocited). The fact that the negative ones are random should make this a relatively easy benchmark. With few exceptions, a good model should be able to take any of the positives and rank them above the random negatives for any of the senses we chose. This is of course subject to noise in browsing behavior and citation extraction, both of which are real.
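To make that concrete, here is a toy illustration (not the repo's evaluation code, which goes through pytrec_eval) of what "rank the positive above the random negatives" means for one query under one sense, given any embedding function:

```python
# Toy illustration of the per-query task: the positive candidate should score
# higher than every random negative. embed() is a stand-in for any model.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def positive_ranked_first(embed, query_id, positive_id, negative_ids):
    q = embed(query_id)
    pos = cosine(q, embed(positive_id))
    return all(pos > cosine(q, embed(n)) for n in negative_ids)

# dummy embedder (random vectors) just to show the call shape
rng = np.random.default_rng(0)
vectors = {pid: rng.normal(size=768) for pid in ["q", "pos", "n1", "n2", "n3"]}
print(positive_ranked_first(vectors.get, "q", "pos", ["n1", "n2", "n3"]))
```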
When I did an analysis of errors that SPECTER made, it looked like it hadn't gotten SciDocs perfect yet, so the ceiling of this particular easy benchmark wasn't reached. I think there is room to grow even in this toy sandbox, as subsequent work showed, as well as our own recent work that is yet to be published.
All that said, you are right that to grow substantially beyond SPECTER & friends, new systems will be able to find better candidates than we can provide. This is an interesting limitation.
We are working on a bigger, more realistic benchmark to replace SciDocs. Hope to put it on arXiv soon!
Can you check the installation instructions? I know there are a lot of dependencies on other people's code, but it would be nice to do what we can to make sure the install is as easy as possible.
I think there are a few packages missing from the installation instructions:

```bash
pip install pandas
pip install jsonlines
```

Also, the latest versions of jsonnet and sklearn-contrib-lightning didn't load for me. I used:

```bash
pip install jsonnet==0.16.0 sklearn-contrib-lightning==0.6.2
```