kwchurch opened this issue 2 years ago
Hello,
The installation instructions in the README do have `pandas` and `jsonlines`:
```bash
git clone https://github.com/allenai/scidocs.git
cd scidocs
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install
```
I am sorry these are a bit involved. Did you go through these and find that `pandas` and `jsonlines` were still not installed afterwards?
I just tried on a fresh environment and it seems to install properly on a Linux machine. That is, after installation, I can do `import scidocs` without errors.
Hello @kwchurch,
Just checking in - did the above work for you?
I was trying to avoid conda for reasons that aren't worth going into.
It eventually worked for me, though it should be easy to distribute the data I wanted with fewer dependencies.
I am concerned about this benchmark. I think it will make an old version of specter look better than it is, and that improvements you make to that system will look bad no matter how good they are.
I'd be interested in hearing more details about your concerns. Why would an old version of SPECTER have the best performance, and why would improvements look bad no matter how good they are? This paper beats SPECTER on SciDocs by quite a bit: https://arxiv.org/abs/2202.06671

[image: results table] https://user-images.githubusercontent.com/1874668/188200798-d4da4f29-a229-4038-a3d8-a91c74b62995.png
I don't doubt that people will be able to find some improvements that will show up positively on this test, but here is an example of my concern. Consider this query from scidocs (under the recomm subdirectory):
```
awk 'NR == 10' /work/k.church/githubs/scidocs/data/recomm/train.csv
2c1a34a1cdb565b2a8d31cc623db619f48e7d333,0fba42da6fc228d29519627513ca5353bf12209b,"7160953bc6f5177dba390f594584002652643c2c,ea8f7f04bbde043ce53deb17cab45a34027c2261,b753469dab0b58d23c01999317cdb537358af17d,0fba42da6fc228d29519627513ca5353bf12209b,2170a0a1030079e1af958ab492e793a1fe5ce6ac,4f310091227450881028cf14fd79a74170dc6860,b525dc660a88224d97f4ee0a9b469bd4c71fdff6,67b70f6bd528f0468e610ca05d5aa62fdd3a2706,ec821df87e9b94f65ff0b0f70533054c8af69125,f1140aace7d9c6722f8961c670b23c9b845f0851"
```
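In case it helps, here is a minimal sketch (my own code, not from the repo) of how I read rows like that one. The column semantics are my assumption: query paper id, clicked paper id, and a quoted, comma-separated pool of candidate ids that includes the clicked one.

```python
# Minimal sketch for reading data/recomm/train.csv rows; column meaning is assumed.
import csv

def read_recomm_rows(path):
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 3:
                continue  # skip anything that isn't query,clicked,"candidate pool"
            query_id, clicked_id, candidates = row
            yield {"query": query_id, "clicked": clicked_id, "pool": candidates.split(",")}

for example in read_recomm_rows("data/recomm/train.csv"):
    print(example["query"], example["clicked"], len(example["pool"]))
    break  # just show the first row
```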
BTW, have you looked at the similar articles under https://pubmed.ncbi.nlm.nih.gov/12713677/? There is a tendency in Silicon Valley to move fast and break things, but the government (TREC/NIST/NLM) doesn't work that way, especially when it comes close to medicine. That system is pretty old (it predates deep nets), but it was done well by some pretty good people: https://www.semanticscholar.org/paper/Exploring-the-effectiveness-of-related-article-in-Lin-DiCuccio/56c51fdb49a79c2d6d470dc53f9595ec3382e906. I don't think NLM would release something like that without a bunch of testing.
I wanted to test a few systems, so I did the following.
For a bunch of examples like the one above, I generated queries with 10 candidates in random order. The candidates include the gold standard from above (clicked) as well as some of the other choices, plus some choices from other sources. Then I graded each of the 10 choices with a rating between perfect and poor (using a 5-point system that we used for grading Bing search).
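Each packet was assembled roughly like the hypothetical sketch below (the function name and the exact mixing policy are mine, just for illustration): the clicked paper, some other candidates from the training row, and some candidates from other systems, shuffled so the judge cannot tell where each one came from.

```python
# Hypothetical sketch of assembling one judging packet (not my exact script).
import random

def make_packet(query_id, clicked_id, training_pool, other_system_ids, n_total=10, seed=0):
    candidates = [clicked_id]                                    # the gold standard is always included
    candidates += [c for c in training_pool if c != clicked_id]  # other choices from train.csv
    candidates += other_system_ids                               # choices from other sources
    candidates = list(dict.fromkeys(candidates))[:n_total]       # dedupe and cap at 10
    random.Random(seed).shuffle(candidates)                      # random order hides each candidate's source
    return {"query": query_id, "candidates": candidates}

# ids here are taken from the decoder list further down
packet = make_packet("29207353", "35098069",
                     ["19487537", "206054027", "41073257"],
                     ["86738494", "20782969", "231575738"])
```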
I wanted to do a bunch more judging like this, but I got distracted and didn't do that much. However, I came away with the impression that I needed to find a way to address a few concerns, which I describe below.
I will get back to testing, but here is a small experiment that I did based on lines like the one above. The hash codes can be turned into web links such as https://www.semanticscholar.org/paper/Nicotine-gum-induced-atrial-fibrillation.-Choragudi-Aronow/aa2a376a6db2f4f5744712624399c412f3c8d04a. If you replace the hash code in this URL with any of the hash codes above, you will get a link to the paper in the benchmark.
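Concretely, a tiny helper like the one below (my own, not part of scidocs) does that substitution; the short URL form without the title slug, as in the link given to the judges below, points to the same paper.

```python
# Turn a scidocs paper hash into a Semantic Scholar link (short URL form).
def s2_url(paper_hash):
    return f"https://www.semanticscholar.org/paper/{paper_hash}"

# e.g. the query and clicked papers from the train.csv line above
print(s2_url("2c1a34a1cdb565b2a8d31cc623db619f48e7d333"))
print(s2_url("0fba42da6fc228d29519627513ca5353bf12209b"))
```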
Here is what I gave the judges:

```
010 Query: 29207353 19: Nicotine gum-induced atrial fibrillation. https://www.semanticscholar.org/paper/aa2a376a6db2f4f5744712624399c412f3c8d04a
```
These are my grades for the 10 candidates above. I'm using a system like the one we used at Bing to evaluate web search scrapes. A scrape is a set of query/URL pairs, and judges grade them on a 5-way scale from poor to perfect. After grading some of these, I am concerned about annotator qualifications; I don't know that people without medical training can do this task. But I think that a match to this query should talk about both nicotine gum and atrial fibrillation. Choice 8 does that. The clicked choice (9) does not.
```
010 2 poor
010 3 poor (missing abstract)
010 4 poor (missing abstract)
010 5 good (missing abstract)
010 6 poor
010 7 good
010 8 excellent (PubMed publishes lists of similar articles; https://pubmed.ncbi.nlm.nih.gov/3945010/)
010 9 poor
010 10 poor
```
Here is the magic key decoder. The first doc is always the query. The clicked choice is always there, but could be anywhere else in the list. 0 means that the candidate came from the training file. 1 means that the candidate came from a system that I call specter (it is basically a nearest neighbor search based on SPECTER embeddings from a bulk download; a rough sketch follows the list). 2 means that the candidate came from a proposed system that I am working on.
```
29207353 query
86738494 2
231575738 2
19487537 0
206054027 0
20782969 1
662029 2
60441507 2
5123230 1
35098069 clicked
41073257 0
```
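Here is roughly what I mean by the "1" system, sketched under some assumptions (the file names, shapes, and brute-force search are mine for illustration; the real pipeline may differ):

```python
# Rough sketch of the "1" system: brute-force nearest neighbors over SPECTER
# embeddings from a bulk download. File names and format are hypothetical.
import numpy as np

embeddings = np.load("specter_embeddings.npy")            # assumed shape: (n_papers, 768)
paper_ids = np.load("paper_ids.npy", allow_pickle=True)   # assumed parallel array of paper hashes
id_to_row = {pid: i for i, pid in enumerate(paper_ids)}

# normalize once so dot products are cosine similarities
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def nearest_neighbors(query_hash, k=10):
    q = unit[id_to_row[query_hash]]
    scores = unit @ q
    order = np.argsort(-scores)
    return [(paper_ids[i], float(scores[i])) for i in order[1:k + 1]]  # skip the query itself

for pid, score in nearest_neighbors("2c1a34a1cdb565b2a8d31cc623db619f48e7d333"):
    print(pid, round(score, 3))
```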
The larger concern is how to come up with scores for documents that have not been judged before. Bing was rich, so they could afford to pay judges to judge all candidates (more or less). So, if a system of interest (say one that is in the pipeline to be released) suggested a candidate that had not been judged, it would be sent to the judges and we would wait for them to do what they do. TREC is not that rich. They pay judges to judge candidates from systems during the competition, but after that, they assume that candidates that have not been judged are bad. However, they judge most candidates from many systems for a small number of queries, at least for a while, so it is pretty likely that they have seen most of the good matches. They judge much more than just one clicked answer for each query.
My concern is that most reasonable improvements over SPECTER will find candidates that are better than the ones you have judged. And I don't know that any judge can do this task, because you need to know a lot about a lot of things (such as medicine). This is very different from web queries; most queries that we saw when I was working at Bing were relatively easy to judge. You did not need to be a medical expert to know that the answer to "google" is www.google.com.
Thank you for the explanation @kwchurch.
I agree that our SciDocs positive candidates are positive in a very specific sense. To mitigate this we included 4 senses (coviewed, coread, cited, cocited). The fact that the negative ones are random should make this a relatively easy benchmark. With few exceptions, a good model should be able to take any of the positives and rank them above the random negatives for any of the senses we chose. This is of course subject to noise in browsing behavior and citation extraction, both of which are real.
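To make that concrete, here is a toy illustration (not the repo's evaluation code, which goes through pytrec_eval) of what "rank the positive above the random negatives" means for one query under one sense, given any embedding function:

```python
# Toy illustration of the per-query task: the positive candidate should score
# higher than every random negative. embed() is a stand-in for any model.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def positive_ranked_first(embed, query_id, positive_id, negative_ids):
    q = embed(query_id)
    pos = cosine(q, embed(positive_id))
    return all(pos > cosine(q, embed(n)) for n in negative_ids)

# dummy embedder (random vectors) just to show the call shape
rng = np.random.default_rng(0)
vectors = {pid: rng.normal(size=768) for pid in ["q", "pos", "n1", "n2", "n3"]}
print(positive_ranked_first(vectors.get, "q", "pos", ["n1", "n2", "n3"]))
```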
When I did an analysis of errors that SPECTER made, it looked like it hadn't gotten SciDocs perfect yet, so the ceiling of this particular easy benchmark wasn't reached. I think there is room to grow even in this toy sandbox, as subsequent work showed, as well as our own recent work that is yet to be published.
All that said, you are right that to grow substantially beyond SPECTER & friends, new systems will be able to find better candidates than we can provide. This is an interesting limitation.
We are working on a bigger, more realistic benchmark to replace SciDocs. Hope to put it on arXiv soon!
Can you check the installation instructions? I know there are a lot of dependencies on other people's code, but it would be nice to do what we can to make sure the install is as easy as possible.
I think there are a few packages missing from the installation instructions:

```bash
pip install pandas
pip install jsonlines
```

Also, the latest versions of jsonnet and sklearn-contrib-lightning didn't load for me. I used:

```bash
pip install jsonnet==0.16.0 sklearn-contrib-lightning==0.6.2
```