BRCAChallenge / literature-search

Crawl and annotate pubmed articles with variants
https://brcaexchange.org/

Add basic unit test #18

Closed rcurrie closed 5 years ago

rcurrie commented 5 years ago

Although matching is a very fuzzy process, add a pytest unit test for a handful of papers/variants.
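
A minimal sketch of what such a test might look like, assuming the crawl output can be loaded as a mapping of PMID to matched variants. The names `MENTIONS`, `EXPECTED`, and `missing_pairs` are hypothetical and not part of the repository; the sample values are borrowed from the HGMD table in the crawl report further down this thread.

```python
# Hypothetical stand-in for the crawl output; a real test would load the
# exported results instead. Assumed shape: {pmid: {variant: points}}.
MENTIONS = {
    "21990134": {"chr13:g.32363367:C>G": 40},
    "12145750": {"chr13:g.32363367:C>G": 10},
}

# Hand-picked (pmid, variant) pairs that the crawl is expected to find.
EXPECTED = [
    ("21990134", "chr13:g.32363367:C>G"),
    ("12145750", "chr13:g.32363367:C>G"),
]


def missing_pairs(mentions, expected):
    """Return the expected (pmid, variant) pairs absent from the crawl output."""
    return [
        (pmid, variant)
        for pmid, variant in expected
        if variant not in mentions.get(pmid, {})
    ]


def test_known_pairs_are_found():
    # pytest collects functions named test_* and reports failing asserts.
    assert missing_pairs(MENTIONS, EXPECTED) == []
```

Because the matching is fuzzy, a real test might tolerate a small number of misses rather than require an empty list.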

rcurrie commented 5 years ago

The stats Jupyter notebook now runs automatically at the end of the dockerized crawl (or if you call with docker run stats) and outputs a much more comprehensive view of the results:

[[literature-search-crawl-report]]
= Literature Search Crawl Report

[[counts]]
== Counts

Stats for crawl on 2019-05-25T20:23:42 (in /crawl/pubs-date.txt)
Exported 3973 Papers and 10991 Variants
272 papers added since baseline on 2019-02-27T16:26:37-0800 of 3701 papers and 9886 variants

[[download]]
== Download

Attempted to download 17822 papers

Succeeded in downloading 14803 (83%)

Failure reasons:
invalidPdf             1015
noCrawlerSuccess        909
httpError               323
noLicense               313
HighwirePdfNotValid     262
ovidMetaParseFailed      90
invalidHostname          56
pageErrorMessage         33
no_meta                   9
noOutlinkOrDoi            6
noSpringerLicense         5
tooManySupplFiles         1
HtmlParseError            1
BeautifulSoupError        1
Name: status, dtype: int64
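
The download figures above can be cross-checked with a short sketch. The notebook's output format suggests it tallies statuses with pandas; the standard-library `Counter` below is a stand-in chosen to keep the example dependency-free, and the variable names are illustrative.

```python
from collections import Counter

# Figures copied from the report above.
attempted = 17822
succeeded = 14803

failures = Counter({
    "invalidPdf": 1015,
    "noCrawlerSuccess": 909,
    "httpError": 323,
    "noLicense": 313,
    "HighwirePdfNotValid": 262,
    "ovidMetaParseFailed": 90,
    "invalidHostname": 56,
    "pageErrorMessage": 33,
    "no_meta": 9,
    "noOutlinkOrDoi": 6,
    "noSpringerLicense": 5,
    "tooManySupplFiles": 1,
    "HtmlParseError": 1,
    "BeautifulSoupError": 1,
})

success_rate = succeeded / attempted
print(f"Succeeded in downloading {succeeded} ({success_rate:.0%})")

# Total of the listed failure reasons versus attempted minus succeeded.
listed_failures = sum(failures.values())
unaccounted = listed_failures - (attempted - succeeded)
print(f"Listed failures: {listed_failures}, difference: {unaccounted}")

# The three most frequent failure reasons and their share of listed failures.
for reason, count in failures.most_common(3):
    print(f"{reason:<22}{count:>6}  {count / listed_failures:.1%}")
```

The listed reasons sum to 3024, five more than 17822 - 14803 = 3019; the report does not explain the difference.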

[[find]]
== Find

13850 Papers didn't yield any variants
Tried to match against 24663 variants in BRCA Exchange
Found 451733 total mentions
Successfully matched 193445 mentions

Match points distribution:
mean     5.370343
std      3.328845
min      1.000000
25%      3.000000
50%      5.000000
75%      7.000000
max     50.000000
Name: points, dtype: float64

[[founder-mutations]]
== Founder Mutations

All founder mutation papers successfully downloaded

5382insC (chr17:g.43057062:T>TG) failed to match {'28122244'}

[[hgmd-for-chr13g.32363367cg]]
== HGMD for chr13:g.32363367:C>G

HGMD matching failures {'29394989', '29446198', '19043619'}
HGMD ranking and points:
pmid            points  #snips  HGMD
22962691        40      3       -
21990134        40      3       4
24323938        40      3       6
25146914        40      3       7
20215541        30      3       -
18424508        30      3       -
30039884        20      3       -
21990165        20      2       -
23586058        20      2       -
20690207        20      2       -
20522429        20      3       -
20507642        20      2       -
19471317        20      2       -
18607349        20      3       3
18451181        20      3       -
18273839        20      3       -
26913838        20      2       -
27060066        20      2       -
28324225        20      3       9
15026808        20      2       -
29884841        20      3       -
29988080        20      2       -
23108138        20      3       5
28339459        15      3       8
25003164        10      1       -
24817641        10      1       -
24122022        10      1       -
10464631        10      1       -
22753008        10      1       -
21735045        10      1       -
21638052        10      1       -
12145750        10      3       1
21309043        10      1       -
17924331        10      1       -
17899372        10      1       -
16792514        10      1       -
15744044        10      1       -
12845657        10      1       -
21344236        10      1       -
19332451        2       3       -
16825284        2       2       -
12915465        2       2       -

[[lovd]]
== LOVD

175 pmids and 1044 variants in normalized LOVD truth set

LOVD papers that we did not try and download:
18415037, 09585599, 08751436, 09497265, 09333265, 08531967, 08896551, 07939630, 09971877, 08942979, 09523200, 12900794, 09126734, 2010, 19150617, 09805131, 07545954

LOVD papers that we tried and failed to download:
17279547 (invalidPdf), 10506595 (HighwirePdfNotValid), 15533909 (HighwirePdfNotValid), 18680205 (invalidPdf), 17305420 (httpError), 12552570 (invalidPdf), 18493658 (noCrawlerSuccess), 16211554 (invalidPdf), 16969499 (noCrawlerSuccess), 12815598 (invalidPdf), 18693280 (invalidPdf), 12601471 (httpError), 11385711 (invalidPdf), 20020529 (invalidPdf), 16528612 (invalidPdf), 16619214 (invalidPdf), 20513136 (invalidPdf), 20054658 (invalidPdf), 16786532 (invalidPdf), 12955719 (invalidPdf), 10406662 (invalidPdf), 15300854 (invalidPdf), 12955716 (invalidPdf), 17657584 (invalidPdf), 18375895 (HighwirePdfNotValid), 19287957 (noCrawlerSuccess)

125 common pmids between LOVD and this crawl

Baseline confusion matrix against LOVD:
1138    2226
658     -1
Precision: 33.8% Recall: 63.4%

Current crawl confusion matrix against LOVD:
1137    2448
659     -1
Precision: 31.7% Recall: 63.3%
Done.
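
The precision and recall figures can be reproduced from the two matrices with the sketch below. It assumes the layout is [[TP, FP], [FN, TN]] with -1 as a placeholder for the uncounted true negatives; the report does not state this, but the reading is consistent with the percentages it prints. The function name `precision_recall` is illustrative.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of reported matches that are in LOVD
    recall = tp / (tp + fn)     # share of LOVD entries that the crawl found
    return precision, recall


# Counts read from the matrices above under the assumed layout.
runs = {
    "Baseline": precision_recall(tp=1138, fp=2226, fn=658),
    "Current": precision_recall(tp=1137, fp=2448, fn=659),
}

for label, (precision, recall) in runs.items():
    print(f"{label}: Precision: {precision:.1%} Recall: {recall:.1%}")
# Baseline: Precision: 33.8% Recall: 63.4%
# Current: Precision: 31.7% Recall: 63.3%
```

Under this reading, recall is essentially unchanged between the two crawls, while the additional 222 false positives account for the drop in precision.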