Closed by rcurrie 5 years ago
The stats Jupyter notebook now runs automatically at the end of the dockerized crawl (or when you call it with docker run stats) and outputs a much more comprehensive view of the results:
[[literature-search-crawl-report]]
= Literature Search Crawl Report
[[counts]]
== Counts
Stats for crawl on 2019-05-25T20:23:42 (in /crawl/pubs-date.txt)
Exported 3973 Papers and 10991 Variants
272 papers added since baseline on 2019-02-27T16:26:37-0800 of 3701 papers and 9886 variants
[[download]]
== Download
Attempted to download 17822 papers
Succeeded in downloading 14803 (83%)
Failure reasons:
invalidPdf 1015
noCrawlerSuccess 909
httpError 323
noLicense 313
HighwirePdfNotValid 262
ovidMetaParseFailed 90
invalidHostname 56
pageErrorMessage 33
no_meta 9
noOutlinkOrDoi 6
noSpringerLicense 5
tooManySupplFiles 1
HtmlParseError 1
BeautifulSoupError 1
Name: status, dtype: int64
[[find]]
== Find
13850 Papers didn't yield any variants
Tried to match against 24663 variants in BRCA Exchange
Found 451733 total mentions
Successfully matched 193445 mentions
Match points distribution:
mean 5.370343
std 3.328845
min 1.000000
25% 3.000000
50% 5.000000
75% 7.000000
max 50.000000
Name: points, dtype: float64
[[founder-mutations]]
== Founder Mutations
All founder mutation papers successfully downloaded
5382insC (chr17:g.43057062:T>TG) failed to match {'28122244'}
[[hgmd-for-chr13g.32363367cg]]
== HGMD for chr13:g.32363367:C>G
HGMD matching failures {'29394989', '29446198', '19043619'}
HGMD ranking and points:
pmid points #snips HGMD
22962691 40 3 -
21990134 40 3 4
24323938 40 3 6
25146914 40 3 7
20215541 30 3 -
18424508 30 3 -
30039884 20 3 -
21990165 20 2 -
23586058 20 2 -
20690207 20 2 -
20522429 20 3 -
20507642 20 2 -
19471317 20 2 -
18607349 20 3 3
18451181 20 3 -
18273839 20 3 -
26913838 20 2 -
27060066 20 2 -
28324225 20 3 9
15026808 20 2 -
29884841 20 3 -
29988080 20 2 -
23108138 20 3 5
28339459 15 3 8
25003164 10 1 -
24817641 10 1 -
24122022 10 1 -
10464631 10 1 -
22753008 10 1 -
21735045 10 1 -
21638052 10 1 -
12145750 10 3 1
21309043 10 1 -
17924331 10 1 -
17899372 10 1 -
16792514 10 1 -
15744044 10 1 -
12845657 10 1 -
21344236 10 1 -
19332451 2 3 -
16825284 2 2 -
12915465 2 2 -
[[lovd]]
== LOVD
175 pmids and 1044 variants in normalized LOVD truth set
LOVD papers that we did not try to download:
18415037, 09585599, 08751436, 09497265, 09333265, 08531967, 08896551, 07939630, 09971877, 08942979, 09523200, 12900794, 09126734, 2010, 19150617, 09805131, 07545954
LOVD papers that we tried and failed to download:
17279547 (invalidPdf), 10506595 (HighwirePdfNotValid), 15533909 (HighwirePdfNotValid), 18680205 (invalidPdf), 17305420 (httpError), 12552570 (invalidPdf), 18493658 (noCrawlerSuccess), 16211554 (invalidPdf), 16969499 (noCrawlerSuccess), 12815598 (invalidPdf), 18693280 (invalidPdf), 12601471 (httpError), 11385711 (invalidPdf), 20020529 (invalidPdf), 16528612 (invalidPdf), 16619214 (invalidPdf), 20513136 (invalidPdf), 20054658 (invalidPdf), 16786532 (invalidPdf), 12955719 (invalidPdf), 10406662 (invalidPdf), 15300854 (invalidPdf), 12955716 (invalidPdf), 17657584 (invalidPdf), 18375895 (HighwirePdfNotValid), 19287957 (noCrawlerSuccess)
125 common pmids between LOVD and this crawl
Baseline confusion matrix against LOVD:
1138 2226
658 -1
Precision: 33.8% Recall: 63.4%
Current crawl confusion matrix against LOVD:
1137 2448
659 -1
Precision: 31.7% Recall: 63.3%
Done.
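The precision and recall figures in the LOVD section can be reproduced from the two confusion matrices. The sketch below assumes the matrices are laid out as `[[TP, FP], [FN, TN]]`, with the `-1` standing in for a true-negative count that is not computed; that layout is inferred from the printed percentages rather than stated in the report, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) as fractions in [0, 1]."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Assumed layout: [[TP, FP], [FN, TN]]; TN is reported as -1 and ignored here.
baseline = [[1138, 2226], [658, -1]]
current = [[1137, 2448], [659, -1]]

for label, matrix in (("Baseline", baseline), ("Current", current)):
    p, r = precision_recall(matrix[0][0], matrix[0][1], matrix[1][0])
    print(f"{label}: Precision: {p:.1%} Recall: {r:.1%}")
```

Under that assumption the numbers agree with the report: recall is essentially unchanged between baseline and current crawl, while precision drops because the false-positive count grows from 2226 to 2448.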
Although this is a very fuzzy process, add a pytest unit test for a handful of papers/variants.
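One possible shape for such a test is sketched below. It pins a few (variant, PMID) pairs that the report already shows as matched for chr13:g.32363367:C>G and fails if any of them disappears. `CRAWL_MATCHES`, `EXPECTED_PAIRS`, and `missing_pairs` are hypothetical names, and the in-memory dictionary is only a stand-in: a real test would load the crawl's exported results instead.

```python
# Hypothetical stand-in for the crawl export: variant -> PMIDs that mention it.
# A real test would read this from the crawl's output rather than hard-code it.
CRAWL_MATCHES = {
    "chr13:g.32363367:C>G": {"22962691", "21990134", "24323938", "25146914"},
}

# Hand-picked pairs expected to keep matching, taken from the HGMD table above.
EXPECTED_PAIRS = [
    ("chr13:g.32363367:C>G", "21990134"),
    ("chr13:g.32363367:C>G", "24323938"),
    ("chr13:g.32363367:C>G", "25146914"),
]


def missing_pairs(matches, expected):
    """Return the expected (variant, pmid) pairs that are absent from matches."""
    return [(v, p) for v, p in expected if p not in matches.get(v, set())]


def test_known_pairs_are_matched():
    # pytest collects this function by its test_ prefix; no import is needed.
    assert missing_pairs(CRAWL_MATCHES, EXPECTED_PAIRS) == []
```

Because matching is fuzzy, a short list of well-understood pairs is likely to be more stable than asserting on aggregate counts. Known failures, such as 5382insC against PMID 28122244, could be tracked separately until they are fixed.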