BRCAChallenge / brca-exchange

Overall management and deployment of the BRCA Exchange web portal and pipeline scripts
http://brcaexchange.org

Literature Search Crawl 2019-08-26 #1104

Open rcurrie opened 5 years ago

rcurrie commented 5 years ago

Literature search results (JSON) and markdown stats as of 2019-08-26 are available. 50 new papers were added and 269 additional variants were matched since the last crawl on 2019-07-03, which is detailed in issue #1078.

@letitiaismyname and @diekhans take a look at the stats if you'd like; they look stable compared with the last run.

@zfisch push when convenient

Literature Search Crawl Report

Counts

Stats for crawl on 2019-08-26T23:33:38 (in /crawl/pubs-date.txt)
Exported 4051 Papers and 11647 Variants
350 papers added since baseline on 2019-02-27T16:26:37-0800 of 3701 papers and 9886 variants

Download

Attempted to download 18238 papers

Succeeded in downloading 15123 (83%)

Failure reasons:
invalidPdf             1040
noCrawlerSuccess        952
httpError               330
noLicense               328
HighwirePdfNotValid     267
ovidMetaParseFailed      90
invalidHostname          56
pageErrorMessage         35
no_meta                  10
noSpringerLicense         8
noOutlinkOrDoi            6
tooManySupplFiles         1
BeautifulSoupError        1
HtmlParseError            1
Name: status, dtype: int64
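As a quick check on the download figures above, the following minimal sketch recomputes the success rate and the share of the most common failure reasons. The counts are copied from this report; the helper name `percent` and the variable names are illustrative assumptions, not part of the crawler.

```python
# Figures copied from the Download section of this report.
attempted = 18238
succeeded = 15123

# Failure reasons as listed in the report.
failure_counts = {
    "invalidPdf": 1040,
    "noCrawlerSuccess": 952,
    "httpError": 330,
    "noLicense": 328,
    "HighwirePdfNotValid": 267,
    "ovidMetaParseFailed": 90,
    "invalidHostname": 56,
    "pageErrorMessage": 35,
    "no_meta": 10,
    "noSpringerLicense": 8,
    "noOutlinkOrDoi": 6,
    "tooManySupplFiles": 1,
    "BeautifulSoupError": 1,
    "HtmlParseError": 1,
}


def percent(part, whole):
    """Return part/whole expressed as a percentage (hypothetical helper)."""
    return 100.0 * part / whole


success_rate = percent(succeeded, attempted)
listed_failures = sum(failure_counts.values())

print(f"Success rate: {success_rate:.0f}%")

# Show the three most frequent failure reasons and their share of listed failures.
top_reasons = sorted(failure_counts.items(), key=lambda kv: -kv[1])[:3]
for reason, count in top_reasons:
    share = percent(count, listed_failures)
    print(f"{reason:<22}{count:>6}  {share:.1f}% of listed failures")
```

The first two reasons, invalidPdf and noCrawlerSuccess, account for most of the listed failures.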

Find

14189 Papers didn't yield any variants
Tried to match against 26375 variants in BRCA Exchange
Found 462262 total mentions
Successfully matched 203710 mentions

Match points distribution:
mean     5.342286
std      3.319555
min      1.000000
25%      3.000000
50%      5.000000
75%      7.000000
max     50.000000
Name: points, dtype: float64

Founder Mutations

All founder mutation papers successfully downloaded

5382insC (chr17:g.43057062:T>TG) failed to match {'28122244'}

HGMD for chr13:g.32363367:C>G

HGMD matching failures {'29446198', '29394989', '19043619'}
HGMD ranking and points:
pmid      points  #snips  HGMD
22962691      40       3     -
21990134      40       3     4
24323938      40       3     6
25146914      40       3     7
20215541      30       3     -
18424508      30       3     -
30039884      20       3     -
21990165      20       2     -
23586058      20       2     -
20690207      20       2     -
20522429      20       3     -
20507642      20       2     -
19471317      20       2     -
18607349      20       3     3
18451181      20       3     -
18273839      20       3     -
26913838      20       2     -
27060066      20       2     -
28324225      20       3     9
15026808      20       2     -
29884841      20       3     -
29988080      20       2     -
23108138      20       3     5
28339459      15       3     8
25003164      10       1     -
24817641      10       1     -
24122022      10       1     -
10464631      10       1     -
22753008      10       1     -
21735045      10       1     -
21638052      10       1     -
12145750      10       3     1
21309043      10       1     -
17924331      10       1     -
17899372      10       1     -
16792514      10       1     -
15744044      10       1     -
12845657      10       1     -
21344236      10       1     -
19332451       2       3     -
16825284       2       2     -
12915465       2       2     -

LOVD

175 pmids and 1044 variants in normalized LOVD truth set

LOVD papers that we did not try to download:
07939630, 09333265, 09971877, 09523200, 09497265, 2010, 07545954, 18415037, 09585599, 08896551, 19150617, 08942979, 08751436, 09805131, 09126734, 08531967, 12900794

LOVD papers that we tried and failed to download:
17305420 (httpError), 20513136 (invalidPdf), 18493658 (noCrawlerSuccess), 16786532 (invalidPdf), 15300854 (invalidPdf), 12815598 (invalidPdf), 10406662 (invalidPdf), 16969499 (noCrawlerSuccess), 17279547 (invalidPdf), 15533909 (HighwirePdfNotValid), 12955716 (invalidPdf), 10506595 (HighwirePdfNotValid), 16619214 (invalidPdf), 17657584 (invalidPdf), 19287957 (noCrawlerSuccess), 11385711 (invalidPdf), 18680205 (invalidPdf), 16211554 (invalidPdf), 18693280 (invalidPdf), 18375895 (HighwirePdfNotValid), 20054658 (invalidPdf), 12552570 (invalidPdf), 16528612 (invalidPdf), 12955719 (invalidPdf), 20020529 (invalidPdf), 12601471 (httpError)

125 common pmids between LOVD and this crawl

Baseline confusion matrix against LOVD:
1138    2226
 658      -1
Precision: 33.8% Recall: 63.4%

Current crawl confusion matrix against LOVD:
1137    2476
 659      -1
Precision: 31.5% Recall: 63.3%
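
The precision and recall figures above can be reproduced from the two matrices. The sketch below assumes the layout is [[true positives, false positives], [false negatives, unused]], with -1 as a placeholder for true negatives. The report does not state this layout, but it is the reading that reproduces the reported percentages. The function name `precision_recall` is illustrative.

```python
def precision_recall(matrix):
    """Return (precision, recall) in percent from a 2x2 matrix.

    Assumed layout: [[TP, FP], [FN, TN]]. TN is not needed for either
    metric, which fits the -1 placeholder shown in the report.
    """
    (tp, fp), (fn, _tn) = matrix
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall


# Matrices copied from the report.
baseline = [[1138, 2226], [658, -1]]
current = [[1137, 2476], [659, -1]]

for label, matrix in (("Baseline", baseline), ("Current", current)):
    p, r = precision_recall(matrix)
    print(f"{label}: Precision: {p:.1f}% Recall: {r:.1f}%")

# Prints:
# Baseline: Precision: 33.8% Recall: 63.4%
# Current: Precision: 31.5% Recall: 63.3%
```

Under this reading, recall is essentially unchanged between the two crawls (one fewer true positive), while the drop in precision comes from 250 additional false positives.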

zfisch commented 5 years ago

Deployed to production 👌