BRCAChallenge / literature-search

Crawl and annotate pubmed articles with variants
https://brcaexchange.org/

Rank papers based on the quality of match #13

Closed rcurrie closed 5 years ago

rcurrie commented 5 years ago

The crawler now calculates a 'points' score per hit and, when no match for the parsed HGVS can be found, falls back to looking for the text phrase found in the paper among the synonyms in BRCA Exchange. Changes:

rcurrie commented 5 years ago

@amycoffin @melissacline (Samantha is not on github it appears, will email her)

Below are the top 3 papers found for a few variants based on the new ranking (I'll return all papers, ranked, in the real site). It would be useful to have you take a very cursory look to see whether the first paper is relevant and whether it's the best, or at least not significantly inferior to the next one.

chr13:g.32363367:C>G

[{'pmid': 22962691,
  'points': 40,
  'snippets': [', in the BRCA2 gene, besides the exon 7 mutations described '
               'above, only one mutation in exon 3, c.231T>G (p.Thr77Thr),16 '
               'and two mutations in exon 18,<<< c.8165C>G>>> (p.Thr2722Arg)28 '
               'and c.7992T>A (p.Ile2664Ile),16 have been reported to induce '
               'exon skipping by altering splicing regulatory elements.',
               'BRCA2<<< T2722R>>> is a deleterious allele that causes exon '
               'skipping.',
               'CA2 gene, besides the exon 7 mutations described above, only '
               'one mutation in exon 3, c.231T>G (p.Thr77Thr),16 and two '
               'mutations in exon 18, c.8165C>G <<<(p.Thr2722Arg>>>)28 and '
               'c.7992T>A (p.Ile2664Ile),16 have been reported to induce exon '
               'skipping by altering splicing regulatory elements.']},
 {'pmid': 18424508,
  'points': 30,
  'snippets': ['BRCA2<<< T2722R>>> is a deleterious allele that causes exon '
               'skipping.',
               'Of these, the exonic variant BRCA2 c.8162T→C in exon 18 '
               'affects a position only three nucleotides upstream of the '
               'mutation BRCA2 c.8165C→G (predicting<<< p.T2722R>>>), which is '
               'known to cause exon skipping by disrupting several ESE '
               'sites.20  Two variants, BRCA2 c.316+5G→C (IVS3+5G→C) and '
               'c.7805G→C, at the last base of exon 16, induced strong effects '
               'on splicing (figs 2A,B and 2C,D, respectively).',
               'Of these, the exonic variant BRCA2 c.8162TRC in exon 18 '
               'affects a position only three nucleotides upstream of the '
               'mutation BRCA2<<< c.8165CRG>>> (predicting p.T2722R), which is '
               'known to cause exon skipping by disrupting several ESE '
               'sites.20 Two variants, BRCA2 c.316+5GRC (IVS3+5GRC) and '
               'c.7805']},
 {'pmid': 20215541,
  'points': 30,
  'snippets': ['None of the five missense mutations directed to conserved ESE '
               'significantly altered splicing, although variant<<< '
               'c.8165C>G>>> (34) showed a weak exon 18 skipping in minigene '
               'context, also detected by other authors (33, 35).  Six '
               'positive splicing mutations showed simultaneou',
               'BRCA2<<< T2722R>>> is a deleterious allele that causes exon '
               'skipping.',
               'None of the five missense mutations directed to conserved ESE '
               'significantly altered splicing, although variant<<< '
               'c.8165C>G>>> (34) showed a weak exon 18 skipping in minigene '
               'context, also detected by other authors (33, 35). Six positive '
               'splicing mutations showed simultaneous']}]

chr17:g.43124027:ACT>A

[{'pmid': 20608970,
  'points': 15,
  'snippets': ['Library CAS PubMed Web of Science®Google ScholarUC-eLinks    * '
               '29 Buisson M, Anczukow O, Zetoune AB, Ware MD & Mazoyer S '
               '(2006) The 185delAG mutation <<<(c.68_69delAG>>>) in the BRCA1 '
               'gene triggers translation reinitiation at a downstream AUG '
               'codon.',
               'lar, cellular and clinical impact for mutation carriers.  ### '
               'Abbreviations  * BARD1      * BRCA1‐associated RING domain '
               'protein 1 * BRAT      * BRCA1<<< 185delAG>>> truncation * '
               'BRCA1      * breast cancer susceptibility gene 1 * BRCT      * '
               'BRCA1 C‐terminus  ##  Introduction  Family history is the '
               'strongest risk ',
               'n Y1853X, which lacks the last 11 amino acids, is only missing '
               'a small portion of the second BRCT (BRCA1 C‐terminus) repeat, '
               'whereas the 39 amino acid<<< 185delAG>>> mutant lacks all of '
               'BRCA1’s known functional domains.  image  Figure 1  Open in '
               'figure viewerPowerPoint  BRCA1 mutations and their cellular '
               'and physi']},
 {'pmid': 16267036,
  'points': 12,
  'snippets': ['It has previously been reported that this effect is '
               'responsible for the finding that the most prevalent BRCA1 '
               'mutation,<<< 187delAG>>>, is found on two haplotypes ( 12, '
               '21).',
               'The<<< 185delAG>>> BRCA1 mutation originated before the '
               'dispersion of Jews in the diaspora and is not limited to '
               'Ashkenazim.',
               'It has previously been reported that this effect is '
               'responsible for the finding that the most prevalent BRCA1 '
               'mutation,<<< 187delAG>>>, is found on two haplotypes (12, '
               '21).']},
 {'pmid': 15235020,
  'points': 12,
  'snippets': ['(Under this convention, the two mutations commonly referred to '
               'as “185delAG” and “5382insC” are named<<< 187delAG>>> and '
               '5385insC, respectively.',
               'mations of the expected number of Ashkenazi homozygotes and '
               'compound heterozygotes  There are two BRCA1 founder mutations '
               'in the Ashkenazi population,<<< 185delAG>>> and 5382insC.',
               'ividuals is unlikely to produce interesting results. However, '
               'in the Ashkenazi Jewish population, there are two founder '
               'frameshift mutations in BRCA1:<<< 185delAG>>> and 5382insC.']}]

chr13:g.32340526:AT>A

[{'pmid': 15695382,
  'points': 20,
  'snippets': ['The BRCA2 truncating<<< 6174delT>>> Ashkenazi Jewish founder '
               'mutation associated with a breast cancer risk of 70% by age 70 '
               '(32) was also included as a positive/inactivating control, wh',
               'The localization of BRCA2 and<<< 6174delT>>> to the nucleus '
               'and cytoplasm ( Table 2 ; Fig. 1C), respectively, was '
               'consistent with the C-terminal location of the human BRCA2 '
               'nuclear localization signals (4).',
               'VC8 cells transiently transfected with GFP-tagged wtBRCA2 '
               'and<<< 6174delT>>> BRCA2 and flow sorted for GFP were '
               'evaluated for MMC sensitivity by clonogenic survival assay and '
               'trypan blue exclusion. % surviving colonies and % viable cells '
               'in treated relative to untreated cells was plotted against MMC '
               'concentration.']},
 {'pmid': 15235020,
  'points': 20,
  'snippets': ['ponent of BRACAnalysis®, Myriad Genetic Laboratories offers a '
               'genotyping test for these two mutations, as well as the '
               'Ashkenazi BRCA2 founder mutation<<< 6174delT>>>, called '
               'Multisite3. At the time that 20 000 full sequence tests had '
               'been completed, 6895 Multisite3 tests had also been completed: '
               'in that sample set, 745 individuals were found to carry '
               '185delAG and 222 to carry 5382insC.',
               'ponent of BRACAnalysisH, Myriad Genetic Laboratories offers a '
               'genotyping test for these two mutations, as well as the '
               'Ashkenazi BRCA2 founder mutation<<< 6174delT>>>, called '
               'Multisite3.']},
 {'pmid': 22144684,
  'points': 20,
  'snippets': ['5946delT exon 11  c.6275_6276delTT exon 11  c.6644_6647delACTC '
               'exon 11  c.9026_9030delATCAT exon 23     (999del5)  '
               '(1538del4)  (3036del4)  (5873C>A)  <<<(6174delT>>>)  '
               '(6503delTT)  (6872del4)  (9254del5)     Iceland  European  '
               'Western Europe  East European  Ashkenazi Jews  Dutch  French  '
               'Northeast Spanish   IDF  4',
               '3 3 137 (4.34%)  21 5 1 6 3 3 2 41 (1.3%) 105 (0.93%)  8 – – 7 '
               '1 1 – 17 (0.54%) 26 (0.23%)  13 1 – 2 2 – 1 19 (0.6%) 1087 '
               '(9.49%)  c.5946delT exon 11 <<<(6174delT>>>) Ashkenazi Jews  '
               '126 (1.11%)  17 25 – 1 1 2 2 48 (1.52%)  c.4327C>T exon 13 '
               '(4446C>T) French Canadian  8 2 1 – 3 – 1 15 (0.48%) 93 '
               '(0.82%)  c.6275_62']}]

sambaxter commented 5 years ago

Just writing a comment so that my ID is present in the repo. I will take a look at this, hopefully by the end of the week, definitely by next Tuesday.

rcurrie commented 5 years ago

Following up with a deeper look at chr13:g.32363367:C>G. The lists above only included the top 3 hits for each variant, sorry! I should have included the full set of results for each variant but didn't want to make you look through too much. Below is a list of all papers found for this variant, ordered by points, along with their rank in the HGMD results that @sambaxter shared (thank you! This will be added to the stats report):

Missing from HGMD {'29446198', '19043619', '29394989'}
Not Crawled: set()
pmid        points  #snips  HGMD
22962691    40  3   -
21990134    40  3   4
24323938    40  3   6
25146914    40  3   7
20215541    30  3   -
18424508    30  3   -
30039884    20  3   -
21990165    20  2   -
23586058    20  2   -
20690207    20  2   -
20522429    20  3   -
20507642    20  2   -
19471317    20  2   -
18607349    20  3   3
18451181    20  3   -
18273839    20  3   -
26913838    20  2   -
27060066    20  2   -
28324225    20  3   9
15026808    20  2   -
29884841    20  3   -
29988080    20  2   -
23108138    20  3   5
28339459    15  3   8
25003164    10  1   -
24817641    10  1   -
24122022    10  1   -
10464631    10  1   -
22753008    10  1   -
21735045    10  1   -
21638052    10  1   -
12145750    10  3   1
21309043    10  1   -
17924331    10  1   -
17899372    10  1   -
16792514    10  1   -
15744044    10  1   -
12845657    10  1   -
21344236    10  1   -
19332451    2   3   -
16825284    2   2   -
12915465    2   2   -
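
A minimal sketch of how a report like the one above could be assembled, assuming "Missing from HGMD" means PMIDs that HGMD lists but the crawler did not return for this variant (the thread does not spell this out). The function name `compare_with_hgmd` and its arguments are hypothetical and are not taken from the crawler's code.

```python
def compare_with_hgmd(crawled_points, hgmd_rank):
    """Order crawled papers by points and report HGMD papers that were not found.

    crawled_points: dict mapping pmid (str) -> points assigned by the crawler
    hgmd_rank:      dict mapping pmid (str) -> rank in the HGMD results
    Both structures are assumptions made for this illustration.
    """
    # HGMD papers absent from the crawler's results for this variant.
    missing = set(hgmd_rank) - set(crawled_points)

    # Highest points first; sorted() is stable, so ties keep their input order.
    ordered = sorted(crawled_points.items(), key=lambda kv: -kv[1])

    # Use "-" for papers that HGMD does not list, as in the table above.
    rows = [(pmid, points, hgmd_rank.get(pmid, "-")) for pmid, points in ordered]
    return rows, missing


# A few rows from the table above. The rank given to 29446198 is a
# placeholder: the thread does not state ranks for the missing PMIDs.
crawled = {"22962691": 40, "21990134": 40, "18607349": 20, "12145750": 10}
hgmd = {"12145750": 1, "18607349": 3, "21990134": 4, "29446198": 2}

rows, missing = compare_with_hgmd(crawled, hgmd)
for pmid, points, rank in rows:
    print(f"{pmid}\t{points}\t{rank}")
print("Missing from HGMD", missing)
```

Keeping the HGMD rank next to the points makes cases like 12145750 easy to spot: ranked first by HGMD but scored low by the crawler.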

@sambaxter The Fackenthal paper is included (good) but appears way down the list. The crawler extracts the following HGVS from the paper:

'NM_000059.3:c.8393C>G'
'NM_004011.3:c.8393C>G|NM_004009.3:c.8393C>G|NM_004006.2:c.8393C>G'
'NM_007294.3:c.5080G>T', 'NM_007294.3:c.5080G>T',
'NM_000059.3:c.8165C>G'

In BRCA Exchange, chr13:g.32363367:C>G is listed with coding HGVS NM_000059.3:c.8165C>G, which exactly matches an entry in the above list, so the paper gets 10 points. My 10, 7, len - 5 scale was totally arbitrary for this new feature. We may want to tilt things much higher, say by giving exact matches 20 or 30 points, or by counting how many times the match appears (something the crawler's code flow doesn't really support at the moment). The other point, that a match in the title or abstract should receive much higher points, is a great idea as well; I need to see whether the code flow supports it easily at this point. Thank you again!
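
To make the scoring discussion concrete, here is a rough sketch of the exact-match-then-fallback idea, not the crawler's actual code. Only the 10 points for an exact HGVS match comes from this thread; the function names, the `synonym_points` value, and the treatment of `T2722R` as a BRCA Exchange synonym are assumptions made for illustration.

```python
def split_alternatives(items):
    """Split '|'-joined alternatives (as in the lists above) into a set of strings."""
    parts = set()
    for item in items:
        for part in item.split("|"):
            part = part.strip()
            if part:
                parts.add(part)
    return parts


def score_paper(parsed_hgvs, raw_texts, variant_hgvs, variant_synonyms,
                exact_points=10, synonym_points=5):
    """Return points for one paper against one variant.

    exact_points=10 follows the thread; synonym_points=5 is a hypothetical
    placeholder, since the fallback scale is not fully described here.
    """
    # First choice: an exact match on the parsed HGVS string.
    if variant_hgvs in split_alternatives(parsed_hgvs):
        return exact_points

    # Fallback: compare the raw text phrases with the variant's synonyms.
    texts = {t.lower() for t in split_alternatives(raw_texts)}
    synonyms = {s.lower() for s in variant_synonyms}
    if texts & synonyms:
        return synonym_points
    return 0


parsed = [
    "NM_000059.3:c.8393C>G",
    "NM_004011.3:c.8393C>G|NM_004009.3:c.8393C>G|NM_004006.2:c.8393C>G",
    "NM_007294.3:c.5080G>T",
    "NM_000059.3:c.8165C>G",
]
raw = ["8393CrG| 8393CrG", "Glu1694Ter", "E1694X", "T2722R| T2722R"]

print(score_paper(parsed, raw, "NM_000059.3:c.8165C>G", ["T2722R"]))
```

Raising `exact_points` to 20 or 30, or multiplying the result for title/abstract hits, would be small changes to a function shaped like this one.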

rcurrie commented 5 years ago

#15 Added as a feature to increase points for hits in title/abstract

rcurrie commented 5 years ago

To complete the picture, here are the raw 'texts' that the crawler found in Fackenthal/12145750:

array(['8393CrG| 8393CrG', '8393CrG| 8393CrG', 'Glu1694Ter', 'E1694X', 'T2722R| T2722R'], dtype=object)