Closed rcurrie closed 5 years ago
@amycoffin @melissacline (Samantha does not appear to be on GitHub, so I will email her)
Below are the top 3 papers found for a few variants based on the new ranking (the real site will return all papers, ranked). It would be useful to have you take a very cursory look to see whether the first paper is relevant and whether it's the best, or at least not significantly inferior to the next one.
chr13:g.32363367:C>G
[{'pmid': 22962691,
'points': 40,
'snippets': [', in the BRCA2 gene, besides the exon 7 mutations described '
'above, only one mutation in exon 3, c.231T>G (p.Thr77Thr),16 '
'and two mutations in exon 18,<<< c.8165C>G>>> (p.Thr2722Arg)28 '
'and c.7992T>A (p.Ile2664Ile),16 have been reported to induce '
'exon skipping by altering splicing regulatory elements.',
'BRCA2<<< T2722R>>> is a deleterious allele that causes exon '
'skipping.',
'CA2 gene, besides the exon 7 mutations described above, only '
'one mutation in exon 3, c.231T>G (p.Thr77Thr),16 and two '
'mutations in exon 18, c.8165C>G <<<(p.Thr2722Arg>>>)28 and '
'c.7992T>A (p.Ile2664Ile),16 have been reported to induce exon '
'skipping by altering splicing regulatory elements.']},
{'pmid': 18424508,
'points': 30,
'snippets': ['BRCA2<<< T2722R>>> is a deleterious allele that causes exon '
'skipping.',
'Of these, the exonic variant BRCA2 c.8162T→C in exon 18 '
'affects a position only three nucleotides upstream of the '
'mutation BRCA2 c.8165C→G (predicting<<< p.T2722R>>>), which is '
'known to cause exon skipping by disrupting several ESE '
'sites.20 Two variants, BRCA2 c.316+5G→C (IVS3+5G→C) and '
'c.7805G→C, at the last base of exon 16, induced strong effects '
'on splicing (figs 2A,B and 2C,D, respectively).',
'Of these, the exonic variant BRCA2 c.8162TRC in exon 18 '
'affects a position only three nucleotides upstream of the '
'mutation BRCA2<<< c.8165CRG>>> (predicting p.T2722R), which is '
'known to cause exon skipping by disrupting several ESE '
'sites.20 Two variants, BRCA2 c.316+5GRC (IVS3+5GRC) and '
'c.7805']},
{'pmid': 20215541,
'points': 30,
'snippets': ['None of the five missense mutations directed to conserved ESE '
'significantly altered splicing, although variant<<< '
'c.8165C>G>>> (34) showed a weak exon 18 skipping in minigene '
'context, also detected by other authors (33, 35). Six '
'positive splicing mutations showed simultaneou',
'BRCA2<<< T2722R>>> is a deleterious allele that causes exon '
'skipping.',
'None of the five missense mutations directed to conserved ESE '
'significantly altered splicing, although variant<<< '
'c.8165C>G>>> (34) showed a weak exon 18 skipping in minigene '
'context, also detected by other authors (33, 35). Six positive '
'splicing mutations showed simultaneous']}]
chr17:g.43124027:ACT>A
[{'pmid': 20608970,
'points': 15,
'snippets': ['Library CAS PubMed Web of Science®Google ScholarUC-eLinks * '
'29 Buisson M, Anczukow O, Zetoune AB, Ware MD & Mazoyer S '
'(2006) The 185delAG mutation <<<(c.68_69delAG>>>) in the BRCA1 '
'gene triggers translation reinitiation at a downstream AUG '
'codon.',
'lar, cellular and clinical impact for mutation carriers. ### '
'Abbreviations * BARD1 * BRCA1‐associated RING domain '
'protein 1 * BRAT * BRCA1<<< 185delAG>>> truncation * '
'BRCA1 * breast cancer susceptibility gene 1 * BRCT * '
'BRCA1 C‐terminus ## Introduction Family history is the '
'strongest risk ',
'n Y1853X, which lacks the last 11 amino acids, is only missing '
'a small portion of the second BRCT (BRCA1 C‐terminus) repeat, '
'whereas the 39 amino acid<<< 185delAG>>> mutant lacks all of '
'BRCA1’s known functional domains. image Figure 1 Open in '
'figure viewerPowerPoint BRCA1 mutations and their cellular '
'and physi']},
{'pmid': 16267036,
'points': 12,
'snippets': ['It has previously been reported that this effect is '
'responsible for the finding that the most prevalent BRCA1 '
'mutation,<<< 187delAG>>>, is found on two haplotypes ( 12, '
'21).',
'The<<< 185delAG>>> BRCA1 mutation originated before the '
'dispersion of Jews in the diaspora and is not limited to '
'Ashkenazim.',
'It has previously been reported that this effect is '
'responsible for the finding that the most prevalent BRCA1 '
'mutation,<<< 187delAG>>>, is found on two haplotypes (12, '
'21).']},
{'pmid': 15235020,
'points': 12,
'snippets': ['(Under this convention, the two mutations commonly referred to '
'as “185delAG” and “5382insC” are named<<< 187delAG>>> and '
'5385insC, respectively.',
'mations of the expected number of Ashkenazi homozygotes and '
'compound heterozygotes There are two BRCA1 founder mutations '
'in the Ashkenazi population,<<< 185delAG>>> and 5382insC.',
'ividuals is unlikely to produce interesting results. However, '
'in the Ashkenazi Jewish population, there are two founder '
'frameshift mutations in BRCA1:<<< 185delAG>>> and 5382insC.']}]
chr13:g.32340526:AT>A
[{'pmid': 15695382,
'points': 20,
'snippets': ['The BRCA2 truncating<<< 6174delT>>> Ashkenazi Jewish founder '
'mutation associated with a breast cancer risk of 70% by age 70 '
'(32) was also included as a positive/inactivating control, wh',
'The localization of BRCA2 and<<< 6174delT>>> to the nucleus '
'and cytoplasm ( Table 2 ; Fig. 1C), respectively, was '
'consistent with the C-terminal location of the human BRCA2 '
'nuclear localization signals (4).',
'VC8 cells transiently transfected with GFP-tagged wtBRCA2 '
'and<<< 6174delT>>> BRCA2 and flow sorted for GFP were '
'evaluated for MMC sensitivity by clonogenic survival assay and '
'trypan blue exclusion. % surviving colonies and % viable cells '
'in treated relative to untreated cells was plotted against MMC '
'concentration.']},
{'pmid': 15235020,
'points': 20,
'snippets': ['ponent of BRACAnalysis®, Myriad Genetic Laboratories offers a '
'genotyping test for these two mutations, as well as the '
'Ashkenazi BRCA2 founder mutation<<< 6174delT>>>, called '
'Multisite3. At the time that 20 000 full sequence tests had '
'been completed, 6895 Multisite3 tests had also been completed: '
'in that sample set, 745 individuals were found to carry '
'185delAG and 222 to carry 5382insC.',
'ponent of BRACAnalysisH, Myriad Genetic Laboratories offers a '
'genotyping test for these two mutations, as well as the '
'Ashkenazi BRCA2 founder mutation<<< 6174delT>>>, called '
'Multisite3.']},
{'pmid': 22144684,
'points': 20,
'snippets': ['5946delT exon 11 c.6275_6276delTT exon 11 c.6644_6647delACTC '
'exon 11 c.9026_9030delATCAT exon 23 (999del5) '
'(1538del4) (3036del4) (5873C>A) <<<(6174delT>>>) '
'(6503delTT) (6872del4) (9254del5) Iceland European '
'Western Europe East European Ashkenazi Jews Dutch French '
'Northeast Spanish IDF 4',
'3 3 137 (4.34%) 21 5 1 6 3 3 2 41 (1.3%) 105 (0.93%) 8 – – 7 '
'1 1 – 17 (0.54%) 26 (0.23%) 13 1 – 2 2 – 1 19 (0.6%) 1087 '
'(9.49%) c.5946delT exon 11 <<<(6174delT>>>) Ashkenazi Jews '
'126 (1.11%) 17 25 – 1 1 2 2 48 (1.52%) c.4327C>T exon 13 '
'(4446C>T) French Canadian 8 2 1 – 3 – 1 15 (0.48%) 93 '
'(0.82%) c.6275_62']}]
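For anyone reviewing these by eye, here is a minimal sketch (not the crawler's code) for pulling out just the highlighted match text from each hit. It assumes matches are always wrapped as `<<<...>>>` as in the output above; `highlighted_terms` is a hypothetical helper name.

```python
import re

# Assumption: every match is wrapped as <<<...>>>, as in the snippets above.
HIGHLIGHT = re.compile(r"<<<(.*?)>>>", re.DOTALL)


def highlighted_terms(snippets):
    """Return the distinct highlighted match texts, in first-seen order."""
    seen = []
    for snippet in snippets:
        for raw in HIGHLIGHT.findall(snippet):
            # Collapse whitespace and drop a stray parenthesis caught by the marker.
            term = " ".join(raw.split()).strip("()")
            if term and term not in seen:
                seen.append(term)
    return seen


snippets = [
    "and two mutations in exon 18,<<< c.8165C>G>>> (p.Thr2722Arg)28",
    "BRCA2<<< T2722R>>> is a deleterious allele that causes exon skipping.",
    "mutations in exon 18, c.8165C>G <<<(p.Thr2722Arg>>>)28 and",
]
print(highlighted_terms(snippets))
```

This makes it easier to see which synonym (cDNA, protein, or legacy name) each paper actually matched on.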
Just writing a comment so that my ID is present in the repo. I will take a look at this, hopefully by the end of the week and definitely by next Tuesday.
Following up with a deeper look at chr13:g.32363367:C>G. The lists above only included the top 3 hits for each variant. Sorry! I should have included the full set of results for each variant but didn't want to make you look through too much. Below is a list of all papers found for this variant, ordered by points, along with their rank in the HGMD results that @sambaxter shared (thank you! This will be added to the stats report):
Missing from HGMD {'29446198', '19043619', '29394989'}
Not Crawled: set()
pmid points #snips HGMD
22962691 40 3 -
21990134 40 3 4
24323938 40 3 6
25146914 40 3 7
20215541 30 3 -
18424508 30 3 -
30039884 20 3 -
21990165 20 2 -
23586058 20 2 -
20690207 20 2 -
20522429 20 3 -
20507642 20 2 -
19471317 20 2 -
18607349 20 3 3
18451181 20 3 -
18273839 20 3 -
26913838 20 2 -
27060066 20 2 -
28324225 20 3 9
15026808 20 2 -
29884841 20 3 -
29988080 20 2 -
23108138 20 3 5
28339459 15 3 8
25003164 10 1 -
24817641 10 1 -
24122022 10 1 -
10464631 10 1 -
22753008 10 1 -
21735045 10 1 -
21638052 10 1 -
12145750 10 3 1
21309043 10 1 -
17924331 10 1 -
17899372 10 1 -
16792514 10 1 -
15744044 10 1 -
12845657 10 1 -
21344236 10 1 -
19332451 2 3 -
16825284 2 2 -
12915465 2 2 -
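To make the comparison with HGMD easier to read, here is a rough sketch that reports where each HGMD-ranked paper lands in the crawler's ordering. The tuple layout and the name `hgmd_positions` are assumptions for illustration only, and the rows below are just a few lines copied from the table above.

```python
def hgmd_positions(rows):
    """rows: (pmid, points, hgmd_rank or None) tuples.

    Returns (hgmd_rank, pmid, crawler_position) sorted by HGMD rank.
    """
    # sorted() is stable, so papers with equal points keep their input order.
    ordered = sorted(rows, key=lambda row: -row[1])
    hits = [
        (rank, pmid, position)
        for position, (pmid, _points, rank) in enumerate(ordered, start=1)
        if rank is not None
    ]
    return sorted(hits)


rows = [
    (22962691, 40, None),
    (21990134, 40, 4),
    (18607349, 20, 3),
    (12145750, 10, 1),
]
for rank, pmid, position in hgmd_positions(rows):
    print(f"HGMD #{rank}: pmid {pmid} is at crawler position {position}")
```

A large gap between HGMD rank and crawler position is a hint that the points for that paper are too low.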
@sambaxter The Fackenthal paper is included (good) but appears way down the list. The crawler extracts the following HGVS from the paper:
'NM_000059.3:c.8393C>G'
'NM_004011.3:c.8393C>G|NM_004009.3:c.8393C>G|NM_004006.2:c.8393C>G'
'NM_007294.3:c.5080G>T', 'NM_007294.3:c.5080G>T',
'NM_000059.3:c.8165C>G'
In BRCA Exchange, chr13:g.32363367:C>G is listed with the coding HGVS NM_000059.3:c.8165C>G, which exactly matches an entry in the above list, so the paper gets 10 points. My 10, 7, len - 5 scale was totally arbitrary for this new feature. We may want to tilt things much higher, say by giving exact matches 20 or 30 points, or by counting how many times the match appears (something the crawler's code flow doesn't really support at the moment). The other point, that a match in the title or abstract should receive much higher points, is a great idea as well. I need to see whether the code flow supports it easily at this point. Thank you again!
To complete the picture, here are the raw 'texts' that the crawler found in Fackenthal/12145750: array(['8393CrG| 8393CrG', '8393CrG| 8393CrG', 'Glu1694Ter', 'E1694X', 'T2722R| T2722R'], dtype=object)
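As a rough sketch of the exact-match scoring discussed here (not the crawler's actual code): split the pipe-joined entries, drop duplicates, and award points when the variant's coding HGVS appears exactly. The 10 points come from this thread. The function names, the `text_points` value of 2, and the synonym set are assumptions for illustration only.

```python
EXACT_HGVS_POINTS = 10  # value from this thread; could be raised to 20 or 30


def normalize(entries):
    """Split 'a| b' style entries and drop duplicates, keeping first-seen order."""
    seen = []
    for entry in entries:
        for part in entry.split("|"):
            part = part.strip()
            if part and part not in seen:
                seen.append(part)
    return seen


def score_paper(paper_hgvs, paper_texts, variant_hgvs, variant_synonyms, text_points=2):
    """Exact HGVS match wins; otherwise score raw texts found among the synonyms."""
    if variant_hgvs in normalize(paper_hgvs):
        return EXACT_HGVS_POINTS
    # Hypothetical fallback weight: text_points per raw text that is a known synonym.
    return text_points * sum(1 for text in normalize(paper_texts) if text in variant_synonyms)


hgvs = [
    "NM_000059.3:c.8393C>G",
    "NM_004011.3:c.8393C>G|NM_004009.3:c.8393C>G|NM_004006.2:c.8393C>G",
    "NM_007294.3:c.5080G>T",
    "NM_000059.3:c.8165C>G",
]
texts = ["8393CrG| 8393CrG", "8393CrG| 8393CrG", "Glu1694Ter", "E1694X", "T2722R| T2722R"]
print(score_paper(hgvs, texts, "NM_000059.3:c.8165C>G", {"T2722R"}))
```

Keeping the point values as named constants or parameters would make it cheap to try the higher weights suggested above.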
The crawler now calculates 'points' per hit and, when no parsed HGVS matches can be found, falls back to looking for the text phrase found in the paper among the synonyms in BRCA Exchange. Changes: