BRCAChallenge / literature-search

Crawl and annotate pubmed articles with variants
https://brcaexchange.org/

Update literature.json structure to support ranking by points #12

Closed rcurrie closed 5 years ago

rcurrie commented 5 years ago

The current literature.json stores, for each variant, a dictionary keyed by pmid whose values are lists of snippets:

{'chr13:g.32363367:C>G': {'10464631': ['6        * BRCA2<<< T2722R>>> Is a '
                                       'Deleterious Allele That Causes Exon '
                                       'Skipping  The American Journal of '
                                       'Human Genetics, Vol. 71, No. 3        '
                                       '* Novel germline CDH1 mutations in h'],
                          '12145750': ['71:625–631, 2002  Report  BRCA2<<< '
                                       'T2722R>>> Is a Deleterious Allele That '
                                       'Causes Exon Skipping James D.',
                                       'These predictions revealed that '
                                       'BRCA2<<< T2722R>>> (8393CrG), which '
                                       'segregates with affected individuals '
                                       'in a family with breast cancer, '
                                       'disrupts three potential ESE sites.',
                                       'Of these, only one allele, BRCA2<<< '
                                       'T2722R>>>, segregated with affected '
                                       'members of a family with breast cancer '
                                       '(family number 98-11).'],
                                       ...

This does not allow the papers to be ordered by quality of hit. Two general options are to switch to a list of tuples of (pmid, points, list of top 3 snippets for this paper), sorted by points:

[(22962691,
  40,
  [', in the BRCA2 gene, besides the exon 7 mutations described above, only '
   'one mutation in exon 3, c.231T>G (p.Thr77Thr),16 and two mutations in exon '
   '18,<<< c.8165C>G>>> (p.Thr2722Arg)28 and c.7992T>A (p.Ile2664Ile),16 have '
   'been reported to induce exon skipping by altering splicing regulatory '
   'elements.',
   'BRCA2<<< T2722R>>> is a deleterious allele that causes exon skipping.',
   'CA2 gene, besides the exon 7 mutations described above, only one mutation '
   'in exon 3, c.231T>G (p.Thr77Thr),16 and two mutations in exon 18, '
   'c.8165C>G <<<(p.Thr2722Arg>>>)28 and c.7992T>A (p.Ile2664Ile),16 have been '
   'reported to induce exon skipping by altering splicing regulatory '
   'elements.']),
 (18424508,
  30,
  ['BRCA2<<< T2722R>>> is a deleterious allele that causes exon skipping.',
   'Of these, the exonic variant BRCA2 c.8162T→C in exon 18 affects a position '
   'only three nucleotides upstream of the mutation BRCA2 c.8165C→G '
   '(predicting<<< p.T2722R>>>), which is known to cause exon skipping by '
   'disrupting several ESE sites.20  Two variants, BRCA2 c.316+5G→C '
   '(IVS3+5G→C) and c.7805G→C, at the last base of exon 16, induced strong '
   'effects on splicing (figs 2A,B and 2C,D, respectively).',
   'Of these, the exonic variant BRCA2 c.8162TRC in exon 18 affects a position '
   'only three nucleotides upstream of the mutation BRCA2<<< c.8165CRG>>> '
   '(predicting p.T2722R), which is known to cause exon skipping by disrupting '
   'several ESE sites.20 Two variants, BRCA2 c.316+5GRC (IVS3+5GRC) and '
   'c.7805']),

or a dictionary keyed by pmid, with the points stored alongside the snippets, where you'd need to sort and show based on points:

{12145750: {'points': 10,
            'snippets': ['71:625–631, 2002  Report  BRCA2<<< T2722R>>> Is a '
                         'Deleterious Allele That Causes Exon Skipping James '
                         'D.',
                         'These predictions revealed that BRCA2<<< T2722R>>> '
                         '(8393CrG), which segregates with affected '
                         'individuals in a family with breast cancer, disrupts '
                         'three potential ESE sites.',
                         'Of these, only one allele, BRCA2<<< T2722R>>>, '
                         'segregated with affected members of a family with '
                         'breast cancer (family number 98-11).']},
 12915465: {'points': 2,
            'snippets': ['(  2002  ) BRCA2<<< T2722R>>> is a deleterious '
                         'allele that causes exon skipping.  Am.',
                         '(2002) BRCA2<<< T2722R>>> is a deleterious allele '
                         'that causes exon skipping.']},

@falquaddoomi @zfisch Let me know which you'd prefer - no massive rush as I have some additional validation on the results to do.
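
For comparison, the two shapes carry the same information, and the second can be turned into the first with a single sort. The following is a minimal sketch, assuming the dictionary-keyed-by-pmid shape shown above; the helper name `to_ranked_tuples` and the shortened snippet strings are hypothetical and not part of the crawler.

```python
# Hypothetical data in the dictionary-keyed-by-pmid shape.
# Snippet text is shortened for illustration.
by_pmid = {
    12145750: {"points": 10, "snippets": ["snippet a", "snippet b"]},
    12915465: {"points": 2, "snippets": ["snippet c"]},
    22962691: {"points": 40, "snippets": ["snippet d"]},
}


def to_ranked_tuples(papers, max_snippets=3):
    """Return (pmid, points, snippets) tuples, highest points first."""
    ranked = sorted(
        papers.items(), key=lambda item: item[1]["points"], reverse=True
    )
    return [
        (pmid, entry["points"], entry["snippets"][:max_snippets])
        for pmid, entry in ranked
    ]


ranked = to_ranked_tuples(by_pmid)
for pmid, points, snippets in ranked:
    print(pmid, points, len(snippets))
```

One practical difference: once serialized with Python's `json` module, tuples become arrays and integer dictionary keys become strings, so the list form preserves both the order and the numeric pmid, while the dictionary form leaves sorting to the consumer.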

rcurrie commented 5 years ago

Or...

chr13:g.32363367:C>G
[{'pmid': 22962691,
  'points': 40,
  'snippets': [', in the BRCA2 gene, besides the exon 7 mutations described '
               'above, only one mutation in exon 3, c.231T>G (p.Thr77Thr),16 '
               'and two mutations in exon 18,<<< c.8165C>G>>> (p.Thr2722Arg)28 '
               'and c.7992T>A (p.Ile2664Ile),16 have been reported to induce '
               'exon skipping by altering splicing regulatory elements.',
               'BRCA2<<< T2722R>>> is a deleterious allele that causes exon '
               'skipping.',
               'CA2 gene, besides the exon 7 mutations described above, only '
               'one mutation in exon 3, c.231T>G (p.Thr77Thr),16 and two '
               'mutations in exon 18, c.8165C>G <<<(p.Thr2722Arg>>>)28 and '
               'c.7992T>A (p.Ile2664Ile),16 have been reported to induce exon '
               'skipping by altering splicing regulatory elements.']},
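
A consumer of this third shape could pick the best papers for a variant roughly as follows. This is a sketch under assumptions: the `variants` mapping, the `top_papers` helper, and the shortened snippets are hypothetical, and because the thread does not say whether each list is stored pre-sorted, the sketch sorts defensively.

```python
# Hypothetical excerpt in the per-variant list-of-dictionaries shape.
variants = {
    "chr13:g.32363367:C>G": [
        {"pmid": 18424508, "points": 30, "snippets": ["snippet b"]},
        {"pmid": 22962691, "points": 40, "snippets": ["snippet a"]},
    ],
}


def top_papers(variants, variant, limit=10):
    """Return up to `limit` papers for a variant, highest points first."""
    papers = variants.get(variant, [])
    return sorted(papers, key=lambda paper: paper["points"], reverse=True)[:limit]


top = top_papers(variants, "chr13:g.32363367:C>G")
for paper in top:
    print(paper["pmid"], paper["points"])
```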

rcurrie commented 5 years ago

There's also a new 'date' field in ISO format indicating the day the crawl happened:

{
    "date": "2018-11-14T12:00:00-0800",
    "papers": ...
    "variants": ...
}
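
Reading the date back is straightforward. The sketch below assumes only the "date" value shown above; the `crawl` dictionary is a hypothetical stand-in for the loaded file. It uses `datetime.strptime` because the offset is written without a colon (`-0800`), which the `%z` directive accepts.

```python
from datetime import datetime

# Hypothetical stand-in for the loaded literature.json; only "date" is shown.
crawl = {"date": "2018-11-14T12:00:00-0800"}

# %z parses a UTC offset written without a colon, such as -0800.
crawl_date = datetime.strptime(crawl["date"], "%Y-%m-%dT%H:%M:%S%z")

print(crawl_date.date().isoformat())
print(crawl_date.utcoffset())
```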

rcurrie commented 5 years ago

I've generated a preliminary literature.json using the same set of papers we downloaded for the current production but with the new matching and points:

3615 Papers and 9783 Variants vs. 2227 Papers and 3754 Variants currently in production

I have a bit more cleanup to do but then I'm going to do a new crawl to add papers from 11/14 till now so that once we settle on a new format we can use the latest papers as well.

melissacline commented 5 years ago

Awesome!

On Wed, Feb 27, 2019 at 4:21 PM Rob Currie wrote:

I've generated a preliminary literature.json (http://public.gi.ucsc.edu/~rcurrie/literature-2019-02-27.json) using the same set of papers we downloaded for the current production but with the new matching and points:

3615 Papers and 9783 Variants vs. 2227 Papers and 3754 Variants currently in production

I have a bit more cleanup to do but then I'm going to do a new crawl to add papers from 11/14 till now so that once we settle on a new format we can use the latest papers as well.

rcurrie commented 5 years ago

Ran an updated crawl for all PMIDs as of yesterday with BRCA in the title or abstract:

17459 vs. 16980 PMIDs attempted (479 papers since last crawl on 11/18)

14520 (83%) vs. 14088 (83%) succeeded

3701 Papers and 9886 Variants exported (86 additional papers that could be downloaded and had variants found since 11/18)
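
The figures above can be cross-checked with a few lines of arithmetic. This sketch uses only numbers quoted in this thread. Taking the preliminary export (3615 Papers and 9783 Variants) as the baseline for the exported deltas is an assumption, though it does reproduce the 86 papers quoted.

```python
# Counts quoted in this thread: (new crawl, previous crawl).
attempted_new, attempted_old = 17459, 16980
succeeded_new, succeeded_old = 14520, 14088

# Exported counts; the baseline is assumed to be the preliminary export.
papers_new, papers_old = 3701, 3615
variants_new, variants_old = 9886, 9783

new_pmids = attempted_new - attempted_old
rate_new = round(100 * succeeded_new / attempted_new)
rate_old = round(100 * succeeded_old / attempted_old)
new_papers = papers_new - papers_old
new_variants = variants_new - variants_old

print(f"{new_pmids} new PMIDs attempted")
print(f"success rate: {rate_new}% vs. {rate_old}%")
print(f"{new_papers} more papers and {new_variants} more variants exported")
```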

rcurrie commented 5 years ago

@zfisch New crawl with papers as of yesterday in the new format (the last one above, with a full dictionary per paper under each variant) is here

zfisch commented 5 years ago

Pull request on brcaexchange to handle new data and sorting at https://github.com/BRCAChallenge/brca-exchange/pull/1016 👍