BRCAChallenge / literature-search

Crawl and annotate pubmed articles with variants
https://brcaexchange.org/

Debug missing founder mutation articles by variant reported from user #10

Closed rcurrie closed 5 years ago

rcurrie commented 5 years ago

Missing citations as reported by Charles Warden on the initial beta release.

BRCA1 185delAG: https://brcaexchange.org/variant/183889

Abeliovich et al. 1997 → https://www.ncbi.nlm.nih.gov/pubmed/9042909 (in article title)

Antoniou et al. 2005 → https://www.ncbi.nlm.nih.gov/pubmed/15994883 (in article title)

Chodick et al. 2008 → https://www.ncbi.nlm.nih.gov/pubmed/18158280 (in article abstract)

Elstrodt et al. 2006 → https://www.ncbi.nlm.nih.gov/pubmed/16397213 (in article abstract)

Finkelman et al. 2012 → https://www.ncbi.nlm.nih.gov/pubmed/22430266 (in article abstract)

Gabai-Kapara et al. 2014 → https://www.ncbi.nlm.nih.gov/pubmed/25192939 (in article introduction and methods)

King et al. 2003 → https://www.ncbi.nlm.nih.gov/pubmed/14576434 (in article)

Konishi et al. 2011 → https://www.ncbi.nlm.nih.gov/pubmed/21987798 (in abstract)

Linger and Kruk 2010 → https://www.ncbi.nlm.nih.gov/pubmed/20608970 (in article)

Satagopan et al. 2001 → https://www.ncbi.nlm.nih.gov/pubmed/11352856 (in abstract)

Satagopan et al. 2002 → https://www.ncbi.nlm.nih.gov/pubmed/12473589 (in abstract)

Stadler et al. 2012 → https://www.ncbi.nlm.nih.gov/pubmed/21598239 (in abstract)

Struewing et al. 1997 → https://www.ncbi.nlm.nih.gov/pubmed/9145676 (in abstract)

BRCA1 5382insC: https://brcaexchange.org/variant/180141

Abeliovich et al. 1997 → https://www.ncbi.nlm.nih.gov/pubmed/9042909 (in article title)

Antoniou et al. 2005 → https://www.ncbi.nlm.nih.gov/pubmed/15994883 (in article title)

Finkelman et al. 2012 → https://www.ncbi.nlm.nih.gov/pubmed/22430266 (in article)

Gabai-Kapara et al. 2014 → https://www.ncbi.nlm.nih.gov/pubmed/25192939 (in article introduction and methods)

King et al. 2003 → https://www.ncbi.nlm.nih.gov/pubmed/14576434 (in article)

Mgbemena et al. 2017 → https://www.ncbi.nlm.nih.gov/pubmed/28122244 (in abstract)

Satagopan et al. 2002 → https://www.ncbi.nlm.nih.gov/pubmed/12473589 (in abstract)

Struewing et al. 1997 → https://www.ncbi.nlm.nih.gov/pubmed/9145676 (in abstract)

BRCA2 6174delT: https://brcaexchange.org/variant/177049

Abeliovich et al. 1997 → https://www.ncbi.nlm.nih.gov/pubmed/9042909 (in article title)

Antoniou et al. 2005 → https://www.ncbi.nlm.nih.gov/pubmed/15994883 (in article title)

Chodick et al. 2008 → https://www.ncbi.nlm.nih.gov/pubmed/18158280 (in article abstract)

Finkelman et al. 2012 → https://www.ncbi.nlm.nih.gov/pubmed/22430266 (in article abstract)

Gabai-Kapara et al. 2014 → https://www.ncbi.nlm.nih.gov/pubmed/25192939 (in article introduction and methods)

Gallagher et al. 2010 → https://www.ncbi.nlm.nih.gov/pubmed/20215531 (in methods)

King et al. 2003 → https://www.ncbi.nlm.nih.gov/pubmed/14576434 (in article)

Satagopan et al. 2001 → https://www.ncbi.nlm.nih.gov/pubmed/11352856 (in abstract)

Satagopan et al. 2002 → https://www.ncbi.nlm.nih.gov/pubmed/12473589 (in abstract)

Struewing et al. 1997 → https://www.ncbi.nlm.nih.gov/pubmed/9145676 (in abstract)

rcurrie commented 5 years ago

BRCA2 6174delT: https://brcaexchange.org/variant/177049:

The build lists this as chr13:g.32340526:AT>A, which the hgvs library normalizes to NC_000013.11:g.32340526_32340528delinsA, but the mention NM_000059.3:c.6174delT normalizes to NC_000013.11:g.32340529del, so there are no normalized matches.
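
A toy sketch of one general reason two descriptions of the same single-base deletion can end up as different strings: VCF-style records are left-aligned, while HGVS normalization shifts a variant to its most 3' position. The reference string and helper names below are made up for illustration; this is not the hgvs library's API, the real chr13 sequence is not used, and it does not reproduce the delinsA output quoted above.

```python
def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return `ref` with `length` bases removed, starting at 0-based `start`."""
    return ref[:start] + ref[start + length:]


def shift_3prime(ref: str, start: int, length: int) -> int:
    """Slide a deletion toward the 3' end while the resulting sequence is unchanged."""
    while start + length < len(ref) and ref[start] == ref[start + length]:
        start += 1
    return start


# Toy reference with a run of three T's (an assumption, not real genomic sequence).
TOY_REF = "CATTTG"

# A left-aligned record "AT>A" anchored on the A removes the first T of the run.
left_aligned_start = 2
right_shifted_start = shift_3prime(TOY_REF, left_aligned_start, 1)

# Both positions describe the same edit, yet their coordinates differ by two.
same_result = (
    apply_deletion(TOY_REF, left_aligned_start, 1)
    == apply_deletion(TOY_REF, right_shifted_start, 1)
)
```

Comparing the resulting sequences (or shifting both sides the same way before comparing) treats the two as equal, whereas comparing the position strings does not.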

melissacline commented 5 years ago

That's a great example of the synonym problem!


rcurrie commented 5 years ago

I have a much more liberal approach which I'm evaluating on a crawl of just the papers listed in this bug report. The current approach finds 100% of the papers for the first and last variants, but only one paper for the middle one.

New algorithm tries to match conservatively and then more and more liberally per mention. Steps and some examples (first number is the PMID) below.

@melissacline @diekhans @maximilianh @amycoffin I would welcome a quick read on whether these progressively more liberal steps seem reasonable, based on scanning some of the examples below.

1) Normalize to hgvs genomic and look for an exact match (this is how the current site was built)

22430266 Normalized match: NM_007297.3:c.2830A>T => NC_000017.11:g.43092560T>A

22430266 Normalized match: NM_000059.3:c.6768T>A => NC_000013.11:g.32341123T>A

16397213 Normalized match: NM_007297.3:c.20A>G => NC_000017.11:g.43106507T>C

2) Normalize to hgvs and see if that 'string' is anywhere in a synonym list

9042909 Normalized matched synonym: NM_000059.3:c.6174T>None => NC_000013.11:g.32340529del => LRG_293t1.c.6174delT,NC_000013.11.g.32340529delT,NC_000013.10.g.32914666delT,NM_000059.3.c.6174delT,LRG_293p1.p.Phe2058Leufs,LRG_293.g.30050delT,U43746.1:c.6401delT

12473589 Normalized matched synonym: NM_000059.3:c.6174T>None => NC_000013.11:g.32340529del => LRG_293t1.c.6174delT,NC_000013.11.g.32340529delT,NC_000013.10.g.32914666delT,NM_000059.3.c.6174delT,LRG_293p1.p.Phe2058Leufs,LRG_293.g.30050delT,U43746.1:c.6401delT

15994883 Normalized matched synonym: NM_000059.3:c.6174T>None => NC_000013.11:g.32340529del => LRG_293t1.c.6174delT,NC_000013.11.g.32340529delT,NC_000013.10.g.32914666delT,NM_000059.3.c.6174delT,LRG_293p1.p.Phe2058Leufs,LRG_293.g.30050delT,U43746.1:c.6401delT

3) Fall back to the list of 'text' extracted by var finder and see if it's in any synonyms

9042909 Texts matched synonym: 6174delT => p.S1982RfsX22,p.Ser1982fs,NC_000013.10.g.32914438del,p.Ser1982Argfs22,NM_000059.3.c.5946delT,U43746.1.n.6174delT,NM_000059.3.c.5946delT(6174delT),LRG_293.g.29822delT,p.S1982Rfs22,LRG_293p1.p.Ser1982Argfs,NG_012772.3.g.29822delT,6174delT,NM_000059.3.c.5946del,p.Ser1982Argfs,LRG_293t1.c.5946delT,p.Ser1982ArgfsX22,NC_000013.11.g.32340301delT,1-BP_DEL,6174DELT,NC_000013.10.g.32914438delT,NM_000059.3(BRCA2).c.5946delT,U43746.1:c.6173delT

14576434 Texts matched synonym: 185delAG => 187delAG,NR_027676.1.n.227_228delAG,LRG_292p1.p.Glu23Valfs,p.E23VfsX17,NC_000017.11.g.43124030_43124031delCT,LRG_292t1.c.68_69delAG,NM_007294.3.c.66_67delAG,NR_027676.1.c.229_230delAG,NM_007294.3.c.68_69delAG,NC_000017.11.g.43124028_43124029delCT,NC_000017.10.g.41276047_41276048delCT,NM007297.3.c.-22-21delAG,LRG_292.g.93955_93956delAG,185delAG,185_186delAG,NM_007294.3.c.68_69del,NC_000017.10.g.41276045_41276046delCT,p.Glu23Valfs,NM_007299.3.c.68_69delAG,NM_007294.3.c.66_67del,p.Glu23Valfs17,NG_005905.2.g.93955_93956delAG,187DELAG,NM_007294.3.c.68_69delAG(185delAG_or_187delAG),NG_005905.2.g.93953_93954delAG,U14680.1.c.66_67delAG,p.Glu23fs,U14680.1.n.185_186delAG,NM_007300.3.c.66_67delAG,p.E23VFS17,p.Glu23ValfsX17,185delAB,NM_007294.3(BRCA1).c.68_69delAG,NM_007294.2:c.68_69delAG,NM_007300.3:c.68_69delAG,NM_007299.3:c.68_69delAG,NM_007298.3:c.68_69delAG,NM007297.3:c.-20-19delAG,U14680.1:c.180_181delAG

21598239 Normalized matched synonym: NM_000059.3:c.6174T>None => NC_000013.11:g.32340529del => LRG_293t1.c.6174delT,NC_000013.11.g.32340529delT,NC_000013.10.g.32914666delT,NM_000059.3.c.6174delT,LRG_293p1.p.Phe2058Leufs,LRG_293.g.30050delT,U43746.1:c.6401delT

Fails

Failed to match: hgvsCoding: Mapped: None Texts: G5255A

Failed to match: hgvsCoding: Mapped: None Texts: G1639T

Failed to match: hgvsCoding: NM_000410.3:c.845G>A Mapped: NC_000006.12:g.26092913G>A Texts: C282Y|C282Y

Failed to match: hgvsCoding: Mapped: None Texts: S2A

Failed to match: hgvsCoding: NM_001126112.2:c.2288T>None|NM_001276695.1:c.2288T>None|NM_000546.5:c.2288T>None Mapped: None Texts: 2288delT| 2288delT

????

16397213 Texts matched synonym: 1G>A => U43746.1:c.-641G>A

16397213 Texts matched synonym: 3G>A => LRG_293t1.c.-481G>A,NM_000059.3.c.-481G>A,-253_G>A,-253 G>A,NC_000013.11.g.32315226G>A,LRG_293.g.4747G>A,NG_012772.3.g.4747G>A,NC_000013.10.g.32889363G>A,U43746.1:c.-253G>A

16397213 Texts matched synonym: 1G>A => U43746.1:c.-641G>A
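
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the `Variant` and `Mention` classes, the `match_mention` function, and the toy records are all hypothetical, and step 2 is simplified to exact set membership even though the real synonym strings vary in punctuation (for example `.` versus `:` after the accession).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Variant:
    variant_id: str                      # hypothetical identifier
    genomic_hgvs: str                    # normalized genomic HGVS from the build
    synonyms: Set[str] = field(default_factory=set)


@dataclass
class Mention:
    pmid: str
    normalized: Optional[str]            # None when the mention could not be mapped
    texts: List[str] = field(default_factory=list)  # raw snippets from the paper


def match_mention(mention: Mention, variants: List[Variant]) -> List[Tuple[str, int]]:
    """Return (variant_id, step) pairs from the most conservative step that hits."""
    if mention.normalized is not None:
        # Step 1: exact match on normalized genomic HGVS.
        exact = [v.variant_id for v in variants if v.genomic_hgvs == mention.normalized]
        if exact:
            return [(vid, 1) for vid in exact]
        # Step 2: normalized string appears in a synonym list.
        by_synonym = [v.variant_id for v in variants if mention.normalized in v.synonyms]
        if by_synonym:
            return [(vid, 2) for vid in by_synonym]
    # Step 3: fall back to the raw extracted text.
    by_text = [v.variant_id for v in variants
               if any(t in v.synonyms for t in mention.texts)]
    return [(vid, 3) for vid in by_text]


# Toy records, not the real BRCA Exchange synonym lists.
VARIANTS = [
    Variant("toy-a", "NC_000017.11:g.43092560T>A"),
    Variant("toy-b", "NC_000013.11:g.32340526_32340528delinsA",
            {"NC_000013.11:g.32340529del"}),
    Variant("toy-c", "toy:g.100del", {"185delAG"}),
]

step1 = match_mention(Mention("pmid-a", "NC_000017.11:g.43092560T>A"), VARIANTS)
step2 = match_mention(Mention("pmid-b", "NC_000013.11:g.32340529del"), VARIANTS)
step3 = match_mention(Mention("pmid-c", None, ["185delAG"]), VARIANTS)
unmatched = match_mention(Mention("pmid-d", None, ["G5255A"]), VARIANTS)
```

Returning the step number alongside each hit keeps the information about how liberal the match was, so a caller can decide how much to trust it.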

melissacline commented 5 years ago

Well, we should definitely see if the converted hgvs is in the synonym list (#2). The pyhgvs 38 coordinates don't represent indels with the "most correct" hgvs, which utilizes the del/ins/dup syntax rather than the amino acid substitutions (e.g. A>AC). So, we're not going to get an exact match on the indels (#1). There are normal variations on that del/ins/dup syntax, such as whether it's more correct to list dup or dupA, but both are generally in the list of synonyms.

#3 looks like a match of the variant portion only to the set of synonyms?

If so, that would get us into trouble. One often sees the same variant IDs in BRCA1 as BRCA2, not to mention extension to other genes. Or, is #3 something else?

rcurrie commented 5 years ago

BRCA1 5382insC: https://brcaexchange.org/variant/180141

Of all the papers listed, only one (22430266) has HGVS unambiguously listed. All the others have only the pattern "5382insC" which shows up in two synonym lists:

https://brcaexchange.org/variant/169240 (chr17:g.43057059:T>TG) and https://brcaexchange.org/variant/180141 (chr17:g.43057062:T>TG)

If we don't look for 'texts' (i.e. the original snippet that varFinder picked up in the paper before any normalization or canonicalization), then we'll miss all but one of these papers.
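
A small sketch of how such collisions could be surfaced: build an inverted index from synonym text to variant IDs and flag any text that maps to more than one variant. The function name is hypothetical, and the synonym sets are cut down to the strings mentioned in this thread rather than the full lists; the assignment of 6174delT to 177049 follows the issue header.

```python
from collections import defaultdict
from typing import Dict, Set


def build_synonym_index(synonyms_by_variant: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each synonym string to the set of variant IDs that list it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for variant_id, synonyms in synonyms_by_variant.items():
        for synonym in synonyms:
            index[synonym].add(variant_id)
    return index


# Heavily abbreviated synonym lists, keyed by BRCA Exchange variant ID.
synonyms_by_variant = {
    "169240": {"5382insC"},
    "180141": {"5382insC"},
    "177049": {"6174delT"},
}

index = build_synonym_index(synonyms_by_variant)

# Texts that cannot be resolved to a single variant from the text alone.
ambiguous = {text: ids for text, ids in index.items() if len(ids) > 1}
```

Here `ambiguous` contains only `5382insC`, pointing at both 169240 and 180141, which is the situation described above.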

rcurrie commented 5 years ago

@melissacline Correct - #3 just looks for '5382insC' (the text that the crawler found that might be a variant) in any of the synonyms. As you mention, this can go way wrong, for example with '3G>A'. The damage is limited as we are only doing two genes, but expanding beyond that will make #3 untenable.

rcurrie commented 5 years ago

Plan ahead based on the discussion today in the BRCA Engineering call:

Keep the 1, 2, 3 approaches but emit a rank metric so that the UI can sort these with the best matches on top. This will specifically address the missing 185delAG papers while keeping the better exact matches at the top of the list.
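
One way the rank metric could look, as a sketch only: record which step produced each match and sort ascending so exact matches come first. The `rank` field name and the convention that a lower number is better are assumptions; the PMIDs and steps are taken from the example output earlier in this thread, grouped into one list purely for illustration.

```python
from typing import Dict, List


def sort_by_rank(matches: List[Dict]) -> List[Dict]:
    """Order matches so the most conservative step (lowest rank) is listed first."""
    # sorted() is stable, so matches with equal rank keep their crawl order.
    return sorted(matches, key=lambda match: match["rank"])


matches = [
    {"pmid": "14576434", "rank": 3},  # raw text matched a synonym
    {"pmid": "22430266", "rank": 1},  # exact normalized match
    {"pmid": "9042909", "rank": 2},   # normalized string matched a synonym
]

ranked = sort_by_rank(matches)
ranked_pmids = [match["pmid"] for match in ranked]
```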

rcurrie commented 5 years ago

The new approach of trying 1, then 2, and only resorting to 3 if the others don't work captures ALL of the above paper links except for one, https://www.ncbi.nlm.nih.gov/pubmed/28122244, which has F22-24/5382insC as an exponent in the abstract, so the crawler extracts /5382insC instead of 5382insC. Not going to try to sort that out and risk causing other problems. Moving forward with testing this on the full crawl and will report back. So for now @maximilianh @diekhans @melissacline pause :-)

melissacline commented 5 years ago

Whoo hoo!!!


rcurrie commented 5 years ago

Updated matching that leverages synonyms resolves 99%.