CeON / CERMINE

Content ExtRactor and MINEr
GNU Affero General Public License v3.0

CRF-based Affiliation parser fails with StackOverflowError on large input text #31

Closed: marekhorst closed this issue 8 years ago

marekhorst commented 8 years ago

It seems the StackOverflowError is thrown by the Mallet library, in the recursive TRP.lambdaPropagation call:

2016-09-16 00:56:09,493 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.StackOverflowError
    at java.lang.StringBuffer.append(StringBuffer.java:272)
    at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:487)
    at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:491)
    at edu.umass.cs.mallet.grmm.inference.TRP.lambdaPropagation(TRP.java:491)
    [...]

when providing large text input to CRFAffiliationParser#parse().

After several tests, it turned out that affiliation text exceeding roughly 8000-9000 characters causes this problem.

Here is an example causing StackOverflowError:

Affiliations of authors:Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, UK (QG, JT, AMD, MS, JEA, DFE, PDPP); Netherlands Cancer Institute, Antoni van Leeuwenhoek hospital, Amsterdam, the Netherlands (MKS, SC, AB, FBH); Department of Epidemiology, Harvard School of Public Health, Boston, MA (PK, SH, DJH, SL); Program in Genetic Epidemiology and Statistical Genetics, Department of Epidemiology, Harvard School of Public Health, Boston, MA (PK, CCh, DJH, SL); Department of Obstetrics and Gynecology, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland (SK, RF, TAM, HN); Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK (MKB, QW, JD, KM, ML, SK, DFE, PDPP); Department of Genetics, QIMR Berghofer Medical Research Institute, Brisbane, Australia (JBee, GCT); Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm 17177, Sweden (KC, HD, ME, JiL, JBr, KH, PH); Laboratory for Translational Genetics, Department of Oncology, University of Leuven, Leuven, Belgium (DL); Vesalius Research Center, VIB, Leuven, Belgium (DL); Oncology Department, University Hospital Gasthuisberg, Leuven, Belgium (CW, KL); Copenhagen General Population Study, Herlev Hospital, Copenhagen, Denmark (SEB, BGN, SFN); Department of Clinical Biochemistry, Herlev Hospital, Copenhagen University Hospital, Denmark (SEB, BGN, SFN); Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark (SEB, BGN); Department of Breast Surgery, Herlev Hospital, Copenhagen University Hospital, Denmark (HF); Division of Cancer Epidemiology, German Cancer Research Center (Deutsches Krebsforschungszentrum), Heidelberg, Germany (JCC, AR, PS, DC, AHü, RK, MB); Department of Cancer Epidemiology/Clinical Cancer Registry and Institute for Medical Biometrics and Epidemiology, University Clinic Hamburg-Eppendorf, Hamburg, Germany 
(DFJ); Department of Oncology, Helsinki University Central Hospital, Helsinki, Finland (CBl); Department of Clinical Genetics, Helsinki University Central Hospital, Helsinki, Finland (KA); Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN (FJC); Department of Health Sciences Research, Mayo Clinic, Rochester, MN (JEO, CV); Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada (ILA); Ontario Cancer Genetics Network, Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada (ILA, GG); Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada (JAK); Prosserman Centre for Health Research, Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada (JAK); Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada (AMM); Laboratory Medicine Program, University Health Network, Toronto, Ontario, Canada (AMM); Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA (CAH, BEH, FS); University of Hawaii Cancer Centre, Honolulu, HI (LLM); Centre for Epidemiology and Biostatistics, Melbourne School of Population Health, the University of Melbourne, Melbourne, Australia (JLH, CA, GGG, RLM); Genetic Epidemiology Laboratory, Department of Pathology, the University of Melbourne, Melbourne, Australia (HT, MCS); Sheffield Cancer Research Centre, Department of Oncology, University of Sheffield, Sheffield, UK (AC, MWRR); Academic Unit of Pathology, Department of Neuroscience, University of Sheffield, UK (SSC); Cancer Epidemiology Centre, Cancer Council Victoria, Melbourne, Australia (GGG, RLM); Anatomical Pathology, the Alfred Hospital, Melbourne, Australia (CM); Laboratory of Cancer Genetics and Tumor Biology, Department of Clinical Chemistry and Biocenter Oulu, University of Oulu, Oulu, Finland (RW); Laboratory of Cancer Genetics 
and Tumor Biology, Northern Finland Laboratory Centre NordLab, Oulu, Finland (KP); Department of Oncology, Oulu University Hospital, University of Oulu, Oulu, Finland (AJV); Department of Surgery, Oulu University Hospital, University of Oulu, Oulu, Finland (MG); Department of Medical Oncology, Family Cancer Clinic, Erasmus MC Cancer Institute, Rotterdam, the Netherlands (MJH, AHo, JWMM, AMWvdO); Department of Obstetrics and Gynecology, University of Heidelberg, Heidelberg, Germany (FM, AS, RY, BB); National Center for Tumor Diseases, University of Heidelberg, Heidelberg, Germany (FM, AS); Molecular Epidemiology Group, German Cancer Research Center, Heidelberg, Germany (BB); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD (JF, SJC); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD (JF); Core Genotyping Facility, Frederick National Laboratory for Cancer Research, Gaithersburg, MD (SJC); Department of Cancer Epidemiology and Prevention, M. Sklodowska-Curie Memorial Cancer Center & Institute of Oncology, Warsaw, Poland (JoL); Division of Cancer Studies, National Institute for Health Research, Comprehensive Biomedical Research Centre, Guy’s & St. 
Thomas’ NHS Foundation Trust in partnership with King’s College London, London, UK (EJS); Wellcome Trust Centre for Human Genetics and Oxford NIHR Biomedical Research Centre, University of Oxford, UK (IT); Clinical Science Institute, University Hospital Galway, Galway, Ireland (MJK, NM); Division of Clinical Epidemiology and Aging Research, German Cancer Research Center, Heidelberg, Germany (HB, AKD, VA); German Cancer Consortium (DKTK), Heidelberg, Germany (HB, AKD); Saarland Cancer Registry, Saarbrücken, Germany (BH); Imaging Center, Department of Clinical Pathology, Kuopio University Hospital, Kuopio, Finland (AM, VMK, JMH); School of Medicine, Institute of Clinical Medicine, Pathology and Forensic Medicine, University of Eastern Finland, Kuopio, Finland (AM, VMK, JMH); Biocenter Kuopio, Cancer Center of Eastern Finland, Kuopio University Hospital, Kuopio, Finland (VKa); School of Medicine, Institute of Clinical Medicine, Oncology, University of Eastern Finland, Kuopio, Finland (VKa); Department of Human Genetics & Department of Pathology, Leiden University Medical Center, 2300 RC Leiden, the Netherlands (PD); Department of Surgical Oncology, Leiden University Medical Center, 2300 RC Leiden, the Netherlands (RAEMT); Family Cancer Clinic, Department of Medical Oncology, Erasmus MC-Daniel den Hoed Cancer Centrer, Rotterdam, the Netherlands (CS); Unit of Molecular Bases of Genetic Risk and Genetic Testing, Department of Preventive and Predictive Medicine, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy (PR); IFOM, Fondazione Istituto FIRC di Oncologia Molecolare, Milan, Italy (PP, PM); Division of Cancer Prevention and Genetics, Istituto Europeo di Oncologia, Milan, Italy (BB); Cogentech Cancer Genetic Test Laboratory, Milan, Italy (PM); David Geffen School of Medicine, Department of Medicine, Division of Hematology and Oncology, University of California at Los Angeles, CA (PAF); Department of Gynecology and Obstetrics, University Hospital Erlangen, 
Friedrich-Alexander University Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany (PAF, MWB, AHe); Institute of Human Genetics; University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nuremberg, Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany (ABE); Western Sydney and Nepean Blue Mountains Local Health Districts, Westmead Millennium Institute for Medical Research, University of Sydney, Sydney, Australia (RB); Peter MacCallum Cancer Center, Melbourne, Australia (kConFab Investigators); the University of Melbourne, Melbourne, Australia (KAP); Division of Cancer Medicine, Peter MacCallum Cancer Centre, Melbourne, Australia (KAP); Centro de Investigación en Red de Enfermedades Raras, Valencia, Spain (JBen); Human Genetics Group, Human Cancer Genetics Program, Spanish National Cancer Research Centre, Madrid, Spain (JBen); Servicio de Oncología Médica, Hospital Universitario La Paz, Madrid, Spain (MPZ); Servicio de Cirugía General y Especialidades, Hospital Monte Naranco, Oviedo, Spain (JIAP); Servicio de Anatomía Patológica, Hospital Monte Naranco, Oviedo, Spain (PM); Department of Genetics and Pathology, Pomeranian Medical University, Szczecin, Poland (AJ, JL, KJB, KD); Molecular Genetics of Breast Cancer, German Cancer Research Center, Heidelberg, Germany (UH, MK); Frauenklinik der Stadtklinik Baden-Baden, Baden-Baden, Germany (HUU); Institute of Pathology, Städtisches Klinikum Karlsruhe, Karlsruhe, Germany (TR); Department of Oncology - Pathology, Karolinska Institutet, Stockholm, Sweden (SM); Department of Genetics, Institute for Cancer Research, Oslo University Hospital, Radiumhospitalet, Oslo, Norway (VKr, SN); Faculty of Medicine (Faculty Division Ahus), University of Oslo, Norway (VKr, SN); Genomic Medicine, Manchester Academic Health Science Centre, University of Manchester, Central Manchester Foundation Trust, St. 
Mary’s Hospital, Manchester, UK (DGE); Cambridge Breast Research Unit and NIHR Cambridge Biomedical Research Centre, University of Cambridge, Department of Oncology, Cambridge, UK (JEA, HME, CCa); Cambridge Experimental Cancer Medicine Centre, Cambridge, UK (JEA, HME, CCa); Warwick Clinical Trials Unit, University of Warwick, UK (LH, JAD); Cancer Research UK Clinical Trials Unit, Institute for Cancer Studies, the University of Birmingham, Edgbaston, Birmingham, UK (SB); Early Detection Research Group, Division of Cancer Prevention National Cancer Institute Bethesda, MD (CBe); Department of Biology, University of Pisa, Pisa, Italy (DC); Epidemiology Research Program, American Cancer Society, Atlanta, GA (WRD, SMG, MMG); Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, MA (SH); Division of Biostatistics and Epidemiology, University of Massachusetts-Amherst School of Public Health and Health Sciences, Amherst, MA (SH); Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD (RNH, MJM); Department of Nutrition, Harvard School of Public Health, Boston, MA (WW); Genomic Epidemiology Group, German Cancer Research Center, Heidelberg, Germany (FC); Breast Cancer Functional Genomics Laboratory, Cancer Research UK Cambridge Institute, University of Cambridge, Li Ka Shing Centre, UK (SFC, CCa); Breakthrough Breast Cancer Research Centre, Division of Breast Cancer Research, the Institute of Cancer Research, London, UK (MGC); Division of Genetics and Epidemiology, Institute of Cancer Research, Sutton, Surrey, UK (MGC, NR); Faculty of Medicine, University of Southampton, UK (DME).

This text was retrieved by the IIS PMC parser from one of the PMC XML resources.

I've just created a corresponding issue in IIS, https://github.com/openaire/iis/issues/663, to work around this problem on that side.

dtkaczyk commented 8 years ago

Fixed in commit https://github.com/CeON/CERMINE/commit/edf76ca2a4ad074d84593ac650646e5b2c2f9f9d by setting a maximum affiliation length for the parser.
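
For readers who want to apply the same kind of guard on the caller side, here is a minimal sketch of a length check placed in front of a parser. It is not the code from the commit above: the class name `AffiliationLengthGuard`, the method names, and the limit of 3000 characters are all hypothetical, and the limit is only an assumption chosen to stay well below the 8000-9000 character range reported in this issue. The parser is passed in as a plain `Function`, so the sketch does not depend on any CERMINE or Mallet API.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical caller-side guard; not CERMINE's actual implementation. */
public final class AffiliationLengthGuard {

    /** Assumed limit, picked to stay below the 8000-9000 characters reported in the issue. */
    public static final int MAX_AFFILIATION_LENGTH = 3000;

    private AffiliationLengthGuard() {
    }

    /** Returns true when the text is non-null and no longer than maxLength characters. */
    public static boolean isWithinLimit(String text, int maxLength) {
        return text != null && text.length() <= maxLength;
    }

    /**
     * Runs the given parser only when the text is within the limit.
     * Over-long (or null) input is skipped and an empty Optional is returned,
     * so the recursive inference code is never reached for such input.
     */
    public static <T> Optional<T> parseIfShortEnough(
            String text, int maxLength, Function<String, T> parser) {
        if (!isWithinLimit(text, maxLength)) {
            return Optional.empty();
        }
        return Optional.ofNullable(parser.apply(text));
    }

    public static void main(String[] args) {
        // Stand-in parser for the demo: it only upper-cases the input.
        Function<String, String> fakeParser = String::toUpperCase;

        String shortText = "Department of Biology, University of Pisa, Pisa, Italy";
        String longText = new String(new char[9000]).replace('\0', 'x');

        System.out.println(
                parseIfShortEnough(shortText, MAX_AFFILIATION_LENGTH, fakeParser).isPresent());
        System.out.println(
                parseIfShortEnough(longText, MAX_AFFILIATION_LENGTH, fakeParser).isPresent());
    }
}
```

Skipping over-long input means such affiliation strings are left unparsed rather than split, which seems acceptable when the input is really a concatenation of dozens of affiliations, as in the example above. Raising the JVM thread stack size would be another option, but it only moves the threshold.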