JULIELab / gepi

GePI (GEne - Protein Interactions) is a web portal for quick and convenient access to gene - protein interaction mentions automatically extracted from the biomedical literature, i.e. PubMed and PubMed Central (Open Access Subset).
GNU General Public License v3.0

Problematic Medline Document: 23700993 #10

Closed: fmatthies closed this issue 6 years ago

fmatthies commented 7 years ago
 <MedlineCitation Status="MEDLINE" Owner="NLM">
        <PMID Version="1">23700993</PMID>
        <DateCreated>
            <Year>2013</Year>
            <Month>08</Month>
            <Day>06</Day>
        </DateCreated>
        <DateCompleted>
            <Year>2014</Year>
            <Month>02</Month>
            <Day>26</Day>
        </DateCompleted>
        <DateRevised>
            <Year>2013</Year>
            <Month>08</Month>
            <Day>06</Day>
        </DateRevised>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Electronic">1875-6697</ISSN>
                <JournalIssue CitedMedium="Internet">
                    <Volume>9</Volume>
                    <Issue>2</Issue>
                    <PubDate>
                        <Year>2013</Year>
                        <Month>Jun</Month>
                    </PubDate>
                </JournalIssue>
                <Title>Current computer-aided drug design</Title>
                <ISOAbbreviation>Curr Comput Aided Drug Des</ISOAbbreviation>
            </Journal>
            <ArticleTitle>Molecular design and QSARs/QSPRs with molecular descriptors family.</ArticleTitle>
            <Pagination>
                <MedlinePgn>195-205</MedlinePgn>
            </Pagination>
            <Abstract>
                <AbstractText>The aim of the present paper is to present the methodology of the molecular descriptors family (MDF) as an integrative tool in molecular modeling and its abilities as a multivariate QSAR/QSPR modeling tool. An algorithm for extracting useful information from the topological and geometrical representation of chemical compounds was developed and integrated to calculate MDF members. The MDF methodology was implemented and the software is available online (http://l.academicdirect.org/Chemistry/SARs/MDF_SARs/). This integrative tool was developed in order to maximize performance, functionality, efficiency and portability. The MDF methodology is able to provide reliable and valid multiple linear regression models. Furthermore, in many cases, the MDF models were better than the published results in the literature in terms of correlation coefficients (statistically significant Steiger's Z test at a significance level of 5%) and/or in terms of values of information criteria and Kubinyi function. The MDF methodology developed and implemented as a platform for investigating and characterizing quantitative relationships between the chemical structure and the activity/property of active compounds was used on more than 50 study cases. In almost all cases, the methodology allowed obtaining of QSAR/QSPR models improved in explanatory power of structure-activity and structure-property relationships. The algorithms applied in the computation of geometric and topological descriptors (useful in modeling physicochemical or biological properties of molecules) and those used in searching for reliable and valid multiple linear regression models certain enrich the pool of low-cost low-time drug design tools.</AbstractText>
            </Abstract>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Bolboacă</LastName>
                    <ForeName>Sorana D</ForeName>
                    <Initials>SD</Initials>
                    <AffiliationInfo>
                        <Affiliation>Department of Medical Informatics and Biostatistics, Iuliu Ha􀀅ieganu University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj, Romania.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Jäntschi</LastName>
                    <ForeName>Lorentz</ForeName>
                    <Initials>L</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Diudea</LastName>
                    <ForeName>Mircea V</ForeName>
                    <Initials>MV</Initials>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <PublicationTypeList>
                <PublicationType UI="D016428">Journal Article</PublicationType>
                <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
            </PublicationTypeList>
        </Article>
        <MedlineJournalInfo>
            <Country>United Arab Emirates</Country>
            <MedlineTA>Curr Comput Aided Drug Des</MedlineTA>
            <NlmUniqueID>101265750</NlmUniqueID>
            <ISSNLinking>1573-4099</ISSNLinking>
        </MedlineJournalInfo>
        <CitationSubset>IM</CitationSubset>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName UI="D000465" MajorTopicYN="N">Algorithms</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D015195" MajorTopicYN="N">Drug Design</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D021281" MajorTopicYN="Y">Quantitative Structure-Activity Relationship</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName UI="D012984" MajorTopicYN="N">Software</DescriptorName>
            </MeshHeading>
        </MeshHeadingList>
    </MedlineCitation>
    <PubmedData>
        <History>
            <PubMedPubDate PubStatus="received">
                <Year>2013</Year>
                <Month>03</Month>
                <Day>10</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="revised">
                <Year>2012</Year>
                <Month>10</Month>
                <Day>26</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="accepted">
                <Year>2013</Year>
                <Month>04</Month>
                <Day>27</Day>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="entrez">
                <Year>2013</Year>
                <Month>5</Month>
                <Day>25</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="pubmed">
                <Year>2013</Year>
                <Month>5</Month>
                <Day>25</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
            <PubMedPubDate PubStatus="medline">
                <Year>2014</Year>
                <Month>2</Month>
                <Day>27</Day>
                <Hour>6</Hour>
                <Minute>0</Minute>
            </PubMedPubDate>
        </History>
        <PublicationStatus>ppublish</PublicationStatus>
        <ArticleIdList>
            <ArticleId IdType="pubmed">23700993</ArticleId>
            <ArticleId IdType="pii">CCADD-EPUB-20130514-4</ArticleId>
        </ArticleIdList>
    </PubmedData>
fmatthies commented 7 years ago
com.ximpleware.ParseException: Error in text content: Invalid char in text content Line Number: 45 Offset: 89
        at com.ximpleware.VTDGen.handleOtherTextChar(VTDGen.java:5160) ~[vtd-xml-2.11.jar:na]
        at com.ximpleware.VTDGen.parse(VTDGen.java:2474) ~[vtd-xml-2.11.jar:na]
        at de.julielab.jcore.reader.xmlmapper.mapper.XMLMapper.parse(XMLMapper.java:95) ~[jcore-xml-mapper-2.2.0.jar:na]
        at de.julielab.jules.reader.DBMedlineReader.getNext(DBMedlineReader.java:199) [jules-medline-reader-3.0.2.jar:na]
        at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.readNext(ArtifactProducer.java:494) [uimaj-cpe-2.5.0.jar:2.5.0]
        at org.apache.uima.collection.impl.cpm.engine.ArtifactProducer.run(ArtifactProducer.java:711) [uimaj-cpe-2.5.0.jar:2.5.0]
Department of Medical Informatics and Biostatistics, Iuliu Ha􀀅ieganu University of Medicine and Pharmacy Cluj-Napoca, 6 Louis Pasteur, 400349 Cluj, Romania.
khituras commented 7 years ago

The issue is as follows: We use VTD-XML to import Medline XML into the database. Since VTD cannot deal with Unicode supplementary characters properly, the resulting XML contains invalid characters (control sequences or whatever). What happens is that the supplementary characters, which start at code point U+10000 and thus need more than 16 bits, are represented via surrogate pairs. Each surrogate has 16 bits. VTD only uses the lower 16 bits. This is why the above error is thrown when trying to parse such corrupted character streams from the database. This means we have to resolve this issue BEFORE importing into the database. A solution could be to replace such characters with a placeholder like ###UNICODE_SUPP_CHAR_XXX###, with XXX being some kind of unique identifier such as a counter. Then, a file could be written that maps the unique identifiers to the original supplementary characters. After all work with VTD is done, the placeholders would be replaced by the original characters.
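A rough sketch of the placeholder idea, just to make it concrete. This is not code from julie-medline-manager: the class name `SupplementaryPlaceholders` and the methods `mask` and `unmask` are made up for illustration, and the mapping is kept in an in-memory map instead of the file mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any JULIE Lab artifact.
final class SupplementaryPlaceholders {
    // Placeholder format as proposed in the comment above.
    static final String PREFIX = "###UNICODE_SUPP_CHAR_";
    static final String SUFFIX = "###";

    /**
     * Replaces every code point above U+FFFF with a numbered placeholder
     * and records placeholder id -> original character in the mapping.
     */
    static String mask(String text, Map<Integer, String> mapping) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                int id = mapping.size(); // simple counter as unique identifier
                mapping.put(id, new String(Character.toChars(cp)));
                out.append(PREFIX).append(id).append(SUFFIX);
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 2 chars for a surrogate pair, else 1
        }
        return out.toString();
    }

    /** Puts the original characters back once all work with VTD is done. */
    static String unmask(String text, Map<Integer, String> mapping) {
        String result = text;
        for (Map.Entry<Integer, String> e : mapping.entrySet()) {
            result = result.replace(PREFIX + e.getKey() + SUFFIX, e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        String original = "A" + new String(Character.toChars(0x10000)) + "B";
        String masked = mask(original, mapping);
        System.out.println(masked);
        System.out.println(unmask(masked, mapping).equals(original));
    }
}
```

The masked text contains only characters from the Basic Multilingual Plane, so it should pass through a parser that handles just 16 bits per character. The mapping would have to be persisted alongside the document for the restore step to work later.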

khituras commented 7 years ago

We currently have an internal VTD-XML version which I put together following the instructions of the VTD-XML author. After doing this, the Unicode JUnit test written to demonstrate the wrong behavior passed. I built a new version of the julie-medline-manager using this version of VTD-XML and started importing Medline XML from scratch. All pipelines should be updated to this version:

For julie-medline-manager:

<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>julie-medline-manager</artifactId>
    <version>1.1.0-SNAPSHOT</version>
</dependency>

If the julie-xml-tools are directly used, then:

<dependency>
    <groupId>de.julielab</groupId>
    <artifactId>julie-xml-tools</artifactId>
    <version>0.3.2-SNAPSHOT</version>
</dependency>

We then have to manually check whether everything is fine for the documents in question.
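To support that check, something like the following could flag documents that still contain broken characters. It is only a sketch: the class `XmlCharCheck` and its methods are hypothetical, and it assumes the XML of a document is available as a Java String. The character ranges are those of the XML 1.0 `Char` production.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of julie-xml-tools.
final class XmlCharCheck {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char offsets of all code points that are not legal in XML 1.0. */
    static List<Integer> invalidOffsets(String xml) {
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < xml.length()) {
            int cp = xml.codePointAt(i);
            if (!isXmlChar(cp)) {
                offsets.add(i); // also catches unpaired surrogates
            }
            i += Character.charCount(cp);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Keeping only the low 16 bits of U+10005 yields the control char U+0005.
        String truncated = "Ha" + (char) 0x10005 + "ieganu";
        System.out.println(invalidOffsets(truncated));
    }
}
```

An empty list for a re-imported document suggests that its supplementary characters survived the import. A non-empty list gives the offsets to inspect by hand.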

SchSascha commented 6 years ago

@khituras What is the status here?

khituras commented 6 years ago

The newest version of VTD has the fix included. This issue is fixed.