cannin / ihop-reach

A web application to access biological data extracted from biomedical literature.
https://reach.nrnb-docker.ucsd.edu
GNU Lesser General Public License v3.0

Absence of Year in PubMed XML files #43

Open RohitChattopadhyay opened 5 years ago

RohitChattopadhyay commented 5 years ago

Related to #41

The date of publication for the great majority of records will reside in the separate date-related elements within \<PubDate> as shown above and in these cases the record will not contain \<MedlineDate>. The date of publication of the article will be found in \<MedlineDate> when parsing for the separate fields is not possible; i.e., cases where dates do not fit the Year, Month, or Day pattern.

The above passage is taken from the PubMed XML Schema Definition.

The statement suggests that the XML files will carry the publication date in the \<MedlineDate> tag when the date cannot be parsed into separate Year, Month, and Day fields.

The following Gist shows example XML files:

  1. \<PubDate> having separate date-related elements
  2. \<PubDate> having \<MedlineDate>

Some examples of MedlineDate content:

1998 Dec-1999 Jan
2000 Spring
2000 Spring-Summer
2000 Nov-Dec
2000 Dec 23- 30
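
As a rough illustration of the two cases, the sketch below reads a \<PubDate> element with Python's standard `xml.etree.ElementTree` and returns the \<Year> when it is present, otherwise the leading four digits of \<MedlineDate>. The helper name `pubdate_year` and the "leading four digits" rule are assumptions made for this example; they are not part of the PubMed schema or of this project's code.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a MedlineDate starts with a four-digit year, as in the examples above.
_LEADING_YEAR = re.compile(r"^\s*(\d{4})")

def pubdate_year(pubdate):
    """Return the publication year from a <PubDate> element, or None."""
    # Case 1: separate date-related elements, so <Year> can be read directly.
    year = pubdate.findtext("Year")
    if year:
        return int(year)
    # Case 2: free-form <MedlineDate>; take only the leading year.
    medline = pubdate.findtext("MedlineDate")
    if medline:
        match = _LEADING_YEAR.match(medline)
        if match:
            return int(match.group(1))
    return None

structured = ET.fromstring("<PubDate><Year>2000</Year><Month>Nov</Month></PubDate>")
freeform = ET.fromstring("<PubDate><MedlineDate>1998 Dec-1999 Jan</MedlineDate></PubDate>")
print(pubdate_year(structured))  # 2000
print(pubdate_year(freeform))    # 1998
```

For a range such as `1998 Dec-1999 Jan` this keeps only the first year, which is a simplification.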

cannin commented 5 years ago

Can you switch to extracting the year from:

<PubmedData>
<PubMedPubDate PubStatus="pubmed">
            <Year>1947</Year>

Extracting the MedlineDate will mean parsing the date string. We want to avoid this.
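
A minimal sketch of that lookup, assuming each record is parsed with Python's standard `xml.etree.ElementTree`. The function name `pubmed_status_year` and the sample record are hypothetical, and the descendant search is used so that the exact nesting below `<PubmedData>` does not have to be assumed.

```python
import xml.etree.ElementTree as ET

def pubmed_status_year(pubmed_article):
    """Return the Year of the PubMedPubDate with PubStatus="pubmed", or None."""
    # .// searches all descendants, so the element is found wherever it is nested.
    text = pubmed_article.findtext(".//PubMedPubDate[@PubStatus='pubmed']/Year")
    if text and text.strip().isdigit():
        return int(text)
    return None

# Hypothetical minimal record, reduced to the elements relevant here.
record = ET.fromstring("""
<PubmedArticle>
  <PubmedData>
    <History>
      <PubMedPubDate PubStatus="pubmed">
        <Year>1947</Year>
      </PubMedPubDate>
    </History>
  </PubmedData>
</PubmedArticle>
""")
print(pubmed_status_year(record))  # 1947
```

Because the year is already a separate element, no date-string parsing is involved.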

RohitChattopadhyay commented 5 years ago

The PR https://github.com/sorgerlab/indra/pull/902 solves the problem of inconsistent dates.

RohitChattopadhyay commented 5 years ago

<PubMedPubDate PubStatus="pubmed"> is absent in some records of the file pubmed19n0972.xml. FTP link for the file: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed19n0972.xml.gz
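
One way to list the affected records is sketched below, assuming the file is streamed with `xml.etree.ElementTree.iterparse`. The function name `pmids_missing_pubmed_date` and the two-record sample are made up for illustration; the sample is not taken from pubmed19n0972.xml.

```python
import io
import xml.etree.ElementTree as ET

def pmids_missing_pubmed_date(source):
    """Yield PMIDs of records without <PubMedPubDate PubStatus="pubmed">."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        if elem.find(".//PubMedPubDate[@PubStatus='pubmed']") is None:
            yield elem.findtext("MedlineCitation/PMID")
        elem.clear()  # drop the finished record's children to limit memory use

# Hypothetical sample: the second record has no PubStatus="pubmed" entry.
sample = io.BytesIO(b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>1</PMID></MedlineCitation>
    <PubmedData><History>
      <PubMedPubDate PubStatus="pubmed"><Year>1947</Year></PubMedPubDate>
    </History></PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>2</PMID></MedlineCitation>
    <PubmedData><History>
      <PubMedPubDate PubStatus="entrez"><Year>1947</Year></PubMedPubDate>
    </History></PubmedData>
  </PubmedArticle>
</PubmedArticleSet>""")
print(list(pmids_missing_pubmed_date(sample)))  # ['2']
```

After downloading the baseline file, the same function could be given `gzip.open("pubmed19n0972.xml.gz", "rb")` as its source.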