fetch study metadata

cmungall commented 4 years ago

Study metadata often has richer text that can be used for NLP.

wdduncan commented 4 years ago

@cmungall should we merge this ticket with #7? I am already generating a lot of data for that ticket.

wdduncan commented 4 years ago

Merging this into #7 and closing.

cmungall commented 4 years ago

@wdduncan can you assign @hrshdhgd?

cmungall commented 4 years ago

https://ftp.ncbi.nlm.nih.gov/bioproject/bioproject.xml

The project DB has info on all studies. It also links to samples, e.g.:

    <LocusTagPrefix biosample_id="SAMN11044051">E0Y81</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044052">E0Y82</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044053">E0Y83</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044054">E0Y84</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044055">E0Y85</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044056">E0Y86</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044057">E0Y87</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044058">E0Y88</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044059">E0Y89</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044060">E0Y90</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044061">E0Y91</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044062">E0Y92</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044063">E0Y93</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044064">E0Y94</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044065">E0Y95</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044066">E0Y96</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044067">E0Y97</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044068">E0Y98</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044069">E0Y99</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044070">E0Z00</LocusTagPrefix>
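
A minimal sketch of how those study-to-sample links could be pulled out with Python's standard `xml.etree.ElementTree`. The `<Fragment>` wrapper, the `biosample_links` helper, and the third row without a `biosample_id` are assumptions made for illustration; only the `LocusTagPrefix` element and its `biosample_id` attribute come from the excerpt above.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment. The <Fragment> wrapper and the row without a
# biosample_id are hypothetical; the other rows are copied from the excerpt.
SNIPPET = """
<Fragment>
    <LocusTagPrefix biosample_id="SAMN11044051">E0Y81</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044052">E0Y82</LocusTagPrefix>
    <LocusTagPrefix>NOLINK</LocusTagPrefix>
</Fragment>
"""


def biosample_links(xml_text):
    """Return (biosample_id, locus_tag_prefix) pairs found in xml_text.

    Elements that carry no biosample_id attribute are skipped.
    """
    root = ET.fromstring(xml_text)
    return [
        (el.get("biosample_id"), (el.text or "").strip())
        for el in root.iter("LocusTagPrefix")
        if el.get("biosample_id")
    ]


links = biosample_links(SNIPPET)
for biosample_id, prefix in links:
    print(biosample_id, prefix)
```

Searching with `iter` rather than a fixed path avoids assuming where `LocusTagPrefix` sits inside a project record.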

cmungall commented 4 years ago

Example text to be mined:

<Package>
  <Project>
    <Project>
      <ProjectID>
        <ArchiveID accession="PRJNA13694" archive="NCBI" id="13694"/>
      </ProjectID>
      <ProjectDescr>
        <Name>marine metagenome</Name>
        <Title>Metagenomic analysis of marine microbes isolated during the Global Ocean Sampling Expedition</Title>
        <Description>A broad objective of the Global Ocean Sampling (GOS) Expedition is to assess the genetic diversity in marine microbial communities and understand their role in fundamental processes in nature. Marine microbes influence the cycling of carbon (and other elements) in the world's oceans, acting as a biological conduit that transports carbon dioxide from the surface to the deep oceanic realms. By sequestering carbon from the atmosphere, marine microorganisms (eukaryotes, prokaryotes and viruses) may significantly affect global climate. However, we know little about the physiological processes and complex interactions of communities that impact global carbon cycles and ocean productivity, and our attempts to study their activities are limited by our inability to culture the vast majority of them. These uncultured marine microorganisms are also a rich repository of novel genes and molecular structures that have potential in the development of biocatalysts for industrial and medical applications.
&lt;p&gt;
One avenue of exploration is to sequence the genomes of marine microbes using a metagenomics approach. In 2003, the J. Craig Venter Institute conducted a whole environment shotgun sequencing project to study marine microorganisms in the nutrient-poor Sargasso Sea near Bermuda. This study revealed a remarkable breadth and depth of microbial diversity - about 1,800 different prokaryotic species encoding over 1.2 million genes were discovered. Notably, this study expanded our knowledge of ocean photobiology, microbial diversity and evolution. Results from the pilot study were reported in Science in 2004.
&lt;p&gt;
This pilot study served as the springboard for launching a more comprehensive survey of the bacterial, archaeal and viral diversity of the world's oceans. A global circumnavigation aboard the Sorcerer II sailing yacht began in August 2003, starting in Halifax, Canada and samples were collected at sites along the U.S. east coast, Gulf of Mexico, Galapagos Islands, central and south Pacific Oceans, Australia, Indian Ocean, South Africa, across the Atlantic back to the U.S., and was completed in January 2006. An initial analysis of the microbial sequence data from the first leg of the trip - Halifax to the Galapagos Islands was reported in a special issue of PLoS Biology on Ocean Meganomics in March 2007 (see &lt;a href="http://collections.plos.org/plosbiology/gos-2007"&gt;http://collections.plos.org/plosbiology/gos-2007&lt;/a&gt;). Additional data from the Indian Ocean was released in March 2008.  Shotgun sequencing and deep sequencing of 16S and 18S rRNA is currently underway on additional samples.
&lt;p&gt;
Collectively these studies have produced the largest catalogue of genes to date from thousands of new species, with no apparent slowing of the rate of discovery of novel gene families. These data have potentially far-reaching implications for biological energy production, bioremediation, and creating solutions for reduction/management of greenhouse gas levels in our biosphere. The complete set of data and bioinformatic analysis tools from the &lt;a href="http://web.camera.calit2.net/cameraweb/gwt/org.jcvi.camera.web.gwt.download.BrowseProjectsPage/BrowseProjectsPage.oa?projectSymbol=CAM_PROJ_GOS"&gt;GOS project&lt;/a&gt; is available through the &lt;a href="http://camera.calit2.net/"&gt;CAMERA&lt;/a&gt; metagenomics repository.  These studies have been supported by The Department of Energy, The Gordon and Betty Moore Foundation, and the J. Craig Venter Institute.

&lt;p&gt;
The WGS project and sequences deposited into the Trace Archive can be found using the Project data link.</Description>
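
The description above embeds escaped HTML (`&lt;p&gt;`, `&lt;a href=...&gt;`), which an XML parser hands back as literal `<p>` and `<a>` markup. Below is one possible way to reduce that to plain paragraphs before NLP, using only the standard library. The names `_ParagraphText` and `clean_description` are hypothetical, the sample string is a shortened stand-in rather than the real record, and the sketch assumes the text has already been unescaped once by the XML parser.

```python
import re
from html.parser import HTMLParser


class _ParagraphText(HTMLParser):
    """Collect text content, starting a new paragraph at every <p> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = [[]]

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs.append([])

    def handle_data(self, data):
        # Other tags (e.g. <a>) are dropped; only their inner text is kept.
        self.paragraphs[-1].append(data)


def clean_description(text):
    """Return a list of plain-text paragraphs with whitespace collapsed."""
    parser = _ParagraphText()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    joined = ("".join(chunks) for chunks in parser.paragraphs)
    collapsed = (re.sub(r"\s+", " ", p).strip() for p in joined)
    return [p for p in collapsed if p]


# Shortened, illustrative input in the same shape as the Description text.
raw = (
    "Marine microbes influence the cycling of carbon.\n"
    "<p>\n"
    'See <a href="http://collections.plos.org/plosbiology/gos-2007">'
    "the collection</a>  for details."
)
paragraphs = clean_description(raw)
print(paragraphs)
```

Link targets are discarded here; if the URLs matter for mining, `handle_starttag` could record the `href` values from `attrs` instead.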

hrshdhgd commented 4 years ago

Here is my first stab at parsing the study description XML and writing a TSV file. The file has 5 columns, namely:

['StudyId', 'Name', 'Title', 'Description', 'BiosampleId'].

The 'Description' column will feed the NLP pipeline to get us potential supplemental information.
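
For readers who want to see the general shape of such a conversion, here is a hedged sketch; it is not the code referenced in this comment. It streams `Package` records with `ElementTree.iterparse` and writes the five columns listed above. The `PackageSet` root, the helper names, the choice to join multiple biosample IDs with `|`, and the sample record (which combines values quoted in this thread and is not a real BioProject entry) are all assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

COLUMNS = ["StudyId", "Name", "Title", "Description", "BiosampleId"]

# Made-up record for illustration only; the root element name is assumed.
SAMPLE = b"""<PackageSet>
  <Package>
    <Project>
      <Project>
        <ProjectID>
          <ArchiveID accession="PRJNA13694" archive="NCBI" id="13694"/>
        </ProjectID>
        <ProjectDescr>
          <Name>marine metagenome</Name>
          <Title>Metagenomic analysis of marine microbes</Title>
          <Description>A broad objective of the
GOS Expedition.</Description>
          <LocusTagPrefix biosample_id="SAMN11044051">E0Y81</LocusTagPrefix>
        </ProjectDescr>
      </Project>
    </Project>
  </Package>
</PackageSet>"""


def one_line(text):
    """Collapse tabs and newlines so a value fits in a single TSV cell."""
    return " ".join((text or "").split())


def package_rows(source):
    """Yield one dict per <Package>, reading the XML incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Package":
            continue
        archive = elem.find(".//ProjectID/ArchiveID")
        descr = elem.find(".//ProjectDescr")
        if archive is not None and descr is not None:
            samples = sorted(
                {el.get("biosample_id") for el in elem.iter("LocusTagPrefix")
                 if el.get("biosample_id")}
            )
            yield {
                "StudyId": archive.get("accession", ""),
                "Name": one_line(descr.findtext("Name")),
                "Title": one_line(descr.findtext("Title")),
                "Description": one_line(descr.findtext("Description")),
                "BiosampleId": "|".join(samples),  # assumed join convention
            }
        elem.clear()  # drop the finished record's children to limit memory use


def write_tsv(rows, out):
    """Write rows to a text file object as tab-separated values."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


rows = list(package_rows(io.BytesIO(SAMPLE)))
buffer = io.StringIO()
write_tsv(rows, buffer)
print(buffer.getvalue())
```

Streaming with `iterparse` is chosen because the full project dump is large; for a small extract, `ET.parse` followed by `findall` would be simpler.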

INCATools / biosample-analysis

fetch study metadata #1