NAL-i5K / tripal_eutils

ncbi loader via the eutils interface
GNU General Public License v3.0

do we need an SRA importer? #207

Open bradfordcondon opened 5 years ago

bradfordcondon commented 5 years ago

Casey was loading PRJDB4532. No BioSample is loaded because it's not linked to the project in the returned XML.

Maybe we want to be able to load the SRA experiment? And by loading that you'd load the linked project and biosample?

Here's the XML:


<?xml version="1.0" ?>
<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <EXPERIMENT alias="DRX049157" center_name="NIFTS" accession="DRX049157">
      <IDENTIFIERS>
        <PRIMARY_ID>DRX049157</PRIMARY_ID>
      </IDENTIFIERS>
      <TITLE>454 GS FLX+ sequencing of SAMD00046318</TITLE>
      <STUDY_REF refname="DRP003980" refcenter="NIFTS" accession="DRP003980">
        <IDENTIFIERS>
          <PRIMARY_ID>DRP003980</PRIMARY_ID>
          <EXTERNAL_ID namespace="BioProject" label="BioProject ID">PRJDB4532</EXTERNAL_ID>
        </IDENTIFIERS>
      </STUDY_REF>
      <DESIGN><DESIGN_DESCRIPTION/>
        <SAMPLE_DESCRIPTOR refname="DRS057276" refcenter="NIFTS" accession="DRS057276">
          <IDENTIFIERS>
            <PRIMARY_ID>DRS057276</PRIMARY_ID>
            <EXTERNAL_ID namespace="BioSample" label="BioSample ID">SAMD00046318</EXTERNAL_ID>
          </IDENTIFIERS>
        </SAMPLE_DESCRIPTOR>
        <LIBRARY_DESCRIPTOR><LIBRARY_NAME/>
          <LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
          <LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
          <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
          <LIBRARY_LAYOUT><SINGLE/></LIBRARY_LAYOUT><LIBRARY_CONSTRUCTION_PROTOCOL/></LIBRARY_DESCRIPTOR>
        <SPOT_DESCRIPTOR>
          <SPOT_DECODE_SPEC>
            <SPOT_LENGTH>677</SPOT_LENGTH>
            <READ_SPEC>
              <READ_INDEX>0</READ_INDEX>
              <READ_CLASS>Application Read</READ_CLASS>
              <READ_TYPE>Forward</READ_TYPE>
              <BASE_COORD>1</BASE_COORD>
            </READ_SPEC>
          </SPOT_DECODE_SPEC>
        </SPOT_DESCRIPTOR>
      </DESIGN>
      <PLATFORM>
        <LS454>
          <INSTRUMENT_MODEL>454 GS FLX+</INSTRUMENT_MODEL>
        </LS454>
      </PLATFORM>
    </EXPERIMENT>
    <SUBMISSION lab_name="Genome Unit, NARO Institute of Fruit Tree Science" alias="DRA004360" center_name="NIFTS" accession="DRA004360">
      <IDENTIFIERS>
        <PRIMARY_ID>DRA004360</PRIMARY_ID>
      </IDENTIFIERS>
    </SUBMISSION>
    <Organization type="center">
      <Name abbr="NIFTS">NIFTS</Name>
    </Organization>
    <STUDY center_name="NIFTS" alias="DRP003980" accession="DRP003980">
      <IDENTIFIERS>
        <PRIMARY_ID>DRP003980</PRIMARY_ID>
        <EXTERNAL_ID namespace="BioProject" label="primary">PRJDB4532</EXTERNAL_ID>
      </IDENTIFIERS>
      <DESCRIPTOR>
        <STUDY_TITLE>Genome sequencing of mango (Mangifera indica) cultivar ''Irwin''</STUDY_TITLE><STUDY_TYPE existing_study_type="Whole Genome Sequencing"/>
        <STUDY_ABSTRACT>This genome was sequenced to search and construct mango genomic DNA markers. Cultivar ''Irwin'' is leading cultivar in Japan.</STUDY_ABSTRACT>
      </DESCRIPTOR>
    </STUDY>
    <SAMPLE alias="SAMD00046318" accession="DRS057276">
      <IDENTIFIERS>
        <PRIMARY_ID>DRS057276</PRIMARY_ID>
        <EXTERNAL_ID namespace="BioSample">SAMD00046318</EXTERNAL_ID>
      </IDENTIFIERS>
      <TITLE>Irwin</TITLE>
      <SAMPLE_NAME>
        <TAXON_ID>29780</TAXON_ID>
        <SCIENTIFIC_NAME>Mangifera indica</SCIENTIFIC_NAME>
      </SAMPLE_NAME>
      <SAMPLE_ATTRIBUTES>
        <SAMPLE_ATTRIBUTE>
          <TAG>sample_name</TAG>
          <VALUE>HXXQCLF01</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>cultivar</TAG>
          <VALUE>Irwin</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>biomaterial_provider</TAG>
          <VALUE>Okinawa Prefectural Agricultural Research Center</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>collection_date</TAG>
          <VALUE>2013</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>env_biome</TAG>
          <VALUE>subtropical</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>env_feature</TAG>
          <VALUE>farm</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>env_material</TAG>
          <VALUE>soil</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>geo_loc_name</TAG>
          <VALUE>Japan</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>lat_lon</TAG>
          <VALUE>26.1108 N 127.6861 E</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>project_name</TAG>
          <VALUE>DNA marker identification from DNA sequences</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>isol_growth_condt</TAG>
          <VALUE>23341750</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>num_replicons</TAG>
          <VALUE>20</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>estimated_size</TAG>
          <VALUE>400 Mbp</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>ploidy</TAG>
          <VALUE>diploid</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>propagation</TAG>
          <VALUE>asexual</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>health_disease_stat</TAG>
          <VALUE>health</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>trophic_level</TAG>
          <VALUE>photosynthetic</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>BioSampleModel</TAG>
          <VALUE>MIGS.eu</VALUE>
        </SAMPLE_ATTRIBUTE>
      </SAMPLE_ATTRIBUTES>
    </SAMPLE>
    <Pool>
      <Member member_name="" accession="DRS057276" sample_name="SAMD00046318" sample_title="Irwin" spots="1513701" bases="1650906230" tax_id="29780" organism="Mangifera indica">
        <IDENTIFIERS>
          <PRIMARY_ID>DRS057276</PRIMARY_ID>
          <EXTERNAL_ID namespace="BioSample">SAMD00046318</EXTERNAL_ID>
        </IDENTIFIERS>
      </Member>
    </Pool>
    <RUN_SET>
      <RUN alias="DRR054308" center_name="NIFTS" accession="DRR054308" total_spots="724319" total_bases="790234730" size="1854856875" load_done="true" published="2018-01-10 04:26:33" is_public="true" cluster_name="public" static_data_available="1">
        <IDENTIFIERS>
          <PRIMARY_ID>DRR054308</PRIMARY_ID>
        </IDENTIFIERS>
        <TITLE>454 GS FLX+ sequencing of SAMD00046318</TITLE><EXPERIMENT_REF refname="DRX049157" refcenter="NIFTS" accession="DRX049157"/>
        <Pool>
          <Member member_name="" accession="DRS057276" sample_name="SAMD00046318" sample_title="Irwin" spots="724319" bases="790234730" tax_id="29780" organism="Mangifera indica">
            <IDENTIFIERS>
              <PRIMARY_ID>DRS057276</PRIMARY_ID>
              <EXTERNAL_ID namespace="BioSample">SAMD00046318</EXTERNAL_ID>
            </IDENTIFIERS>
          </Member>
        </Pool>
        <Statistics nreads="1" nspots="724319"><Read index="0" count="724319" average="1091.00" stdev="197.60"/></Statistics>
        <Bases cs_native="false" count="790234730"><Base value="A" count="252681330"/><Base value="C" count="133152082"/><Base value="G" count="139436462"/><Base value="T" count="253440531"/><Base value="N" count="11524325"/></Bases>
      </RUN>
      <RUN alias="DRR054307" center_name="NIFTS" accession="DRR054307" total_spots="789382" total_bases="860671500" size="2009178726" load_done="true" published="2018-01-10 04:26:33" is_public="true" cluster_name="public" static_data_available="1">
        <IDENTIFIERS>
          <PRIMARY_ID>DRR054307</PRIMARY_ID>
        </IDENTIFIERS>
        <TITLE>454 GS FLX+ sequencing of SAMD00046318</TITLE><EXPERIMENT_REF refname="DRX049157" refcenter="NIFTS" accession="DRX049157"/>
        <Pool>
          <Member member_name="" accession="DRS057276" sample_name="SAMD00046318" sample_title="Irwin" spots="789382" bases="860671500" tax_id="29780" organism="Mangifera indica">
            <IDENTIFIERS>
              <PRIMARY_ID>DRS057276</PRIMARY_ID>
              <EXTERNAL_ID namespace="BioSample">SAMD00046318</EXTERNAL_ID>
            </IDENTIFIERS>
          </Member>
        </Pool>
        <Statistics nreads="1" nspots="789382"><Read index="0" count="789382" average="1090.31" stdev="191.37"/></Statistics>
        <Bases cs_native="false" count="860671500"><Base value="A" count="277391866"/><Base value="C" count="144641177"/><Base value="G" count="150659961"/><Base value="T" count="276069965"/><Base value="N" count="11908531"/></Bases>
      </RUN>
    </RUN_SET>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>
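
For reference, here is a rough sketch of how the linked project and biosample could be pulled out of an experiment package like the one above. It is only an illustration in Python, not how the module does (or should do) it: `linked_accessions` is a made-up helper name, and the embedded XML is a trimmed copy of the record above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the experiment package above; only the elements used here are kept.
PACKAGE_XML = """<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <EXPERIMENT alias="DRX049157" center_name="NIFTS" accession="DRX049157">
      <STUDY_REF refname="DRP003980" refcenter="NIFTS" accession="DRP003980">
        <IDENTIFIERS>
          <PRIMARY_ID>DRP003980</PRIMARY_ID>
          <EXTERNAL_ID namespace="BioProject" label="BioProject ID">PRJDB4532</EXTERNAL_ID>
        </IDENTIFIERS>
      </STUDY_REF>
      <DESIGN>
        <SAMPLE_DESCRIPTOR refname="DRS057276" refcenter="NIFTS" accession="DRS057276">
          <IDENTIFIERS>
            <PRIMARY_ID>DRS057276</PRIMARY_ID>
            <EXTERNAL_ID namespace="BioSample" label="BioSample ID">SAMD00046318</EXTERNAL_ID>
          </IDENTIFIERS>
        </SAMPLE_DESCRIPTOR>
      </DESIGN>
    </EXPERIMENT>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>"""


def linked_accessions(package_xml):
    """Map each experiment accession to its BioProject and BioSample IDs.

    Hypothetical helper: it reads the EXTERNAL_ID elements nested under
    each EXPERIMENT and keeps the BioProject and BioSample namespaces.
    """
    root = ET.fromstring(package_xml)
    links = {}
    for experiment in root.iter("EXPERIMENT"):
        found = {}
        for ext in experiment.iter("EXTERNAL_ID"):
            namespace = ext.get("namespace")
            if namespace in ("BioProject", "BioSample"):
                found[namespace] = ext.text
        links[experiment.get("accession")] = found
    return links


links = linked_accessions(PACKAGE_XML)
print(links)
```

So the experiment record does carry both accessions, even though the project record alone does not lead to the biosample.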

bradfordcondon commented 5 years ago

Unlike an assembly, an SRA record is typically one of many records that are all grouped together.

For a single SRA (which is defined as an ...) we have a run, a library, a sample (biosample), and a study.
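
To make that concrete, here is a small Python sketch that pulls each of those pieces out of one experiment package. The XML is a cut-down version of the DRX049157 record from the first comment, and `summarize_package` is just an illustrative name, not something in the module.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the DRX049157 package: one experiment with its
# library, plus the study, sample, and runs that travel with it.
PACKAGE_XML = """<EXPERIMENT_PACKAGE>
  <EXPERIMENT accession="DRX049157">
    <DESIGN>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
      </LIBRARY_DESCRIPTOR>
    </DESIGN>
  </EXPERIMENT>
  <STUDY accession="DRP003980"/>
  <SAMPLE accession="DRS057276"/>
  <RUN_SET>
    <RUN accession="DRR054308"/>
    <RUN accession="DRR054307"/>
  </RUN_SET>
</EXPERIMENT_PACKAGE>"""


def summarize_package(package_xml):
    """Collect the experiment, library, study, sample, and run accessions."""
    pkg = ET.fromstring(package_xml)
    library = "EXPERIMENT/DESIGN/LIBRARY_DESCRIPTOR/"
    return {
        "experiment": pkg.find("EXPERIMENT").get("accession"),
        "library_strategy": pkg.findtext(library + "LIBRARY_STRATEGY"),
        "library_source": pkg.findtext(library + "LIBRARY_SOURCE"),
        "study": pkg.find("STUDY").get("accession"),
        "sample": pkg.find("SAMPLE").get("accession"),
        "runs": [run.get("accession") for run in pkg.findall("RUN_SET/RUN")],
    }


summary = summarize_package(PACKAGE_XML)
print(summary)
```

Note that one experiment can carry more than one run, as it does here.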

https://www.ncbi.nlm.nih.gov/sra/SRX5431186[accn]

https://www.ncbi.nlm.nih.gov/Traces/study/?WebEnv=NCID_1_23786308_130.14.22.76_5555_1551459537_3256482250_0MetA0_S_HStore&query_key=5

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP187033

So SRP187033 (the project study) is part of project PRJNA552953. It consists of 6 experiments and 6 runs: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP187033 . The runs are SRR8643699..704, with different biosamples and experiments (SRX5441940...) as well.
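
The `SRR8643699..704` shorthand is just my own notation, not an NCBI convention. If an importer ever wanted to accept something like it, a sketch of expanding it into individual run accessions could look like this (Python, with the hypothetical helper `expand_accession_range`):

```python
import re


def expand_accession_range(shorthand):
    """Expand a shorthand such as 'SRR8643699..704' into full accessions.

    Assumes the part after '..' holds only the trailing digits of the
    last accession in the range.
    """
    match = re.fullmatch(r"([A-Z]+)(\d+)\.\.(\d+)", shorthand)
    if match is None:
        raise ValueError(f"not an accession range: {shorthand!r}")
    prefix, start_digits, end_suffix = match.groups()
    start = int(start_digits)
    # Find the smallest number >= start that ends with the given digits.
    modulus = 10 ** len(end_suffix)
    end = start - (start % modulus) + int(end_suffix)
    if end < start:
        end += modulus
    width = len(start_digits)
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


runs = expand_accession_range("SRR8643699..704")
print(runs)
```

That yields six accessions, which agrees with the 6 runs listed for the study.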

SRA also defines what it calls an analysis, for example https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?analysis=DRZ000001 . This is part of a study (in this case DRP000072).

Mango example

Project PRJDB4532 has SRA experiment record DRX049157, which has:

Study: DRP003980

"SRA Sample": DRS057276 (not linked from the SRA record, but can be found associated with the sample with a broken linkout).