NCBI-Hackathons / EDirectCookbook

The efetch command reported BioSamples’ Attributes completely different from those displayed at the NCBI website #45

Open lauraht opened 5 years ago

lauraht commented 5 years ago

I was trying to use efetch to obtain the Attributes of a BioSample record, but for some BioSample records the Attributes reported in the XML are completely different from those displayed on the NCBI website. The BioSample Id reported in the XML is also different from the BioSample Id specified in the efetch command. I use the following command to get the XML of a BioSample record:

efetch -db biosample -id SAMEA5244969 -format xml

Example 1: for BioSample SAMEA5244969, the NCBI website displays the Attributes as shown at https://www.ncbi.nlm.nih.gov/biosample/10858554. However, the efetch command returned the following XML:

<?xml version="1.0" ?>
<BioSampleSet>
   <BioSample access="public" publication_date="2016-06-04T00:00:00.000" last_update="2017-01-23T16:11:22.000" submission_date="2016-06-14T11:27:28.390" id="5244969" accession="SAMEA4457316">   
      <Ids>     
         <Id db="BioSample" is_primary="1">SAMEA4457316</Id>   
      </Ids>   
      <Description>     
         <Title>Sample from Homo sapiens</Title>     
         <Organism taxonomy_id="9606" taxonomy_name="Homo sapiens">       
            <OrganismName>Homo sapiens</OrganismName>     
         </Organism>   
      </Description>   
      <Owner>     
         <Name>EBI</Name>   
      </Owner>   
      <Models>     
         <Model>Generic</Model>   
      </Models>   
      <Package display_name="Generic">Generic.1.0</Package>   
      <Attributes>     
         <Attribute attribute_name="Sample Name" harmonized_name="sample_name" display_name="sample name">source 4</Attribute>     
         <Attribute attribute_name="Sex" harmonized_name="sex" display_name="sex">male</Attribute>     
         <Attribute attribute_name="disease state" harmonized_name="disease" display_name="disease">normal</Attribute>     
         <Attribute attribute_name="organism part" harmonized_name="tissue" display_name="tissue">colon</Attribute>     
         <Attribute attribute_name="specimen with known storage state">frozen specimen</Attribute>  
      </Attributes>   
      <Status status="live" when="2016-06-14T11:27:28.393"/> 
   </BioSample> 
</BioSampleSet>

The Attributes in this XML are completely different from those displayed on the NCBI website, and the BioSample Id reported in the XML (SAMEA4457316) is different from the BioSample Id specified in the efetch command (SAMEA5244969).

Example 2: for BioSample SAMEA104565009, the NCBI website displays the Attributes as shown at https://www.ncbi.nlm.nih.gov/biosample/11349430. However, the efetch command returned the following XML:

<?xml version="1.0" ?>

This XML does not contain any elements, even though a list of Attributes is displayed on the NCBI website.

Example 3: for BioSample SAMEA5099860, the NCBI website displays the Attributes as shown at https://www.ncbi.nlm.nih.gov/biosample/10655621. However, the efetch command returned the following XML:

<?xml version="1.0" ?>
<BioSampleSet>
   <BioSample access="public" publication_date="2014-10-22T00:00:00.000" last_update="2016-10-25T08:32:28.000" submission_date="2016-05-19T19:48:00.303" id="5099860" accession="SAMEA3067264">   
      <Ids>     
         <Id db="BioSample" is_primary="1">SAMEA3067264</Id>   
      </Ids>   
      <Description>     
         <Title>Sample from Homo sapiens</Title>     
         <Organism taxonomy_id="9606" taxonomy_name="Homo sapiens">  
            <OrganismName>Homo sapiens</OrganismName>     
         </Organism>     
         <Comment>       
            <Paragraph>ExAC_v0.1_Sample_52281</Paragraph>     
         </Comment>   
      </Description>   
      <Owner>     
         <Name>EBI</Name>   
      </Owner>   
      <Models>     
         <Model>Generic</Model>   
      </Models>   
      <Package display_name="Generic">Generic.1.0</Package>   
      <Attributes>     
         <Attribute attribute_name="Sample Name" harmonized_name="sample_name" display_name="sample name">52281</Attribute>   
      </Attributes>   
      <Status status="live" when="2016-05-19T19:48:00.305"/> 
   </BioSample> 
</BioSampleSet>

The Attributes in this XML are completely different from those displayed on the NCBI website, and the BioSample Id reported in the XML (SAMEA3067264) is different from the BioSample Id specified in the efetch command (SAMEA5099860).

Do you have any idea why the efetch command did not work correctly for the above BioSamples?

I’d greatly appreciate your help!

Thank you very much!

lwagnerdc commented 5 years ago

Hmm, it looks like efetch has stripped the non-integer part of your ID. It appears to understand only Entrez numeric IDs, not BioSample or SRA accessions. The accessions are indexed, though, so it takes an extra step:

esearch -db biosample -q ERS3052368 | efetch -format xml
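
The guess that only the digits of the identifier survive can be checked against the two records quoted in this issue. The Python sketch below is an illustration of that observed behaviour, not efetch's actual implementation; `numeric_part` is a hypothetical helper, and the expected values are the `id=` attributes copied from the XML in Examples 1 and 3.

```python
import re


def numeric_part(identifier):
    """Drop every non-digit character (assumed to mimic what efetch did here)."""
    return re.sub(r"\D", "", identifier)


# Accession passed to efetch -> id= attribute of the record that came back
# (values taken from Examples 1 and 3 in this issue).
observed = {
    "SAMEA5244969": "5244969",
    "SAMEA5099860": "5099860",
}

for accession, returned_uid in observed.items():
    # Prints the accession, its digits, and whether they equal the returned id.
    print(accession, numeric_part(accession), numeric_part(accession) == returned_uid)
```

In both cases the digits of the accession equal the numeric ID of the unrelated record that was returned, which is consistent with the explanation above.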

lauraht commented 5 years ago

Thank you so much for your help! I really appreciate it! I used the BioSample accession in “-q” (instead of the SRA accession), as below:

esearch -db biosample -q SAMEA5244969 | efetch -format xml

and it works as expected. Thanks again!
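
For anyone scripting this, a small guard can catch the mismatch before the wrong Attributes are used. The Python sketch below parses XML in the shape shown in this issue and checks whether the requested accession is among the returned records. `returned_accessions` and `matches_request` are hypothetical helper names, and the sample XML is a trimmed copy of Example 1, not a fresh efetch result.

```python
import xml.etree.ElementTree as ET


def returned_accessions(xml_text):
    """Return the accession attribute of every BioSample element in the XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # An output with only the XML declaration (as in Example 2) has no root.
        return []
    return [sample.get("accession") for sample in root.iter("BioSample")]


def matches_request(xml_text, requested):
    """True if the requested accession is among the returned records."""
    return requested in returned_accessions(xml_text)


# Trimmed copy of the XML from Example 1 (requested: SAMEA5244969).
EXAMPLE_1 = """<?xml version="1.0" ?>
<BioSampleSet>
  <BioSample id="5244969" accession="SAMEA4457316">
    <Ids><Id db="BioSample" is_primary="1">SAMEA4457316</Id></Ids>
  </BioSample>
</BioSampleSet>"""

print(matches_request(EXAMPLE_1, "SAMEA5244969"))  # False: a different record came back
print(matches_request(EXAMPLE_1, "SAMEA4457316"))  # True
print(returned_accessions('<?xml version="1.0" ?>\n'))  # [] for the empty result
```

Comparing on the `accession` attribute rather than the numeric `id` is deliberate here, since the numeric `id` is exactly what collided in the examples above.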