NAL-i5K / tripal_eutils

ncbi loader via the eutils interface
GNU General Public License v3.0

assembly: which xml tags should be added as properties/_cvterms? #54

Closed bradfordcondon closed 5 years ago

bradfordcondon commented 5 years ago

looking at https://github.com/NAL-i5K/tripal_eutils/tree/master/examples/assembly for examples.

Below are example tags from 751381 that aren't dealt with via dbxrefs/linked records:

      <AssemblyType>haploid</AssemblyType>
      <AssemblyClass>haploid</AssemblyClass>
      <AssemblyStatus>Scaffold</AssemblyStatus>
      <WGS>LVXX01</WGS>
      <Coverage>99</Coverage>
      <PartialGenomeRepresentation>false</PartialGenomeRepresentation>
      <Primary>4681358</Primary>
      <AssemblyDescription/>
      <ReleaseLevel>Major</ReleaseLevel>
      <ReleaseType>Major</ReleaseType>
      <AsmReleaseDate_GenBank>2016/06/01 00:00</AsmReleaseDate_GenBank>
      <AsmReleaseDate_RefSeq>2017/07/14 00:00</AsmReleaseDate_RefSeq>
      <SeqReleaseDate>2016/06/01 00:00</SeqReleaseDate>
      <AsmUpdateDate>2017/07/19 00:00</AsmUpdateDate>
      <SubmissionDate>2016/06/01 00:00</SubmissionDate>
      <LastUpdateDate>2017/07/19 00:00</LastUpdateDate>
      <SubmitterOrganization>Rubber Research Institute</SubmitterOrganization>
      <RefSeq_category>representative genome</RefSeq_category>
      <AnomalousList>
      </AnomalousList>
      <ExclFromRefSeq>
      </ExclFromRefSeq>
      <PropertyList>
        <string>full-genome-representation</string>
        <string>has-chloroplast</string>
        <string>has_annotation</string>
        <string>latest</string>
        <string>latest_genbank</string>
        <string>latest_refseq</string>
        <string>refseq_has_annotation</string>
        <string>representative</string>
        <string>wgs</string>
      </PropertyList>
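As a rough illustration of how tags like these could be flattened into candidate properties, here is a minimal Python sketch. It is not the module's own code (tripal_eutils is PHP). The `<DocumentSummary>` wrapper and the `collect_properties` helper are assumptions made for the example. Empty tags such as `<AssemblyDescription/>` are skipped, and list-style tags such as `<PropertyList>` become a list of values.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the <DocumentSummary> wrapper is an assumption for this sketch.
SAMPLE = """<DocumentSummary>
  <AssemblyType>haploid</AssemblyType>
  <AssemblyStatus>Scaffold</AssemblyStatus>
  <AssemblyDescription/>
  <Coverage>99</Coverage>
  <AnomalousList>
  </AnomalousList>
  <PropertyList>
    <string>has-chloroplast</string>
    <string>wgs</string>
  </PropertyList>
</DocumentSummary>"""


def collect_properties(xml_text):
    """Hypothetical helper: map each non-empty child tag to its value(s)."""
    root = ET.fromstring(xml_text)
    props = {}
    for child in root:
        if len(child):
            # Tag with nested elements (e.g. PropertyList): keep non-empty texts.
            values = [(item.text or "").strip() for item in child]
            values = [v for v in values if v]
            if values:
                props[child.tag] = values
        else:
            # Simple tag: keep it only if it carries text.
            text = (child.text or "").strip()
            if text:
                props[child.tag] = text
    return props


if __name__ == "__main__":
    for name, value in collect_properties(SAMPLE).items():
        print(name, value)
```

Whether every tag collected this way deserves a property (and a cvterm) is still the open question of this issue; the sketch only shows the mechanical part.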

additionally, we have all of the STATS tags.

<Stats>
  <Stat category="alt_loci_count" sequence_tag="all">0</Stat>
  <Stat category="chromosome_count" sequence_tag="all">0</Stat>
  <Stat category="contig_count" sequence_tag="all">48315</Stat>
  <Stat category="contig_l50" sequence_tag="all">6073</Stat>
  <Stat category="contig_n50" sequence_tag="all">60046</Stat>
  <Stat category="non_chromosome_replicon_count" sequence_tag="all">1</Stat>
  <Stat category="replicon_count" sequence_tag="all">1</Stat>
  <Stat category="scaffold_count" sequence_tag="all">7453</Stat>
  <Stat category="scaffold_count" sequence_tag="placed">1</Stat>
  <Stat category="scaffold_count" sequence_tag="unlocalized">0</Stat>
  <Stat category="scaffold_count" sequence_tag="unplaced">7452</Stat>
  <Stat category="scaffold_l50" sequence_tag="all">320</Stat>
  <Stat category="scaffold_n50" sequence_tag="all">1281786</Stat>
  <Stat category="total_length" sequence_tag="all">1373527118</Stat>
  <Stat category="ungapped_length" sequence_tag="all">1293730791</Stat>
</Stats>

Right now I collect each one by combining the category and the sequence_tag, so for example scaffold_count_all, scaffold_count_placed, etc. Would we want ALL of these as properties?
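The category-plus-tag naming described above can be sketched as follows. This is an illustrative Python version, not the module's PHP implementation, and `stats_to_properties` is a hypothetical name. It assumes every `<Stat>` value is an integer, which holds for the sample record.

```python
import xml.etree.ElementTree as ET

# Three entries taken from the Stats block of record 751381.
STATS_XML = """<Stats>
  <Stat category="scaffold_count" sequence_tag="all">7453</Stat>
  <Stat category="scaffold_count" sequence_tag="placed">1</Stat>
  <Stat category="contig_n50" sequence_tag="all">60046</Stat>
</Stats>"""


def stats_to_properties(xml_text):
    """Hypothetical helper: build {category_sequence_tag: value} from <Stat> tags."""
    root = ET.fromstring(xml_text)
    props = {}
    for stat in root.iter("Stat"):
        # Join the two attributes so each (category, sequence_tag) pair gets
        # its own property name, e.g. scaffold_count_placed.
        key = f"{stat.get('category')}_{stat.get('sequence_tag')}"
        props[key] = int(stat.text)
    return props


if __name__ == "__main__":
    print(stats_to_properties(STATS_XML))
```

Joining both attributes keeps the four scaffold_count entries distinct; keying on category alone would let the later ones overwrite the earlier ones.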

bradfordcondon commented 5 years ago

Stats are added as properties.

I'd like to get the "date performed" value to put in the chado base table. However, there are so many date tags. How would we know which is the right one?

      <AsmReleaseDate_GenBank>2016/06/01 00:00</AsmReleaseDate_GenBank>
      <AsmReleaseDate_RefSeq>2017/07/14 00:00</AsmReleaseDate_RefSeq>
      <SeqReleaseDate>2016/06/01 00:00</SeqReleaseDate>
      <AsmUpdateDate>2017/07/19 00:00</AsmUpdateDate>
      <SubmissionDate>2016/06/01 00:00</SubmissionDate>
      <LastUpdateDate>2017/07/19 00:00</LastUpdateDate>

SubmissionDate seems like a good choice. I don't know what happens if we have 2 assemblies, though.

mpoelchau commented 5 years ago

'Date performed' is one of those pieces of metadata that we have never aspired to collect from the user - often because we're importing assemblies, and you don't perform an arthropod genome assembly on a single day. I'm honestly not sure what the 'date performed' is meant for. But, it's required. I think you are correct that SubmissionDate is probably the best alternative.

When would you have 2 assemblies for a given analysis? I guess everything's possible...

bradfordcondon commented 5 years ago

<WGS>LVXX01</WGS> gets added as a dbxref (see #58).

> When would you have 2 assemblies for a given analysis?

I think I was thinking of GenBank vs RefSeq, but you're right, those have their own unique release date tags. So now I'm not sure what my concern was :)

We'll go with SubmissionDate.
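For reference, a minimal Python sketch of pulling SubmissionDate out of a record and parsing it into a timestamp that could fill the "date performed" field. The `<DocumentSummary>` wrapper and the `submission_date` helper are assumptions for illustration, and the date format is inferred from the sample values in this thread (`YYYY/MM/DD HH:MM`).

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Date tags from record 751381; the wrapper element is an assumption.
DATE_XML = """<DocumentSummary>
  <AsmReleaseDate_GenBank>2016/06/01 00:00</AsmReleaseDate_GenBank>
  <AsmReleaseDate_RefSeq>2017/07/14 00:00</AsmReleaseDate_RefSeq>
  <SubmissionDate>2016/06/01 00:00</SubmissionDate>
  <LastUpdateDate>2017/07/19 00:00</LastUpdateDate>
</DocumentSummary>"""


def submission_date(xml_text):
    """Hypothetical helper: return SubmissionDate as a datetime, or None if absent."""
    root = ET.fromstring(xml_text)
    text = root.findtext("SubmissionDate")
    if not text or not text.strip():
        return None
    # Format inferred from the sample values, e.g. "2016/06/01 00:00".
    return datetime.strptime(text.strip(), "%Y/%m/%d %H:%M")


if __name__ == "__main__":
    print(submission_date(DATE_XML))
```

Returning None when the tag is missing leaves the fallback decision to the caller, since the field is required in the base table.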