GMOD / Chado

the GMOD database schema
http://gmod.org/wiki/Chado
Artistic License 2.0

Changes to the Project and Biomaterial table #41

Open spficklin opened 6 years ago

spficklin commented 6 years ago

This issue was imported from the Chado v1.4 requested changes Google Doc:

Requests by Andrew Farmer

http://gmod.827538.n3.nabble.com/RFC-Chado-relase-v1-3-tp4048655p4048675.html

bradfordcondon commented 6 years ago

@spficklin @scottcain and I agree that the general components of MAGE make sense to move to a general module as suggested. However, the move would also disrupt the documentation while not providing additional functionality, so we'll keep it as is.

@adf-ncgr this looks like your issue, so if you have additional comments we welcome them.

adf-ncgr commented 6 years ago

@bradfordcondon thanks for revisiting this! I don't recall discussion of moving tables to another module, but don't have strong feelings about that. The original motivation behind my suggested changes was to try to accommodate import of information from NCBI BioProject and BioSample for use with SRA data (not only RNA-seq). We haven't done too much on that front in our project context since the suggestions were made, but it remains of interest.

Just reviewing @spficklin's comments with respect to whether foreign keys are redundant with linker tables, I think the idea was to use the foreign key when there was a "primary" relationship (e.g. the project that generated the biomaterial, as opposed to subsequent projects in which it may have been reused), similar to other things already present in Chado such as feature.dbxref_id and feature_dbxref. But again I don't have strong feelings about this.
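
To make the "primary foreign key plus linker table" pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `biomaterial.project_id` column and the `project_biomaterial` table are hypothetical stand-ins for the proposal, not existing Chado definitions, and the tables are cut down to the few columns needed; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (project_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);

-- project_id is the hypothetical "primary" foreign key: the project that
-- generated the biomaterial.
CREATE TABLE biomaterial (
    biomaterial_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    project_id INTEGER REFERENCES project (project_id)
);

-- Hypothetical linker table: later projects that reuse the biomaterial.
CREATE TABLE project_biomaterial (
    project_id INTEGER NOT NULL REFERENCES project (project_id),
    biomaterial_id INTEGER NOT NULL REFERENCES biomaterial (biomaterial_id),
    UNIQUE (project_id, biomaterial_id)
);

INSERT INTO project VALUES (1, 'original sequencing project'), (2, 'later reanalysis');
INSERT INTO biomaterial VALUES (10, 'sample A', 1);
INSERT INTO project_biomaterial VALUES (2, 10);
""")

def projects_for(biomaterial_id):
    """Return (project name, role) pairs, with the primary project first."""
    return conn.execute("""
        SELECT p.name, 'primary' AS role
          FROM biomaterial b JOIN project p ON p.project_id = b.project_id
         WHERE b.biomaterial_id = ?
        UNION ALL
        SELECT p.name, 'secondary'
          FROM project_biomaterial pb JOIN project p ON p.project_id = pb.project_id
         WHERE pb.biomaterial_id = ?
         ORDER BY role
    """, (biomaterial_id, biomaterial_id)).fetchall()

print(projects_for(10))
```

The trade-off is the one raised in the thread: readers need both the column and the linker table to see every association, which is the same situation as feature.dbxref_id and feature_dbxref.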

mpoelchau commented 6 years ago

We have a similar use case to what @adf-ncgr describes (we want to store NCBI BioProject/BioSample metadata) and would appreciate this addition.

ekcannon commented 6 years ago

The experiment design fields in MAGE have wider application, not just for microarray experiments. Would it make sense to pull them into their own module? They can be used in-situ, but some people, especially purists or newbies, may be uncomfortable using MAGE tables for non-microarray data, or may not think to do so. The nd_experiment table doesn't really serve the need of an experiment table, in spite of its name.

Yes, please add a type_id to the project table, even if I have to re-write loader scripts.

Yes, a stock_id field or a stock_biomaterial table would be necessary.

I would like to use a set of experiment tables (biomaterial, biomaterial_stock, biomaterial_dbxref, ?) for assembly, gene model, RNA-seq, transcriptome metadata.

laceysanderson commented 6 years ago

I would also be Very happy with a project.type_id as KnowPulse is currently storing the type in the projectprop table which I don't like ;-P

I also support changing ALL "description" fields to type text, especially the project table.

ekcannon commented 6 years ago

After doing some experimenting with genome assembly metadata, the existing tables, along with the recommended new tables and fields above work pretty well. But I see a problem.

The biomaterial table has an organism_id field (taxon_id). Adding the field, stock_id, would create a potential situation in which the biomaterial could be attached to one organism record and the stock to another. Same situation could come up with adding linker table, biomaterial_stock. But it must be possible to attach a biomaterial record to a stock record if it is used to describe a biosample.

Removing an existing field (taxon_id) from biomaterial seems like a big no-no. Perhaps, since the taxon_id field is optional, we could recommend that people attach a biomaterial record to an organism or a stock, but not both. Otherwise, there seems no good solution.

Linker table vs stock_id field: is it possible that more than one stock could be attached to a biomaterial record? Although I can't see how that could happen, this is biology, so I'd recommend a linker table over adding a field.
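
Since the schema itself may not be able to prevent the mismatch described above, one option is an audit query run after loading. The sketch below uses Python's sqlite3 with heavily simplified tables; `biomaterial_stock` is the hypothetical linker table proposed here, and the sample rows (including the second organism) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, genus TEXT, species TEXT);
CREATE TABLE stock (
    stock_id INTEGER PRIMARY KEY,
    name TEXT,
    organism_id INTEGER REFERENCES organism (organism_id)
);
CREATE TABLE biomaterial (
    biomaterial_id INTEGER PRIMARY KEY,
    name TEXT,
    taxon_id INTEGER REFERENCES organism (organism_id)  -- optional field
);
-- Hypothetical linker table proposed in this thread.
CREATE TABLE biomaterial_stock (
    biomaterial_id INTEGER NOT NULL REFERENCES biomaterial (biomaterial_id),
    stock_id INTEGER NOT NULL REFERENCES stock (stock_id),
    UNIQUE (biomaterial_id, stock_id)
);

INSERT INTO organism VALUES (1, 'Apis', 'mellifera'), (2, 'Apis', 'cerana');
INSERT INTO stock VALUES (100, 'DH4', 1);
INSERT INTO biomaterial VALUES (10, 'consistent sample', 1),
                               (11, 'conflicting sample', 2),
                               (12, 'sample without taxon_id', NULL);
INSERT INTO biomaterial_stock VALUES (10, 100), (11, 100), (12, 100);
""")

def organism_conflicts():
    """List biomaterials whose taxon_id disagrees with a linked stock's organism."""
    return conn.execute("""
        SELECT b.name, b.taxon_id, s.organism_id
          FROM biomaterial b
          JOIN biomaterial_stock bs ON bs.biomaterial_id = b.biomaterial_id
          JOIN stock s ON s.stock_id = bs.stock_id
         WHERE b.taxon_id IS NOT NULL          -- an unset taxon_id cannot conflict
           AND b.taxon_id <> s.organism_id
         ORDER BY b.biomaterial_id
    """).fetchall()

print(organism_conflicts())
```

A biomaterial with no taxon_id never appears in the result, which matches the "attach to an organism or a stock, but not both" recommendation.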

scottcain commented 5 years ago

I really enjoy @ekcannon's last item in the previous comment. How many times have I said "I don't really see how that could happen" and then biology proves me wrong? Now that some time has passed, is there any more insight on this issue?

ekcannon commented 5 years ago

My thoughts on this haven't changed much.

Probably a different issue: yes on making all description fields of type TEXT.

bradfordcondon commented 5 years ago

How would one link a project to an organism? My reasoning is that if one used the project table to hold an NCBI bioproject, that record is directly linked to an NCBI Taxon, which we store in organism.

Should there also be a project_organism table?

Edit: after reading @adf-ncgr's response below, perhaps the question is "do all NCBI bioprojects have biosamples?" If the answer is yes, then the current schema is sufficient.

adf-ncgr commented 5 years ago

I guess this could be handy for bioprojects that are placeholders (i.e., before they have "real data" associated), though there may be some "denormalization danger" lurking when biosamples and such are actually linked in (which is how I was imagining projects would get associations with organisms). From what I can see at NCBI (without actually looking at a schema), it looks like it might be sufficient to add an optional FK to organism in the project table to support the use case of being like NCBI's BioProject. These are just some quick thoughts/2c; I don't really have any objection to the suggestion you have made about a many-to-many linker table, which does seem a somewhat more "chadoesque" approach to such things.

Thanks for keeping this moving along!

ekcannon commented 5 years ago

That is my concern too: risk of denormalizing errors. Up to this point I've been able to link projects to organisms through attached records in other tables, but it's conceivable that one would need to link a project to an organism and have no other way to do so except via a project_organism table.

Do you have a specific example, @bradfordcondon?

As it is, it's possible to do a "denormalized data mapping" in which the same data field value is in more than one table. Chado requires a fair bit from the database architect to ensure this doesn't happen as I think it may be impossible to prevent through the schema itself.

adf-ncgr commented 5 years ago

Well, FWIW here's an example of what I meant by a placeholder: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA521298 It indicates the organism concerned, but not by way of any submitted data (yet). Not sure that it's important to support this, but I do think that it's a pretty nice mechanism to make people aware of what activities may be in progress in the community.

bradfordcondon commented 5 years ago

> Do you have a specific example, @bradfordcondon?

I do, but it's more of an NCBI issue than anything. This bioproject doesn't return any biosamples in the XML (reproduced below).

https://www.ncbi.nlm.nih.gov/bioproject/?term=477511

The XML is what we'll get from querying EUtils, so, programmatically, we won't have a biosample to link to this bioproject. However, the actual website has the biosample attached to the project. I think it gets this info because the biosample includes the bioproject link, but not the other way around. What I'm saying is, there's probably a workaround I can implement on my end: if no biosample is included in the XML, query the EUtils biosample database for samples listed with that project.

<?xml version="1.0"?>
<RecordSet><DocumentSummary uid="477511">
    <Project>
        <ProjectID>
            <ArchiveID accession="PRJNA477511" archive="NCBI" id="477511"/>
            <LocalID>bp0</LocalID>
            <LocalID>bp0</LocalID>
        </ProjectID>
        <ProjectDescr>
            <Name>Apis mellifera strain:DH4</Name>
            <Title>Apis mellifera strain:DH4 RefSeq Genome sequencing and assembly</Title>
            <Description>The reference sequence (RefSeq) genome assembly is derived from the submitted GenBank assembly (see linked project PRJNA471592). Annotation provided on the RefSeq genomic records is based on NCBI annotation pipeline.</Description>
            <ExternalLink label="Matthew Webster's group webpage">
                <URL>http://www.imbim.uu.se/forskargrupper/genetik-och-genomik/Webster_Matthew/</URL>
            </ExternalLink>
            <Publication id="8417993" status="ePublished">
                <Reference/>
                <DbType>ePubmed</DbType>
            </Publication>
            <Publication id="24479613" status="ePublished">
                <Reference/>
                <DbType>ePubmed</DbType>
            </Publication>
            <ProjectReleaseDate>2018-06-19T00:00:00Z</ProjectReleaseDate>
            <Relevance>
                <ModelOrganism>yes</ModelOrganism>
            </Relevance>
            <RefSeq representation="eReference">
                <AnnotationSource>
                    <Name>NCBI annotation pipeline</Name>
                </AnnotationSource>
            </RefSeq>
        </ProjectDescr>
        <ProjectType>
            <ProjectTypeSubmission>
                <Target capture="eWhole" material="eGenome" sample_scope="eMonoisolate">
                    <Organism species="7460" taxID="7460">
                        <OrganismName>Apis mellifera</OrganismName>
                        <Strain>DH4</Strain>
                        <Supergroup>eEukaryotes</Supergroup>
                    </Organism>
                </Target>
                <Method method_type="eSequencing"/>
                <Objectives>
                    <Data data_type="eAnnotation"/>
                </Objectives>
                <IntendedDataTypeSet>
                    <DataType>genome sequencing and assembly</DataType>
                </IntendedDataTypeSet>
                <ProjectDataTypeSet>
                    <DataType>RefSeq genome sequencing and assembly</DataType>
                </ProjectDataTypeSet>
            </ProjectTypeSubmission>
        </ProjectType>
    </Project>
    <Submission last_update="2018-05-16" submission_id="SUB4045961" submitted="2018-05-16">
        <Description>
            <!-- Submitter information has been removed -->
            <Organization role="owner" type="institute" url="http://www.ncbi.nlm.nih.gov">
                <Name abbr="NCBI">National Center for Biotechnology Information</Name>
                <!-- Contact information has been removed -->
            </Organization>
            <Access>public</Access>
        </Description>
        <Action action_id="SUB4045961-bp0"/>
    </Submission>
    <ProjectLinks>
        <Link>
            <ProjectIDRef archive="NCBI" id="477511" accession="PRJNA477511"/>
            <PeerProject>
                <CommonInputData>eRefseqGenbank</CommonInputData>
                <MemberID archive="NCBI" id="471592" accession="PRJNA471592"/>
            </PeerProject>
        </Link>
    </ProjectLinks>
</DocumentSummary>

</RecordSet>

@mpoelchau has pointed out this is weirder than I initially thought. The biosample that's listed via the website only links back to a different bioproject:

<Link type="entrez" target="bioproject" label="PRJNA471592">471592</Link>

She has suggested maybe this has to do with it being a refseq bioproject, which lacks the biosample link?
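
One way the fallback could be sketched, given that the biosample points at PRJNA471592, which is also the PeerProject member in the XML above: collect the project's own UID plus any peer project UIDs, then build an ELink request for each. This is only an assumption about how the workaround might look. The function names are hypothetical, no request is actually sent, and whether ELink returns biosamples for either project has not been checked here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# NCBI E-utilities ELink endpoint.
ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

# Trimmed copy of the ProjectLinks section from the XML posted above.
LINKS_XML = """<RecordSet><DocumentSummary uid="477511">
  <ProjectLinks>
    <Link>
      <ProjectIDRef archive="NCBI" id="477511" accession="PRJNA477511"/>
      <PeerProject>
        <CommonInputData>eRefseqGenbank</CommonInputData>
        <MemberID archive="NCBI" id="471592" accession="PRJNA471592"/>
      </PeerProject>
    </Link>
  </ProjectLinks>
</DocumentSummary></RecordSet>"""

def candidate_project_uids(xml_text):
    """Return the project's own UID followed by any PeerProject member UIDs."""
    root = ET.fromstring(xml_text)
    uids = []
    for summary in root.iter("DocumentSummary"):
        uids.append(summary.get("uid"))
        for member in summary.iterfind("./ProjectLinks/Link/PeerProject/MemberID"):
            uids.append(member.get("id"))
    return uids

def biosample_elink_url(bioproject_uid):
    """Build (but do not send) an ELink request from bioproject to biosample."""
    params = {"dbfrom": "bioproject", "db": "biosample", "id": bioproject_uid}
    return ELINK + "?" + urlencode(params)

for uid in candidate_project_uids(LINKS_XML):
    print(biosample_elink_url(uid))
```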

spficklin commented 5 years ago

@bradfordcondon can you clarify? Does NCBI provide the taxon information for the BioProject, or do you get that information from the BioSample record?

In the XML you posted above there is this section that provides details about the project type's target organism. Is this the organism data you are hoping to link to your project?

        <ProjectType>
            <ProjectTypeSubmission>
                <Target capture="eWhole" material="eGenome" sample_scope="eMonoisolate">
                    <Organism species="7460" taxID="7460">
                        <OrganismName>Apis mellifera</OrganismName>
                        <Strain>DH4</Strain>
                        <Supergroup>eEukaryotes</Supergroup>
                    </Organism>
                </Target>
                <Method method_type="eSequencing"/>
                <Objectives>
                    <Data data_type="eAnnotation"/>
                </Objectives>
                <IntendedDataTypeSet>
                    <DataType>genome sequencing and assembly</DataType>
                </IntendedDataTypeSet>
                <ProjectDataTypeSet>
                    <DataType>RefSeq genome sequencing and assembly</DataType>
                </ProjectDataTypeSet>
            </ProjectTypeSubmission>
        </ProjectType>

I guess what I'm getting at, if you stored the NCBI BioSample in the chado.biomaterial table would that be sufficient? Then you could link the project to the biomaterial via a project_biomaterial table and you'd have access to the organism info for that BioSample via the taxon_id field.

The more I think on this problem I don't see how you are not going to create a normalization problem.

project -> project_biomaterial -> biomaterial -> organism
project -> project_stock -> stock -> organism
project -> project_analysis -> analysis -> analysisfeature -> feature -> organism

In any of those cases you've got multiple paths to the organism table resulting in potential conflicts.
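
A rough way to surface such conflicts after the fact is to gather every organism a project can reach and flag projects that reach more than one. The sketch below (Python's sqlite3) covers only the biomaterial and stock paths; `project_biomaterial` and `project_stock` are the linker tables named above, treated here as hypothetical, and all tables, IDs, and rows are simplified, invented examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (project_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE biomaterial (biomaterial_id INTEGER PRIMARY KEY, taxon_id INTEGER);
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, organism_id INTEGER);
CREATE TABLE project_biomaterial (project_id INTEGER, biomaterial_id INTEGER);
CREATE TABLE project_stock (project_id INTEGER, stock_id INTEGER);

INSERT INTO project VALUES (1, 'consistent project'), (2, 'conflicting project');
INSERT INTO biomaterial VALUES (10, 7), (11, 7);
INSERT INTO stock VALUES (20, 7), (21, 8);
INSERT INTO project_biomaterial VALUES (1, 10), (2, 11);
INSERT INTO project_stock VALUES (1, 20), (2, 21);
""")

def projects_with_multiple_organisms():
    """Return (project name, organism count) for projects reaching >1 organism."""
    return conn.execute("""
        WITH reachable AS (
            -- path 1: project -> project_biomaterial -> biomaterial -> organism
            SELECT pb.project_id, b.taxon_id AS organism_id
              FROM project_biomaterial pb
              JOIN biomaterial b ON b.biomaterial_id = pb.biomaterial_id
             WHERE b.taxon_id IS NOT NULL
            UNION
            -- path 2: project -> project_stock -> stock -> organism
            SELECT ps.project_id, s.organism_id
              FROM project_stock ps
              JOIN stock s ON s.stock_id = ps.stock_id
        )
        SELECT p.name, COUNT(DISTINCT r.organism_id) AS n_organisms
          FROM reachable r
          JOIN project p ON p.project_id = r.project_id
         GROUP BY p.project_id, p.name
        HAVING COUNT(DISTINCT r.organism_id) > 1
    """).fetchall()

print(projects_with_multiple_organisms())
```

More than one organism is not necessarily an error, since a project may legitimately span species, so the result is a list to review rather than a list of violations.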

bradfordcondon commented 5 years ago

> Does NCBI provide the taxon information for the BioProject or do you get that information from the BioSample record?

The taxon information comes from the project. There's no biosample in that XML file, which is the problem.

> in the XML you posted above there is this section that provides details about the project type's target organism. Is this the organism data you are hoping to link to your project?

yes!
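
For reference, pulling that organism block out of the BioProject XML could look like the following sketch, which uses Python's standard xml.etree.ElementTree on a trimmed copy of the XML posted above. The function name and the returned dictionary keys are hypothetical; mapping the taxID onto a Chado organism record is left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the DocumentSummary posted earlier in this thread.
SUMMARY_XML = """<RecordSet><DocumentSummary uid="477511">
  <Project>
    <ProjectID>
      <ArchiveID accession="PRJNA477511" archive="NCBI" id="477511"/>
    </ProjectID>
    <ProjectType>
      <ProjectTypeSubmission>
        <Target capture="eWhole" material="eGenome" sample_scope="eMonoisolate">
          <Organism species="7460" taxID="7460">
            <OrganismName>Apis mellifera</OrganismName>
            <Strain>DH4</Strain>
          </Organism>
        </Target>
      </ProjectTypeSubmission>
    </ProjectType>
  </Project>
</DocumentSummary></RecordSet>"""

def project_organisms(xml_text):
    """Return one dict per Organism element found in each DocumentSummary."""
    root = ET.fromstring(xml_text)
    found = []
    for summary in root.iter("DocumentSummary"):
        archive = summary.find("./Project/ProjectID/ArchiveID")
        accession = archive.get("accession") if archive is not None else None
        for organism in summary.iter("Organism"):
            found.append({
                "accession": accession,
                "taxid": organism.get("taxID"),
                "name": organism.findtext("OrganismName"),
                "strain": organism.findtext("Strain"),  # None when absent
            })
    return found

print(project_organisms(SUMMARY_XML))
```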

> The more I think on this problem I don't see how you are not going to create a normalization problem.

Yes I can see how having projects connect to everything is going to be a big challenge from a normalization standpoint.