biothings / mygeneset.info

Apache License 2.0

Open Question: Genes of multiple species (aka. organisms) in one geneset? #10

Closed dongbohu closed 1 year ago

dongbohu commented 3 years ago

All genes in a Tribe geneset are associated with the same species (organism). Does it make sense to allow users to create a geneset that includes genes from multiple species?

If it does, then the tax_id field in the geneset schema should be changed from a single integer value to a list of integers.

cgreene commented 3 years ago

I think this is an important item to resolve. We could imagine having one geneset per, say, GO term or we could imagine having one geneset per GO term per species. It is likely to affect how the frontend is designed as well.

cgreene commented 3 years ago

As a user, I think my natural first expectation would be that each species has its own gene set. This is probably why it is designed that way in Tribe. If we do things with multiple species per gene set, we will want to make sure our export and API instructions give folks easy ways to filter.

cgreene commented 3 years ago

Search performance might be better (fewer duplicate genesets with identical names) if we allowed multiple species per geneset so there wouldn't be 10+ GO terms with the same name.

newgene commented 3 years ago

@cgreene I believe @dongbohu was referring to the case where a single pathway or "functional set" contains genes from multiple species (that is, for whatever reason, two genes from different species are related in the same functional set). I don't actually have a real example of this, so I'd leave this issue until we have real-world examples.

In the case of GO annotations as you described, the genesets from different species annotated to the same GO term should be considered different "functional sets" and stored in different "geneset" objects. We need to decide how to assign the _id field to those geneset objects, since they all have the same GO ID.

ravila4 commented 3 years ago

Regarding the _id field for GO terms, I am currently using: [goterm]-[taxid]

newgene commented 3 years ago

@ravila4 you may want to avoid special characters in _id; maybe use something like goid_taxid.
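
A minimal sketch of the goid_taxid naming suggested above, assuming GO identifiers of the form GO:0008150 and an integer taxid. The helper name make_geneset_id and the choice to replace the colon with an underscore are illustrative assumptions, not something this thread settled on.

```python
def make_geneset_id(go_id: str, taxid: int) -> str:
    """Build a geneset _id of the form goid_taxid (hypothetical helper)."""
    # Assumption: replace ":" in the GO identifier so the _id contains only
    # letters, digits, and underscores.
    safe_go_id = go_id.replace(":", "_")
    return f"{safe_go_id}_{taxid}"


# The same GO term yields a distinct _id for each species.
print(make_geneset_id("GO:0008150", 9606))   # GO_0008150_9606
print(make_geneset_id("GO:0008150", 10090))  # GO_0008150_10090
```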

ravila4 commented 3 years ago

As an alternative, perhaps we could allow the user to create sets of sets? I imagine we could use a relationship such as "parent" and "children" which allows us to combine sets without duplicating the data. This setup could also allow us to avoid replicating data in hierarchical resources such as GO.
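
One way the "sets of sets" idea might look, as a rough sketch only. The "children" field name, the in-memory GENESETS store, the resolve_genes helper, and the gene identifiers are all hypothetical placeholders; nothing here reflects the actual schema.

```python
from typing import Dict, Optional, Set

# Hypothetical in-memory store mapping geneset _id -> document.
# The "children" field name is an assumption made for illustration.
GENESETS: Dict[str, dict] = {
    "parent_set": {"genes": [], "children": ["child_a", "child_b"]},
    "child_a": {"genes": ["1017", "1018"], "children": []},
    "child_b": {"genes": ["1018", "12566"], "children": []},
}


def resolve_genes(
    set_id: str, store: Dict[str, dict], _seen: Optional[Set[str]] = None
) -> Set[str]:
    """Collect the genes of a set and all of its descendants without copying data."""
    if _seen is None:
        _seen = set()
    if set_id in _seen:
        # Guard against cycles (a set that is its own ancestor).
        return set()
    _seen.add(set_id)

    doc = store[set_id]
    genes = set(doc.get("genes", []))
    for child_id in doc.get("children", []):
        genes |= resolve_genes(child_id, store, _seen)
    return genes


print(sorted(resolve_genes("parent_set", GENESETS)))
```

The parent document stores only references, so a gene shared by several children is held once per child and deduplicated when the parent is resolved.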

dongbohu commented 3 years ago

Yeah, in the meeting last week, I forgot that multiple species may be associated with one GO term.

vincerubinetti commented 3 years ago

I am not one of the target users, but FWIW I don't see any benefit to limiting sets to one species, and doing so rests on pretty big assumptions about how people will use the tool. I guess the idea is that most people are using it to perform analyses on large swaths of genes, in which case they will probably only be dealing with one species at a time. But perhaps someone else might want to look at the same gene across multiple different species for some reason? Maybe they want to look at genes that are related to a specific disease across different species and want to gather/list/format them in one place? Or some other use case that we can't think of.

It seems to cost us pretty much nothing and may even have some benefits?

> we will want to make sure our export and API instructions give folks easy ways to filter.

It wouldn't be hard to make the frontend do this.

ravila4 commented 3 years ago

If we allow multiple species per geneset, we need to add some additional optional fields to the schema. I propose that for such a document, the top level "taxid" field could take the string "multispecies" to indicate that each of the genes may correspond to a different organism.

{
    "taxid": "multispecies",    // Change mapping type from integer to keyword
    "genes": [
        {
            "taxid": "9606",    // New optional field for each gene
            ...
        },
        {
            "taxid": "10090",
            ...
        },
        ...
    ]
}

Thoughts? @cgreene @vincerubinetti @dongbohu

ravila4 commented 3 years ago

Another, better option: instead of the string "multispecies", we can make the top-level "taxid" field an array of all the unique taxid values. That way we keep the data type as integer.


{
    "taxid": [9606, 10090],     // Array of all unique taxid values in the set
    "genes": [
        {
            "taxid": 9606,      // New optional field for each gene
            ...
        },
        {
            "taxid": 10090,
            ...
        },
        ...
    ]
}
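
As a rough illustration of the array proposal, the top-level "taxid" list could be derived from the per-gene values, and the same per-gene field gives users a simple way to filter by species. The helper names collect_taxids and filter_by_taxid and the "symbol" field are assumptions made for this sketch, not part of the schema discussed here.

```python
from typing import List


def collect_taxids(geneset: dict) -> List[int]:
    """Return the sorted unique taxids found on the genes of a geneset."""
    return sorted({gene["taxid"] for gene in geneset["genes"] if "taxid" in gene})


def filter_by_taxid(geneset: dict, taxid: int) -> List[dict]:
    """Return only the genes annotated with the given taxid."""
    return [gene for gene in geneset["genes"] if gene.get("taxid") == taxid]


# Example document; "symbol" is a placeholder field for readability.
geneset = {
    "genes": [
        {"symbol": "CDK2", "taxid": 9606},
        {"symbol": "Cdk2", "taxid": 10090},
    ]
}

# Derive the top-level field instead of maintaining it by hand.
geneset["taxid"] = collect_taxids(geneset)
print(geneset["taxid"])                 # [9606, 10090]
print(filter_by_taxid(geneset, 10090))  # [{'symbol': 'Cdk2', 'taxid': 10090}]
```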

ravila4 commented 3 years ago

Possible applications could be storing sets of homologs/orthologs.

ravila4 commented 1 year ago

This feature has already been implemented: it is possible to create genesets with genes from multiple organisms, and each gene is annotated with the taxid of its source organism.