brain-bican / metadata-schemas

A repository for BICAN metadata schemas and their development.

SSSOM gene mapping examples #12

Closed patrick-lloyd-ray closed 1 year ago

patrick-lloyd-ray commented 1 year ago

@satra @lydiang

These examples use SSSOM and show the functionality in two forms: one mapping across species, one within a species. For revisions, we'd use the within-species form and specify the IDs that tie each mapping to a specific revision (my example maps ensembl to ncbigene, but that's just to illustrate).

Happy to iterate, tie into LinkML models, experiment with how to make this work in our ecosystem once we get the elements correct.

patrick-lloyd-ray commented 1 year ago

i left one question.

the main question i would have is do you see a way to automate this from a mapping released by ensembl/ncbigene? as in, can one write a function or transform?

This would be tremendously helpful.

From what I've seen, there are some tools that will query these databases (https://github.com/pachterlab/gget is an example) but I haven't seen a way to automate this from a mapping released by ensembl/ncbigene.

I will look into this more deeply and give an update in the next 24 hours.

lydiang commented 1 year ago

Regarding mapping.

We need to make sure we have good provenance of this. At each annotation release for Ensembl and NCBI, they also release mappings (NCBI-Ensembl) and orthologs.

There is no guarantee that the mappings from each authority are identical. We should ingest the mappings like we do the annotations, e.g. with version and source.

patrick-lloyd-ray commented 1 year ago

The metadata fields subject_source_version and object_source_version are used to track versions in this model. We can add other fields, if that's not sufficient.
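
As a rough sketch of how those two fields could carry the provenance, the snippet below writes one mapping row as SSSOM-style TSV using only the Python standard library. The column names are SSSOM slots; the source labels, the release labels (`release-110`, `2023-06-01`) and the choice of `semapv:UnspecifiedMatching` are placeholder assumptions, not values taken from the examples in this PR.

```python
import csv
import io

# SSSOM slots used in this sketch; a real mapping set may need more.
SSSOM_COLUMNS = [
    "subject_id", "predicate_id", "object_id", "mapping_justification",
    "subject_source", "subject_source_version",
    "object_source", "object_source_version",
]

def write_sssom_tsv(rows):
    """Serialize a list of mapping dicts to tab-separated text with a header."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=SSSOM_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

example_row = {
    "subject_id": "ensembl:ENSG00000139618",   # BRCA2 in Ensembl
    "predicate_id": "skos:exactMatch",
    "object_id": "ncbigene:675",               # BRCA2 in NCBI Gene
    "mapping_justification": "semapv:UnspecifiedMatching",  # placeholder
    "subject_source": "ensembl",               # placeholder source label
    "subject_source_version": "release-110",   # hypothetical release label
    "object_source": "ncbigene",               # placeholder source label
    "object_source_version": "2023-06-01",     # hypothetical release label
}

tsv_text = write_sssom_tsv([example_row])
print(tsv_text)
```

If the mapping release itself needs its own version (separate from the two annotation releases), that would be the extra field mentioned above.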

patrick-lloyd-ray commented 1 year ago

@satra I can't find any existing tooling that would automate this completely.

There are tools like GIDcon and biomart that can produce ID-mapping tables, but we'd have to curate them manually here and add the metadata. We could schedule a quarterly release/update, though, if it's not a ton of work.

satra commented 1 year ago

@sooyounga and i can figure out how to parse and write as long as there is a simple enough mapping. would it be possible for you to let @sooyounga know which source fields you put into the sssom fields?
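
A minimal sketch of what such a parse-and-write transform might look like, assuming the input is NCBI's tab-separated `gene2ensembl` file (which has `GeneID` and `Ensembl_gene_identifier` columns). The function name, the field assignment, the predicate, and the version strings are all illustrative assumptions, not the mapping used in this PR; the sample rows are made up in the file's layout.

```python
import csv

def gene2ensembl_to_sssom(lines, ncbi_version, ensembl_version):
    """Turn gene2ensembl-style rows into SSSOM-style dicts (hypothetical helper)."""
    reader = csv.DictReader(lines, delimiter="\t")
    seen = set()
    rows = []
    for rec in reader:
        pair = (rec["GeneID"], rec["Ensembl_gene_identifier"])
        if pair in seen:
            continue  # the file has one row per transcript; keep one per gene pair
        seen.add(pair)
        rows.append({
            "subject_id": f"ncbigene:{pair[0]}",
            "predicate_id": "skos:exactMatch",  # assumed predicate
            "object_id": f"ensembl:{pair[1]}",
            "mapping_justification": "semapv:UnspecifiedMatching",  # placeholder
            "subject_source_version": ncbi_version,
            "object_source_version": ensembl_version,
        })
    return rows

# Illustrative input only: two transcript-level rows for the same gene pair.
SAMPLE = (
    "#tax_id\tGeneID\tEnsembl_gene_identifier\tRNA_nucleotide_accession.version\t"
    "Ensembl_rna_identifier\tprotein_accession.version\tEnsembl_protein_identifier\n"
    "9606\t675\tENSG00000139618\t-\t-\t-\t-\n"
    "9606\t675\tENSG00000139618\t-\t-\t-\t-\n"
)

sssom_rows = gene2ensembl_to_sssom(
    SAMPLE.splitlines(), ncbi_version="2023-06-01", ensembl_version="release-110"
)
print(sssom_rows)
```

The version arguments are passed in explicitly so that each ingest records which release the mapping came from.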

patrick-lloyd-ray commented 1 year ago

Yes, the source fields are listed in the table.

I also came across this tool (https://biit.cs.ut.ee/gprofiler/convert), which may be useful.