geneontology / noctua-models

This is the data repository for the models created and edited with the Noctua tool stack for GO.
http://noctua.geneontology.org/
Creative Commons Attribution 4.0 International

remove models with redundant IRIs #158

Open goodb opened 3 years ago

goodb commented 3 years ago

The dev branch of noctua-models contains a number of models that share their IRI with another model.

For example, there is a model named WB_WBGene00011688 and a model named 323f7ea5-6d4b-4d54-a555-386c6df7a9c6, and both have the model IRI http://model.geneontology.org/323f7ea5-6d4b-4d54-a555-386c6df7a9c6

We can't have multiple models with the same IRI in the minerva triple store. Right now, the model loader loads the first one it sees, then reports an error and skips any other model that uses the same IRI.
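A rough way to list the affected files is sketched below. It is only an illustration, not the loader's own check: it assumes each .ttl file declares its model IRI as the subject of an owl:Ontology statement written with prefixed names, and `model_iri` and `find_shared_iris` are hypothetical helper names.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: each model file declares its IRI as the subject of an
# owl:Ontology statement, e.g.
#   <http://model.geneontology.org/...> rdf:type owl:Ontology
ONTOLOGY_DECL = re.compile(
    r"<(http://model\.geneontology\.org/[^>\s]+)>\s+(?:a|rdf:type)\s+owl:Ontology\b"
)


def model_iri(ttl_text):
    """Return the first model IRI declared in the Turtle text, or None."""
    match = ONTOLOGY_DECL.search(ttl_text)
    return match.group(1) if match else None


def find_shared_iris(models_dir):
    """Map each model IRI declared by more than one .ttl file to those file names."""
    files_by_iri = defaultdict(list)
    for path in sorted(Path(models_dir).glob("*.ttl")):
        iri = model_iri(path.read_text(encoding="utf-8"))
        if iri is not None:
            files_by_iri[iri].append(path.name)
    return {iri: names for iri, names in files_by_iri.items() if len(names) > 1}
```

Calling `find_shared_iris` on the directory that holds the model files returns only the IRIs that appear in two or more files. A regex is used here to avoid extra dependencies; a real check would more safely parse the Turtle with an RDF library.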

dustine32 commented 3 years ago

Not sure yet, but my current theory is that the UUID.ttl files were created/exported during the bulk taxon update. Here are the histories for WB_WBGene00011688.ttl and 323f7ea5-6d4b-4d54-a555-386c6df7a9c6.ttl.

The UUID.ttl file has the added model-level in_taxon property, which we want. But since the MOD import code still generates a random UUID for the model at each run, it will probably be difficult to match updated models to their stale versions the next time a fresh batch is cooked. This might also be moot, as I've already implemented writing out multiple models to a single N-Quads (.nq) file.

So I guess I'm now just looking for confirmation from @goodb @kltm that I should just delete the gene_id.ttl (WB_WBGene00011688.ttl) files and keep the UUID.ttl (323f7ea5-6d4b-4d54-a555-386c6df7a9c6.ttl) files. We can figure out later how the next MOD import load will identify which files to delete or replace. Does this sound like a good plan?

goodb commented 3 years ago

@dustine32 I'm not sure about your plans regarding the multiple-models-per-file option. If you are going with one model per gene, I think it would be a better idea to replace the contents of the gene-named files with the correct data (with the taxon) and drop the UUID-titled files. As long as we are using GitHub for this, keeping the file names for the same things stable across changes is better.
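A minimal sketch of that replacement plan, assuming the duplicates are already known as a mapping from model IRI to the file names that declare it. The function name `plan_replacements` is hypothetical, and the sketch only produces a plan; it does not touch any files.

```python
def plan_replacements(shared):
    """Pair each UUID-named file with the gene-named file sharing its IRI.

    `shared` maps a model IRI to the list of file names declaring it.
    Returns (source, target) pairs: the source content would overwrite the
    target, after which the source file would be dropped.
    """
    plan = []
    for iri, names in sorted(shared.items()):
        # Assumption: the UUID-named file is named after the last IRI segment.
        uuid_name = iri.rsplit("/", 1)[-1] + ".ttl"
        if uuid_name not in names:
            continue  # no UUID-named copy; leave for manual review
        others = [name for name in names if name != uuid_name]
        if len(others) != 1:
            continue  # ambiguous; leave for manual review
        plan.append((uuid_name, others[0]))
    return plan
```

For the example in this issue, the plan would pair 323f7ea5-6d4b-4d54-a555-386c6df7a9c6.ttl (source) with WB_WBGene00011688.ttl (target), so the gene-named file keeps its name and history while picking up the taxon data.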

goodb commented 3 years ago

@dustine32 If the idea is to shift to using large multi-model N-Quads files, we would want to check on the model loading process first. Assuming that works, we can then drop all the other previous forms here.