Make Gemini data portable between sites

dannylamb commented 3 years ago

Right now, Gemini internally stores full URLs for both Drupal and Fedora, which unintentionally bakes the domain(s) into your content. This makes it difficult to keep your content but change domains, which happens more often than you would think. Moving content from a test server to a prod server is effectively blocked by these baked-in domains.

I'm not saying Gemini's response structure should change. I still want it to give me two full URLs in the response, just without storing those full URLs directly in the database. If we store relative URLs, then we have options for how we want to deal with changing domains. Ideally one would just configure the Drupal and Fedora base URLs and Gemini would do the rest. That way we can dump and then re-import data on different servers with different domains.
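To make the idea concrete, here is a minimal sketch in Python (not Gemini's actual PHP code). It assumes the database holds only the part of each URL that follows a configured base URL, and that the full URLs are rebuilt at response time. The names `BaseUrls`, `strip_base`, `to_full`, and `build_response`, the row keys, and the example hosts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseUrls:
    """Hypothetical per-deployment configuration."""
    drupal: str
    fedora: str


def strip_base(url: str, base: str) -> str:
    """Return the part of `url` after `base`, which is what would be stored."""
    base = base.rstrip("/")
    if not url.startswith(base + "/"):
        raise ValueError(f"{url!r} is not under base {base!r}")
    return url[len(base):]


def to_full(base: str, relative: str) -> str:
    """Rebuild a full URL from the configured base and a stored relative path."""
    return base.rstrip("/") + "/" + relative.lstrip("/")


def build_response(row: dict, bases: BaseUrls) -> dict:
    """The response still carries two full URLs, as it does today."""
    return {
        "drupal": to_full(bases.drupal, row["drupal_path"]),
        "fedora": to_full(bases.fedora, row["fedora_path"]),
    }


# Rows written on a test server...
test = BaseUrls("http://test.example.org", "http://test.example.org:8080/fcrepo/rest")
row = {
    "drupal_path": strip_base("http://test.example.org/node/1", test.drupal),
    "fedora_path": strip_base("http://test.example.org:8080/fcrepo/rest/ab/cd/1234", test.fedora),
}

# ...resolve against prod base URLs after a plain dump and re-import.
prod = BaseUrls("https://prod.example.org", "https://prod.example.org:8080/fcrepo/rest")
response = build_response(row, prod)
```

With this shape, changing domains is a configuration change only, and the stored rows never need rewriting.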

seth-shaw-unlv commented 3 years ago

But then you would need two separate Gemini services for multi sites...

dannylamb commented 3 years ago

Yes, just making domain straight config would introduce that for a multi site. That's not so big of a deal if you're running ISLE, but I see how that would be an unwelcome complication otherwise.

The big deal here is that the data is not portable. As long as there's some way I can move it from test to prod without sed'ing a db dump (which I'm currently doing :vomiting_face: ), I'm happy.

dannylamb commented 3 years ago

@seth-shaw-unlv I see two acceptable paths forward here. One is to keep the current behaviour as the default but respect configuration for domains if provided. The other is to write a Symfony console command in Gemini to migrate domains, and not to allow any sort of configuration. There are pros and cons to both, so I'm interested in input before I go down one path or the other.

If we push it into configuration, it's essentially a toggle on that behaviour, which is acceptable. If you wanted to migrate a multi-site though you'd still be stuck in the position we're in now.

If we go the 'migration' route, it could take a long time to run on large datasets, and it would generally be a less clean solution. There's also no single way to do regex updates that works on both MySQL and PostgreSQL, so I'll probably have to update the db schema to separate domains from relative paths. Then if we 'migrate' that data, it can be a simple update query on the domain column instead of regexery.
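A rough sketch of what that schema split could look like, using Python's built-in `sqlite3` as a stand-in for MySQL/PostgreSQL. The table and column names are hypothetical, not Gemini's real schema. The point is only that, once the domain lives in its own column, the migration is a plain equality `UPDATE` that should work the same on either database.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with domains split from relative paths."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE mapping (
            uuid          TEXT PRIMARY KEY,
            drupal_domain TEXT NOT NULL,
            drupal_path   TEXT NOT NULL,
            fedora_domain TEXT NOT NULL,
            fedora_path   TEXT NOT NULL
        )
        """
    )
    return conn


def migrate_domain(conn: sqlite3.Connection, column: str, old: str, new: str) -> int:
    """Swap one domain for another; returns the number of rows changed."""
    # Column names cannot be bound as parameters, so allow only known ones.
    if column not in ("drupal_domain", "fedora_domain"):
        raise ValueError(f"unexpected column: {column}")
    cur = conn.execute(
        f"UPDATE mapping SET {column} = ? WHERE {column} = ?", (new, old)
    )
    conn.commit()
    return cur.rowcount


conn = make_db()
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "http://test-a.example.org", "/node/1",
         "http://fedora-test.example.org:8080", "/fcrepo/rest/ab/u1"),
        ("u2", "http://test-b.example.org", "/node/1",
         "http://fedora-test.example.org:8080", "/fcrepo/rest/cd/u2"),
    ],
)

# Only rows for the matching site are touched, so a multi-site can be
# migrated one domain at a time.
changed = migrate_domain(
    conn, "drupal_domain", "http://test-a.example.org", "https://prod-a.example.org"
)
```

The update still has to visit every matching row, so the concern about run time on large datasets remains.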

elizoller commented 3 years ago

Personally, I'd lean towards the config on/off for storing the whole URL versus the relative path. Getting into maintaining multiple regex update methods for mysql and postgres sounds complicated and fragile.

seth-shaw-unlv commented 3 years ago

> Personally, I'd lean towards the config on/off for storing the whole URL versus the relative path. Getting into maintaining multiple regex update methods for mysql and postgres sounds complicated and fragile.

I'll second this.

dannylamb commented 3 years ago

I feel confident in proceeding with a toggle and working out the kinks from there.

dannylamb commented 3 years ago

So... something I'm noticing here. This totally will mess with the recast service, which needs to look things up by the fedora uri. I'm trying to wrap my head around how I can thread this needle here.

I'll ask point blank, is anyone using the recast service? We built it for a perceived need at the time, but I don't know if that need has manifested.

elizoller commented 3 years ago

we are not using it right now because we aren't exposing our fedora or triplestore to the public. should we decide to expose our RDF to the world - i would want to do so using recast because the URIs seem more accurate. but at this point that need hasn't actually arisen.

seth-shaw-unlv commented 3 years ago

I think we are going to use a separate indexing strategy for our public triplestore and no one is going to have access to our Fedora, so we probably won't need it.

elizoller commented 3 years ago

"separate indexing strategy" - as in not blazegraph?

dannylamb commented 3 years ago

i guess the findByUri in Gemini will just need to respect the same configuration. i'm still working through the particulars of all that though. and i'm not sure of all the implications of switching recast to use the fedora path instead of the uri. so i've got a bit to sort out here. i'll report back.
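As a hedged illustration of a lookup that "respects the same configuration", here is a small Python sketch. It is not Gemini's real `findByUri`. The function `find_by_uri`, the dict-backed store, and the example hosts are made up for the example. When a Fedora base URL is configured, the incoming URI is reduced to its relative path before the lookup. When it isn't, the full URI is looked up as stored, which matches the current behaviour.

```python
from typing import Optional


def find_by_uri(store: dict, uri: str, fedora_base: Optional[str] = None) -> Optional[str]:
    """Look up the Drupal side of a mapping by Fedora URI.

    `store` maps whatever was persisted for Fedora (a full URI or a
    relative path, depending on the toggle) to the Drupal value.
    """
    if fedora_base is None:
        # Toggle off: full URIs are stored, so look the URI up unchanged.
        key = uri
    else:
        base = fedora_base.rstrip("/")
        if not uri.startswith(base + "/"):
            # URI belongs to some other Fedora; nothing to find here.
            return None
        key = uri[len(base):]
    return store.get(key)


# Toggle on: relative paths stored, base URL comes from configuration.
relative_store = {"/ab/cd/1234": "/node/1"}
hit = find_by_uri(
    relative_store,
    "https://prod.example.org:8080/fcrepo/rest/ab/cd/1234",
    fedora_base="https://prod.example.org:8080/fcrepo/rest",
)

# Toggle off: full URIs stored, same as today.
full_store = {"http://test.example.org:8080/fcrepo/rest/ab/cd/1234": "http://test.example.org/node/1"}
legacy_hit = find_by_uri(full_store, "http://test.example.org:8080/fcrepo/rest/ab/cd/1234")
```

A caller like recast could keep passing full Fedora URIs under this arrangement, though whether that covers all of its needs is exactly the open question.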

seth-shaw-unlv commented 3 years ago

It will be a separate, publicly queryable instance of Blazegraph, but we probably won't be reusing the indexing action since the version of JSON-LD it gets is based on the user that creates/updates the node/term/media. We will probably use the event dispatcher to have a PHP script grab the Anon user's view of JSON-LD and POST that to Blazegraph.

dannylamb commented 3 years ago

I think I've managed to get this sorted out without having to migrate the data. I'm updating tests now and I'll throw up a PR.

seth-shaw-unlv commented 3 years ago

Can we close this ticket with https://github.com/Islandora/Crayfish/commit/afdb8d02f3090b5f05a9ff766ae14e375830fc47, or is there something else we need to address first?

dannylamb commented 3 years ago

We've ditched Gemini, so closing. This issue just highlights how weird and bug-prone it can be.

Islandora / documentation

Make Gemini data portable between sites #1664