OpenBeta / openbeta-graphql

The open source rock climbing API
https://openbeta.io
GNU Affero General Public License v3.0

feat(api): add /import endpoint to support bulk importing json data #390

Silthus closed this pull request 9 months ago

Silthus commented 9 months ago

Hey Openbeta Team :)

As discussed in https://github.com/OpenBeta/open-tacos/issues/1031, it would be great to have a page where users can bulk-import data using a simple, validated JSON schema.

This PR adds the required functionality to the backend, enabling the integration of a bulk import in the frontend. The frontend is also actively being worked on by @l4u532.

What this PR does:

Let me know what you think. I will happily make any adjustments to the PR.

Silthus commented 9 months ago

Thanks @vnugent for the feedback on Discord about discussing the pros and cons of using the existing GraphQL API vs. creating a new REST endpoint first.
For documentation purposes, I will continue the discussion here.


First off: you are absolutely right to discuss the pros and cons before going forward with such a change. You have built an amazing product, and we are very happy to hear your thoughts and feedback about the change we are about to implement.
Frankly, we were so deep into coding that we forgot to share our train of thought. So here is the pro and con list we went through before deciding to go with the extra /import endpoint. We are eager to hear your thoughts and are more than willing to change things if you disagree or have concerns.

REST vs GraphQL for Bulk-Import of Data

[Draft] Example import.json

```json
{
  "areas": [
    {
      "areaName": "Utah",
      "countryCode": "us",
      "children": [
        {
          "areaName": "Southeast Utah",
          "children": [
            {
              "areaName": "Indian Creek",
              "children": [
                {
                  "areaName": "Supercrack Buttress",
                  "climbs": [
                    {
                      "name": "The Key Flake",
                      "grade": "5.10",
                      "fa": "unknown",
                      "type": { "trad": true },
                      "safety": "UNSPECIFIED",
                      "metadata": {
                        "lnglat": {
                          "type": "Point",
                          "coordinates": [-109.54552, 38.03635]
                        },
                        "left_right_index": 1
                      },
                      "content": {
                        "description": "Cool off-width that requires off-width and face skills.",
                        "protection": "Anchors hidden up top. Need 80m to make it all the way down.",
                        "location": "Opposite keyhole flake. Obvious right leaning offwidth that starts atop 20 ft boulder."
                      }
                    },
                    {
                      "name": "Incredible Hand Crack",
                      "grade": "5.10",
                      "fa": "Rich Perch, John Bragg, Doug Snively, and Anne Tarver, 1978",
                      "type": { "trad": true },
                      "safety": "UNSPECIFIED",
                      "metadata": {
                        "lnglat": {
                          "type": "Point",
                          "coordinates": [-109.54552, 38.03635]
                        },
                        "left_right_index": 2
                      },
                      "content": {
                        "description": "Route starts at the top of the trail from the parking lot to Supercrack Buttress.",
                        "protection": "Cams from 2-2.5\". Heavy on 2.5\" (#2 Camalot)",
                        "location": ""
                      },
                      "pitches": [
                        {
                          "pitchNumber": 1,
                          "grade": "5.10",
                          "disciplines": { "trad": true },
                          "length": 100,
                          "boltsCount": 0,
                          "description": "A classic hand crack that widens slightly towards the top. Requires a range of cam sizes. Sustained and excellent quality."
                        },
                        {
                          "pitchNumber": 2,
                          "grade": "5.9",
                          "disciplines": { "trad": true },
                          "length": 30,
                          "boltsCount": 0,
                          "description": "Easier climbing with good protection. Features a mix of crack sizes. Shorter than the first pitch but equally enjoyable."
                        }
                      ]
                    }
                  ],
                  "gradeContext": "US",
                  "metadata": {
                    "isBoulder": false,
                    "isDestination": false,
                    "leaf": true,
                    "lnglat": {
                      "type": "Point",
                      "coordinates": [-109.54552, 38.03635]
                    },
                    "bbox": [-109.54609091005857, 38.03590033981814, -109.54494908994141, 38.03679966018186]
                  },
                  "content": { "description": "" }
                }
              ],
              "metadata": {
                "lnglat": {
                  "type": "Point",
                  "coordinates": [-109.5724044642857, 38.069429035714286]
                }
              },
              "content": {
                "description": "Indian Creek is a crack climbing mecca in the southeastern region of Utah, USA. Located within the [Bears Ears National Monument](https://en.wikipedia.org/wiki/Bears_Ears_National_Monument)."
              }
            }
          ]
        }
      ]
    }
  ]
}
```
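A structural check of this document can be sketched in plain TypeScript. This is only an illustration of the validation idea; the PR itself validates against a JSON schema, and the names `ImportArea`, `ImportClimb`, and `validateImport` are assumptions, not the PR's actual types.

```typescript
// Recursive structural check for the import document. Field names
// mirror the example above; type and function names are hypothetical.
interface ImportClimb {
  name: string
  grade?: string
}

interface ImportArea {
  areaName: string
  countryCode?: string
  children?: ImportArea[]
  climbs?: ImportClimb[]
}

function validateArea (area: any, path: string, errors: string[]): void {
  if (typeof area?.areaName !== 'string' || area.areaName.length === 0) {
    errors.push(`${path}: missing or empty "areaName"`)
  }
  for (const [i, child] of (area?.children ?? []).entries()) {
    validateArea(child, `${path}.children[${i}]`, errors)
  }
  for (const [i, climb] of (area?.climbs ?? []).entries()) {
    if (typeof climb?.name !== 'string' || climb.name.length === 0) {
      errors.push(`${path}.climbs[${i}]: missing or empty "name"`)
    }
  }
}

function validateImport (doc: { areas?: ImportArea[] }): string[] {
  const errors: string[] = []
  if (!Array.isArray(doc?.areas)) {
    errors.push('root: "areas" must be an array')
    return errors
  }
  doc.areas.forEach((a, i) => validateArea(a, `areas[${i}]`, errors))
  return errors
}
```

Collecting error paths instead of throwing on the first problem lets the frontend show the user everything that is wrong with an uploaded file at once.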

Bulk Importing on the Server Side Using a New REST Endpoint

This is done using a separate endpoint, swapping the apollo-server package for the apollo-server-express package (see here for more details).
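As a rough sketch of that wiring (a server-setup fragment, not the PR's code: the `/import` path follows the PR title, while `importJson` is a hypothetical stand-in for the DB operations in import-json.ts):

```typescript
// Mount a plain REST route next to the GraphQL server by switching
// from `apollo-server` to `apollo-server-express` (v3-style API).
import express from 'express'
import { ApolloServer } from 'apollo-server-express'
import { typeDefs, resolvers } from './schema' // hypothetical module
import { importJson } from './import-json'     // hypothetical export

async function start (): Promise<void> {
  const app = express()
  const server = new ApolloServer({ typeDefs, resolvers })
  await server.start()
  server.applyMiddleware({ app, path: '/' })

  // The extra REST endpoint for bulk imports
  app.post('/import', express.json({ limit: '10mb' }), async (req, res) => {
    try {
      // importJson would wrap all writes in a single transaction
      const result = await importJson(req.body)
      res.status(200).json(result)
    } catch (e) {
      res.status(400).json({ error: (e as Error).message })
    }
  })

  app.listen(4000)
}
```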

Pros

Cons

Importing via GQL

This would be done by processing the JSON document on the client, constructing a GraphQL query, and sending it to the server.
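A minimal sketch of the client side of this variant (the mutation name `bulkImportAreas` and the input shape are assumptions for illustration):

```typescript
// Build the HTTP body for a hypothetical GraphQL bulk mutation from a
// parsed import document.
const BULK_IMPORT_MUTATION = `
  mutation BulkImport($input: BulkImportInput!) {
    bulkImportAreas(input: $input) {
      addedAreaIds
    }
  }
`

function buildBulkImportRequest (doc: object): string {
  return JSON.stringify({
    query: BULK_IMPORT_MUTATION,
    variables: { input: doc }
  })
}

// The body would then be POSTed to the GraphQL endpoint, e.g.:
// fetch('/graphql', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: buildBulkImportRequest(doc)
// })
```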

Pros

Cons

Conclusion

We believe that doing the little bit of extra work and implementing a custom endpoint for importing bulk data is the right way to go.

The main reason for that is transaction safety: a single .json file can be imported with multiple retries to recover from possible errors, without aborting the import partway through and leaving the database in a partial state.
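The retry-inside-a-transaction idea can be sketched generically (names are hypothetical; the real implementation would run `work` inside a MongoDB session via `withTransaction`, so a failed attempt discards all of its writes):

```typescript
// All-or-nothing import with bounded retries: either every write in
// `work` commits, or the attempt is abandoned and retried.
async function importWithRetries<T> (
  work: () => Promise<T>,
  maxRetries: number = 3
): Promise<T> {
  let lastError: unknown
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      // In the real code: session.withTransaction(work), so a failure
      // anywhere rolls back all writes of this attempt.
      return await work()
    } catch (e) {
      lastError = e
    }
  }
  throw lastError
}
```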

The second biggest reason we prefer the server-side import is the additional complexity the GraphQL variant would add whenever the GraphQL endpoints change. By separating the importer from the regular API, a new contributor or existing collaborator changing the GraphQL API does not need to touch the bulk importer's frontend. We also don't really know how the GraphQL approach would behave with large amounts of data.

Plans for the Future

Here are some additional ideas and thoughts we had regarding the import feature, which also influenced our decision:

Export -> Edit -> Import Workflow

The idea is to add an export function that allows exporting selected areas or regions with individual properties into a single json file which then can be edited and uploaded using the import page.
This would allow for a very easy way to update existing data. The export would be similar to the existing openbeta-export, except that it produces a single json file in the schema required by the import.
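Such an export boils down to a mapping from the stored area documents into the import schema shown above; a sketch, where the stored field names (`area_name`, `children`) are assumptions about the DB shape:

```typescript
// Map a stored area document (shape assumed) to the import schema,
// recursing over children.
interface StoredArea {
  area_name: string
  children?: StoredArea[]
}

interface ExportedArea {
  areaName: string
  children?: ExportedArea[]
}

function toImportSchema (area: StoredArea): ExportedArea {
  const out: ExportedArea = { areaName: area.area_name }
  if (area.children != null && area.children.length > 0) {
    out.children = area.children.map(toImportSchema)
  }
  return out
}
```

Because the export emits exactly the schema the importer validates, a round trip of export, edit, import needs no extra conversion step.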

Data Cleanup

One issue with OpenBeta is data quality. Adding such an export-and-import workflow could allow individuals, including non-developers, to start cleaning up data at a larger scale.

vnugent commented 9 months ago

@Silthus I'd be more comfortable if we still use GQL as the interface (not introducing REST in this PR). You can create a new GQL bulk mutation and reuse all the DB operations in import-json.ts. There's no need to call the existing update APIs.

Silthus commented 9 months ago

OK, I can understand why you are more comfortable with sticking to the GraphQL only API.

But I am not quite sure what you mean. Can you share a small example snippet or a link to some docs? I just need a direction to head in and will change the bulk import accordingly.

vnugent commented 9 months ago

You can piggyback on the Area GQL schema. I'm using the updateAreaSortingOrder mutation as an example.

  1. Add a new mutation definition to AreaEdit.gql, e.g. bulkUpload or whatever.
    • You'll need to explicitly define the input param. This is the schema/structure of the bulk json object. This helps with param validation and provides schema for GQL explorers.
    • Optionally, you can also create a Custom scalar to encapsulate the bulk upload payload. Example of existing custom scalar
  2. Add a GQL mutation mapping to JS model (analogous to REST 'controller')
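Putting the two steps together, a sketch of the SDL addition and the resolver mapping, following the updateAreaSortingOrder pattern (the mutation name, input type, and `bulkImportJson` model function are placeholders, not the repo's actual names):

```typescript
// 1. SDL added to AreaEdit.gql (shown as a string for brevity):
const typeDefs = `
  input BulkImportInput {
    areas: [AreaInput!]!
  }

  type Mutation {
    bulkImportAreas(input: BulkImportInput!): [ID!]
  }
`

// 2. Resolver mapping the mutation to the JS model layer (analogous
// to a REST controller). `bulkImportJson` stands in for the DB
// operations in import-json.ts.
interface AreaDataSource {
  bulkImportJson: (doc: unknown) => Promise<string[]>
}

const resolvers = {
  Mutation: {
    bulkImportAreas: async (
      _: unknown,
      { input }: { input: { areas: unknown[] } },
      { dataSources }: { dataSources: { areas: AreaDataSource } }
    ): Promise<string[]> => {
      return await dataSources.areas.bulkImportJson(input)
    }
  }
}
```

Defining the input type explicitly in SDL is what gives the param validation and GQL-explorer schema mentioned in step 1.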
Silthus commented 9 months ago

@vnugent thanks that helped a lot!

I adjusted the code accordingly; the only things left to fix are some failing tests where the input doesn't match up.

Let me know what you think of the changes.

I intentionally kept the swap to the apollo-server-express package in there to decouple the tests from a real running server listening on a localhost port. This significantly speeds up testing and reduces complexity without any disadvantages.

Silthus commented 9 months ago

I have a small request to check for both null and undefined. Other than that, we're good to merge.

Thanks, I learned something again today :)