DataONEorg / slinky

Slinky, the DataONE Graph Store
Apache License 2.0

access datasets from DataONE updates queue #13

Open mbjones opened 3 years ago

mbjones commented 3 years ago

To process all data that comes into DataONE, we need efficient access to the metadata files, and we need to be notified when new revisions or changes to SystemMetadata are available. The MetaDIG processor and the DataONE index processor face the same access problem from k8s, so we can likely use the same solution. We have discussed making the metadata documents available from a known location on a read-only Ceph filesystem that is mounted into the appropriate containers. A queue system would be notified when PID changes occur that need to be processed by various subscribers, and those subscribers would then access the data directly from the Ceph filesystem without making a REST call to Metacat (and without making a cached copy).
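As a rough illustration of the subscriber side, here is a minimal sketch. It rests on several assumptions that are not settled anywhere in this issue: the standard-library `queue.Queue` stands in for whatever queue system is chosen, the mount point `/var/data/metadata` is hypothetical, and the layout (one file per PID, named by the percent-encoded PID) is invented for the example. The names `pid_to_path` and `drain` are likewise illustrative only.

```python
import queue
import urllib.parse
from pathlib import Path

# Hypothetical mount point for the read-only Ceph filesystem.
METADATA_ROOT = Path("/var/data/metadata")


def pid_to_path(pid: str, root: Path = METADATA_ROOT) -> Path:
    """Map a PID to a file under the mount.

    Assumed layout: the file name is the percent-encoded PID, so that
    characters such as '/' and ':' are safe in a file name.
    """
    return root / urllib.parse.quote(pid, safe="")


def drain(changes: "queue.Queue[str]", handle, root: Path = METADATA_ROOT) -> int:
    """Handle every PID currently queued; return how many were handled.

    Each document is read straight from the mounted filesystem, with no
    REST call and no cached copy. PIDs whose file is missing are skipped.
    """
    handled = 0
    while True:
        try:
            pid = changes.get_nowait()
        except queue.Empty:
            return handled
        path = pid_to_path(pid, root)
        if path.is_file():
            handle(pid, path.read_bytes())
            handled += 1
```

A real subscriber would block on the queue rather than drain it once, and would need a policy for PIDs that are announced before their file is visible on the mount.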

The related Metacat issue for designing such a system is https://github.com/NCEAS/metacat/issues/1436.

ThomasThelen commented 3 years ago

I definitely like the idea of the mounted filesystem. Under this architecture, it might be possible to get rid of the Solr queries and requests that are made while initially populating the graph. I imagine we'd instead just crawl the mounted filesystem and process each file. If it's a new instance of the graph, the files can be processed sequentially without worrying about whether they've already been processed.
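A minimal sketch of that crawl, assuming the metadata documents are plain files somewhere under the mount point. The function names and the `process` callback are hypothetical and are not taken from Slinky's code:

```python
import os
from pathlib import Path
from typing import Callable, Iterator


def iter_metadata_files(root: Path) -> Iterator[Path]:
    """Yield every file under root in a stable, sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sort in place so os.walk descends deterministically
        for name in sorted(filenames):
            yield Path(dirpath) / name


def populate_graph(root: Path, process: Callable[[Path], None]) -> int:
    """Process each file once, sequentially; return the number processed.

    For a fresh graph instance there is no need to check whether a file
    has been seen before, so no bookkeeping is done here.
    """
    count = 0
    for path in iter_metadata_files(root):
        process(path)
        count += 1
    return count
```

The sorted traversal only makes runs repeatable; it says nothing about the order in which revisions of the same object should be applied.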

It also sounds like we might want a REST endpoint so that Slinky can be properly alerted when there's a new file that needs to be processed.

I think this warrants a bigger discussion around the architecture of the DataONE services. Is Slinky going to share its Redis instance with other DataONE services? If so, maybe we'd be better off using that instead of a REST endpoint.