clearlydefined / service

The service side of clearlydefined.io

Automatically harvest source locations from curations #560

Open iamwillbar opened 5 years ago

iamwillbar commented 5 years ago

I'm not sure if we do this today, but if we don't, we should :). When someone submits a curation that specifies a source location, we should automatically queue that location for harvesting.

tmarble commented 4 years ago

We will check curations for sourceLocations when a user's contribution has been "sync'd" (after pressing the "Contribute" button at least once).

Presumably the website client will call this endpoint upon clicking 'Sync': https://github.com/clearlydefined/service/blob/master/routes/curations.js#L92

That endpoint calls the function syncAllContributions in https://github.com/clearlydefined/service/blob/master/providers/curation/github.js#L45

which in turn calls _processContributions(prs) in github.js at https://github.com/clearlydefined/service/blob/master/providers/curation/github.js#L62

This is the call site where we can inspect each storedContribution to see if it specifies a sourceLocation, and, if so, add it to the harvest queue.
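A minimal sketch of that inspection step follows. Since the structure of storedContribution is not confirmed, it assumes each contribution resembles the curation YAML fixtures (a `revisions` map whose entries may carry `described.sourceLocation`). The helper names, the `type/provider/namespace/name/revision` path layout, and the shape of the request handed to `harvest` are all hypothetical, not taken from the service code.

```javascript
// Sketch only: the shape of a contribution is an ASSUMPTION modelled on the
// curation fixtures (revisions.<revision>.described.sourceLocation).

// Collect every sourceLocation named in a curation-shaped object.
function findSourceLocations(curation) {
  const revisions = (curation && curation.revisions) || {}
  return Object.values(revisions)
    .map(revision => revision && revision.described && revision.described.sourceLocation)
    .filter(Boolean)
}

// Build a coordinates-style path from a sourceLocation.
// ASSUMPTION: the "type/provider/namespace/name/revision" layout and the "-"
// placeholder for a missing namespace are illustrative only.
function toCoordinatesPath(location) {
  const { type, provider, namespace, name, revision } = location
  return [type, provider, namespace || '-', name, revision].join('/')
}

// Queue each distinct source location found in the given contributions.
// `harvest` is a HYPOTHETICAL async callback standing in for whatever the
// service uses to enqueue a harvest request.
async function queueSourceLocations(contributions, harvest) {
  const seen = new Set()
  for (const contribution of contributions) {
    for (const location of findSourceLocations(contribution)) {
      const path = toCoordinatesPath(location)
      if (seen.has(path)) continue // skip duplicates within one sync
      seen.add(path)
      await harvest({ coordinates: path })
    }
  }
  return [...seen]
}
```

Passing `harvest` in as a parameter keeps the sketch independent of which queue implementation ends up behind it, and makes the step easy to exercise in a test with a stub.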

NOTE: It would appear that the current test coverage doesn't yet have an example of actually calling syncAllContributions as it is simply mocked in https://github.com/clearlydefined/service/blob/master/test/providers/curation/processTest.js#L22

In particular it is not evident what the expected structure of storedContribution can be and it cannot be deduced from test examples.

Assuming we did have a sourceLocation example, like the one at https://github.com/clearlydefined/service/blob/master/test/fixtures/curation-valid.1.yaml#L24, the proper call path to enqueue the sourceLocation is still not evident.

Presumably, depending on configuration, this would ultimately call either an Azure queuing function: https://github.com/clearlydefined/service/blob/master/providers/queueing/azureStorageQueue.js#L27 or a memory queuing function: https://github.com/clearlydefined/service/blob/master/providers/queueing/memoryQueue.js#L21

It is possible that the common call location for enqueuing is (via the harvest function): https://github.com/clearlydefined/service/blob/master/providers/harvest/crawlerQueue.js#L13

The crawler is constructed here: https://github.com/clearlydefined/service/blob/master/providers/harvest/crawlerQueueConfig.js#L24

..from here: https://github.com/clearlydefined/service/blob/master/providers/harvest/crawlerConfig.js#L7

..from here: https://github.com/clearlydefined/service/blob/master/providers/index.js#L46

..from here: https://github.com/clearlydefined/service/blob/master/bin/config.js#L50

Thus it is not evident which is the proper namespace to require at the top of github.js to invoke the harvest function to enqueue the new sourceLocation.
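One possible way around the question of what github.js should require is to hand the harvest service to the curation service at construction time, so that the configuration code decides which queue sits behind it. The sketch below illustrates that wiring under stated assumptions: every class and function name in it (`MemoryQueue`, `createQueue`, `Harvester`, `CurationSync`) is hypothetical and does not correspond to the modules linked above.

```javascript
// HYPOTHETICAL wiring sketch: none of these names come from the service code.

// Stand-in for an in-memory queue provider.
class MemoryQueue {
  constructor() {
    this.messages = []
  }
  async queue(message) {
    this.messages.push(message)
  }
}

// Pick a queue implementation from configuration.
// Only a 'memory' provider is sketched; a real setup would register others.
function createQueue(config) {
  const factories = { memory: () => new MemoryQueue() }
  const factory = factories[config.provider]
  if (!factory) throw new Error(`Unknown queue provider: ${config.provider}`)
  return factory()
}

// Wraps a queue behind a single harvest(request) entry point.
class Harvester {
  constructor(queue) {
    this.queue = queue
  }
  async harvest(request) {
    await this.queue.queue(JSON.stringify(request))
  }
}

// The curation side receives the harvester instead of requiring it directly.
class CurationSync {
  constructor(harvester) {
    this.harvester = harvester
  }
  async onSourceLocation(coordinates) {
    await this.harvester.harvest({ coordinates })
  }
}
```

With this arrangement the curation code never needs to know whether the Azure or the memory queue is in use, and a test can construct `CurationSync` with a stub harvester.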