gbif / registry-console

Apache License 2.0

Remove "old" ingestion history? #331

Open timrobertson100 opened 4 years ago

timrobertson100 commented 4 years ago

We are now in the process of tearing down the old rabbit-based processing infrastructure.

To avoid confusion, I propose that the "old" ingestion history and monitoring be removed from the menus, and that the "pipeline-*" entries be renamed to "Running ingestions" and "Ingestion history".

Thoughts?

timrobertson100 commented 4 years ago

See https://github.com/gbif/registry/issues/182

MattBlissett commented 4 years ago

The "old" ingestion history will live on as the XML crawl information (successful/failed pages), and I think it is currently the only record of a complete failure (i.e. failure to retrieve the DWCA).

Certainly it should be the advanced option, with the pipelines one promoted to first place and renamed.

The old one is also useful for looking back at when a dataset became orphaned, and when the final reasonable ingestion was.

timrobertson100 commented 4 years ago

How about "Ingestion" for the pipeline stuff and "Crawling" for the "old" stuff, since it will still serve as the crawling infrastructure console?

MortenHofft commented 4 years ago

So we should rename as below in translations and urls?

In global drawer menu

On datasets

MattBlissett commented 4 years ago

(Using new names)

marcos-lg commented 4 years ago

In case it's useful, both histories (old one and pipelines) are already merged in one endpoint in the API: https://api.gbif.org/v1/ingestion/history/9675f3d4-930d-4e4c-97e1-bcc7e6c5120d/3
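For anyone wanting to check the merged endpoint from a script, here is a minimal sketch. The URL pattern is taken from the link above; treating the trailing `3` as an attempt number is an assumption, and the function names are illustrative only. Nothing is assumed about the shape of the JSON that comes back.

```python
import json
import urllib.request
from typing import Any, Optional

API_BASE = "https://api.gbif.org/v1"


def ingestion_history_url(dataset_key: str, attempt: Optional[int] = None) -> str:
    """Build the merged ingestion history URL for a dataset.

    Assumption: the last path segment in the example link is an attempt number.
    """
    url = f"{API_BASE}/ingestion/history/{dataset_key}"
    if attempt is not None:
        url = f"{url}/{attempt}"
    return url


def fetch_ingestion_history(dataset_key: str, attempt: Optional[int] = None) -> Any:
    """Fetch and decode the JSON response; the structure is returned as-is."""
    with urllib.request.urlopen(ingestion_history_url(dataset_key, attempt), timeout=30) as resp:
        return json.load(resp)


# Example (performs a network request):
# history = fetch_ingestion_history("9675f3d4-930d-4e4c-97e1-bcc7e6c5120d", 3)
```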

MortenHofft commented 4 years ago

On the dataset "Crawling history", the columns "Received | New | Updated | Unchanged | Failed" only make sense before sometime-in-December when we switched over, but are useful for those dates. Either hide the columns since then, or add a note to the page. The columns will all say 0 once message-based ingestion is switched off, and numbers since that day in December are irrelevant.

They won't show if they aren't provided in the API as zeros
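As a rough illustration of the "hide when there is nothing to show" option, the check could look something like the sketch below. The field names are hypothetical stand-ins for the five columns and are not taken from the actual API response; the console itself is not written in Python, so this only shows the logic.

```python
from typing import Mapping, Optional

# Hypothetical keys for the "Received | New | Updated | Unchanged | Failed" columns.
LEGACY_COUNT_FIELDS = ("received", "new", "updated", "unchanged", "failed")


def has_legacy_counts(entry: Mapping[str, Optional[int]]) -> bool:
    """Return True if a history entry carries at least one non-zero legacy count.

    Missing keys, None and 0 are all treated as "nothing to show".
    """
    return any(entry.get(field) for field in LEGACY_COUNT_FIELDS)
```

Rows from after the switch-over, where every count is absent or zero, would then render without the five columns.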

muttcg commented 4 years ago

@MortenHofft The History buttons on the dataset page redirect to https://registry.gbif.org/dataset/7dff2b43-64f8-41f4-b022-8c371a6aef3f/process, which displays a 404. I presume they should redirect to https://registry.gbif.org/dataset/7dff2b43-64f8-41f4-b022-8c371a6aef3f/ingestion-history instead.
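One way to keep old links working would be to map the legacy path onto the new one. The sketch below only illustrates that mapping using the two paths quoted above; the function name and the idea of handling this as a redirect are assumptions, not a description of how the console or portal actually routes requests.

```python
import re
from typing import Optional

# Matches /dataset/<uuid>/process, with an optional trailing slash.
_LEGACY_PROCESS_PATH = re.compile(r"^/dataset/(?P<key>[0-9a-f-]{36})/process/?$")


def redirect_target(path: str) -> Optional[str]:
    """Return the ingestion-history path for a legacy /process path, else None."""
    match = _LEGACY_PROCESS_PATH.match(path)
    if match is None:
        return None
    return f"/dataset/{match.group('key')}/ingestion-history"
```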

MattBlissett commented 4 years ago

I noticed that and pushed a commit to portal16, but I'll leave it to Thomas or Morten to deploy it.

MattBlissett commented 4 years ago

> In case it's useful, both histories (old one and pipelines) are already merged in one endpoint in the API: https://api.gbif.org/v1/ingestion/history/9675f3d4-930d-4e4c-97e1-bcc7e6c5120d/3

I hadn't really appreciated this.

Maybe we remove dataset/crawl-history completely? Looking at a BioCASe dataset and a DWC one, other than nicer formatting I don't see anything in dataset/crawl-history that isn't also in dataset/ingestion-history.

A help icon (?) next to "Ingestion History" could show:

The ingestion history shows the GBIF system's attempts first to retrieve the data from the dataset endpoint, then to process it. For most datasets, this means downloading a single Darwin Core or ABCD archive file. For BioCASe protocol, DiGIR and TAPIR datasets, it requires making many search requests (“crawling”) until all data has been retrieved. Once the data is downloaded, a series of steps is run to process, interpret and index the data. On 21 November 2019, a new processing system for occurrences was introduced. The structure of the history information is different before this date.

@ahahn, would this be a reasonable description?