Open timrobertson100 opened 4 years ago
The "old" ingestion history will live on as the XML crawl information (successful/failed pages), and I think it is currently the only record of a complete failure (i.e. failure to retrieve the DWCA).
Certainly it should be the advanced option, with the pipelines one promoted to first place and renamed.
The old one is also useful for looking back at when a dataset became orphaned, and when the final reasonable ingestion was.
How about "Ingestion" for the pipeline stuff and "Crawling" for the "old" stuff, since it will still serve as the crawling infrastructure console?
So should we rename as below in translations and URLs?
In global drawer menu
On datasets
(Using new names)
[x] On the global "Running crawls" table, the columns from PFS to OE should be removed.
[x] Over-ingested datasets can just be removed from the menu. It might be brought back in some way, but currently can't show anything useful
[ ] On the dataset "Crawling history", the columns "Received | New | Updated | Unchanged | Failed" only make sense before sometime-in-December when we switched over, but are useful for those dates. Either hide the columns since then, or add a note to the page. The columns will all say 0 once message-based ingestion is switched off, and numbers since that day in December are irrelevant.
In case it's useful, both histories (old one and pipelines) are already merged in one endpoint in the API: https://api.gbif.org/v1/ingestion/history/9675f3d4-930d-4e4c-97e1-bcc7e6c5120d/3
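For anyone who wants to look at the merged history directly, here is a minimal Python sketch that builds the endpoint URL shown above and fetches it. Reading the trailing number as the crawl attempt is an assumption based on the example URL, the function names are made up for illustration, and nothing is assumed about the shape of the JSON that comes back.

```python
import json
from urllib.request import urlopen

API_BASE = "https://api.gbif.org/v1"


def ingestion_history_url(dataset_key, attempt):
    """Build the merged-history URL for one dataset.

    Assumption: the trailing path segment in the example URL is the attempt number.
    """
    return f"{API_BASE}/ingestion/history/{dataset_key}/{attempt}"


def fetch_ingestion_history(dataset_key, attempt, timeout=30.0):
    """Fetch the endpoint and decode the body as JSON.

    The response structure is not assumed here; the decoded value is returned as-is.
    """
    with urlopen(ingestion_history_url(dataset_key, attempt), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URL; call fetch_ingestion_history(...) to hit the network.
    print(ingestion_history_url("9675f3d4-930d-4e4c-97e1-bcc7e6c5120d", 3))
```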
> On the dataset "Crawling history", the columns "Received | New | Updated | Unchanged | Failed" only make sense before sometime-in-December when we switched over, but are useful for those dates. Either hide the columns since then, or add a note to the page. The columns will all say 0 once message-based ingestion is switched off, and numbers since that day in December are irrelevant.
They won't show if they aren't provided in the API as zeros
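A rough sketch of that behaviour, in Python rather than the console's own code: a cell is left blank when the API omits the count, and a literal zero is still shown. The field names are hypothetical stand-ins for whatever the API actually returns.

```python
# Hypothetical field names matching the "Received | New | Updated | Unchanged | Failed" columns.
COUNT_COLUMNS = ("received", "new", "updated", "unchanged", "failed")


def count_cells(entry):
    """Return one display string per count column for a single history entry.

    A missing (or null) count renders as an empty cell; a count of 0 is
    rendered as "0", so zeros only disappear if the API stops sending them.
    """
    return ["" if entry.get(col) is None else str(entry[col]) for col in COUNT_COLUMNS]


if __name__ == "__main__":
    old_entry = {"received": 120, "new": 100, "updated": 20, "unchanged": 0, "failed": 0}
    new_entry = {}  # entry where the API provides no counts at all
    print(count_cells(old_entry))
    print(count_cells(new_entry))
```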
@MortenHofft The History buttons on the dataset page redirect to https://registry.gbif.org/dataset/7dff2b43-64f8-41f4-b022-8c371a6aef3f/process, which displays a 404. I presume they should redirect to https://registry.gbif.org/dataset/7dff2b43-64f8-41f4-b022-8c371a6aef3f/ingestion-history
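To illustrate the presumed fix, a small Python sketch that rewrites the old `/process` path to `/ingestion-history`. This is not the portal16 code; the function name and the UUID pattern are assumptions made for the example.

```python
import re

# Matches /dataset/<36-character key>/process, with an optional trailing slash.
_LEGACY_PROCESS_PATH = re.compile(r"^(/dataset/[0-9a-fA-F-]{36})/process/?$")


def rewrite_legacy_history_path(path):
    """Map the old '/process' dataset path to '/ingestion-history'.

    Paths that do not match the old pattern are returned unchanged.
    """
    return _LEGACY_PROCESS_PATH.sub(r"\1/ingestion-history", path)


if __name__ == "__main__":
    print(rewrite_legacy_history_path("/dataset/7dff2b43-64f8-41f4-b022-8c371a6aef3f/process"))
```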
I noticed that and pushed a commit to portal16, but I'll leave it to Thomas or Morten to deploy it.
> In case it's useful, both histories (old one and pipelines) are already merged in one endpoint in the API: https://api.gbif.org/v1/ingestion/history/9675f3d4-930d-4e4c-97e1-bcc7e6c5120d/3
I hadn't really appreciated this.
Maybe we should remove dataset/crawl-history completely? Looking at a BioCASe dataset and a DWC one, other than nicer formatting I don't see anything in dataset/crawl-history that isn't also in dataset/ingestion-history.
A ?⃝ next to Ingestion History could add:
The ingestion history shows the GBIF system's attempts first to retrieve the data from the dataset endpoint, then to process it. For most datasets, this means downloading a single Darwin Core or ABCD archive file. For BioCASe, DiGIR and TAPIR datasets, it requires making many search requests (“crawling”) until all data has been retrieved. Once the data is downloaded, a series of steps is run to process, interpret and index the data. On 21 November 2019, a new processing system for occurrences was introduced. The structure of the history information is different before this date.
@ahahn, would this be a reasonable description?
We are now in the process of tearing down the old rabbit-based processing infrastructure.
To avoid confusion, I propose that the "old" ingestion history and monitoring are removed from the menus and the "pipeline-*" entries renamed to "Running ingestions" and "Ingestion history".
Thoughts?