IATI / IATI-Validator-Actual

GNU Affero General Public License v3.0
1 stars 0 forks source link

Time for validator report generation +52 hours? (Oxfam Novib) #16

Closed siemvaessen closed 3 years ago

siemvaessen commented 3 years ago

Noticed that it took +52 hours for the validator to process the new Oxfam Novib XML data (https://iativalidator.iatistandard.org/organisation/onl) downloaded from the Datastore. As Datastore depends on the new reports and a successful validation reports that go with it prior to any processing of new data, we need to understand if the assumption it took +52 hours for the validator to provide that report is correct. This ties into the process and idea Datastore is somehow late with processing, while it simply can not start processing unless it received a valid report! See this conversation: https://github.com/zimmerman-team/iati.cloud/issues/2504

File download (from Datastore):

iati.xml 2020-12-19 07:28 (GMT+1

Report generated:

2020-12-21 20:03 (GMT+1)

Screenshot 2020-12-21 at 21 35 36

cc @PetyaKangalova @rolfkleef

leostolk commented 3 years ago

@siemvaessen, Agreed, point taken, the validation should be faster. Point I'm trying to raise is that besides the 52 hours validation duration, there are 13 days of duration to be explained on the DS side. This morning 22.12 the last update in query builder return on ONL data set is still 06 12 2020 leo @PetyaKangalova

siemvaessen commented 3 years ago

@leostolk correct, ONL data set is 06 12 2020 not sure what the issue is as new the latest ONL data was spotted by Datastore on said timestamp, 2020-12-19 07:28 (GMT+1) and onl data from 06 12 2020 being the dataset prior to the now latest. Once Datastore has spotted new data it gets offered to the Validator first, prior to any processing in Datastore (this is by design as per ToR etc.etc.)

So this means any calculations must be made from that timestamp (2020-12-19 07:28). It looks as if you're stating Datastore and Validator have been waiting for 13 days to do anything, which is factually incorrect.

Suppose the ONL data prior to timestamp (2020-12-19 07:28) was not 06 12 2020 but say 21-04-2020, would you then argue Datastore and Validator were still from 21-04-2020, like 7 months old and the Datastore and Validator 7 months behind? Surely not.

In this case it is expected behaviour that onl data in Datastore is 06 12 2020 (or even 21-04-2020 in my example ) and the new iati.xml is currently queued/being processed in Datastore now that is just received that validator message a few hours ago counting against that timestamp (2020-12-19 07:28). This means that Datastore started processing the latest onl file 2020-12-21 20:03 (GMT+1) after it received the validation report from the Validator, again as per design.

The iati_old.xml however was just processed a few hours ago(!), _which means the iatiold.xml took almost 3 days just to get a simple validation message to the Datastore. I'd argue this should not be the case.

iati_old.xml

2020-12-20 07:26 (GMT+1)

2020-12-22 01:21 (GMT+1)

One of points being that I expect the Datastore review to include a full Validator review as it 100% dependent on Validator functionality and feedback and now it looks like Datastore is bearing the brunt of slowness, which is not the case.

The Datastore review will imo be incomplete w/o the validator being taken into consideration with that work as well. Has been this requested by the IATI Board @leostolk ?

alexlydiate commented 3 years ago

The Downloaded date is simply a date given to the Validator by the Datastore, which does not necessarily or indeed often correspond with the time at which the Datastore first notifies the Validator that the file is available.

The Validator does not record the time at which it first receives notification from the Datastore that the file is available - but that would be a good idea, and we will implement and release that early in the New Year. We will replace the Downloaded date with the accurate time the Validator receives notification of the file from the Datastore, and this will allow us to accurately measure the time it takes the Validator to produce each report.

In the meantime, happy Christmas :)

siemvaessen commented 3 years ago

Thanks @alexlydiate and happy Christmas as well.

On the matter of the +52 hours processing time for data validation and reporting, do you have an idea why this taking so long?

leostolk commented 3 years ago

thanks @alexlydiate and @siemvaessen

ONL activity file iati.xml (https://iatiregistry.org/dataset/onl-activity) is updated every night at 23.00h and pushed to our website 4 hours later at aprox 3.00 in the morning.

ONL teams work in SAP in multiple countries, some even on saturdays and sundays. They alter projects, add new projects, new transactions and SAP gives every activity a new last updated date stamp. IATI Dashboard shows increase in number of activities between 06.12 and 18.12

So in my understanding our https://iatiregistry.org/dataset/onl-activity activity file changes every day.

Registry and dashboard and d-portal acknowledge these changes within less than 12 hours.

The other ONL activity file (https://iatiregistry.org/dataset/onl-activity-old) is more static, although any improvements in IATI export program also changes that file.

I would expect that the combined DS-Val construct, see these changes too and act upon them.

happy festive season to all of you !

leo

alexlydiate commented 3 years ago

@siemvaessen As mentioned, from the Validator perspective we don't actually know that was 52 hours, because we don't know when the Datastore reported the file to the Validator.

We know, and it's well documented, that in the current architecture all reports are produced in series. Therefore if a lot of files are reported at the same time, the queue extends. We are covering old ground here, but as I know both yourself and @leostolk are aware, we're looking to see how we can solve this by processing in parallel with the next version of the Validator.

When we start recording the time at which files are reported to the Validator by the Datastore we'll be able to gauge the time the Validator takes to process files from receipt from the Datastore. Again, as mentioned, that will be early in the New Year. So, let's revisit this when we have those accurate timings.

siemvaessen commented 3 years ago

@alexlydiate how should users interpret the Validator columns then?

I assumed (correct me if I am wrong) that the column 'First Downloaded' means Validator download file from DS location @ 2020-12-19 07:28 (GMT+1) and the column 'Report generated' (including the report link itself) has been generated @ 2020-12-21 20:03 (GMT+1) - time in between those 2 timestamps = approx 52 hours with validation process in progress. Hence the question.

Could you confirm this assumption is correct and if not how users are supposed to read the columns provided in the Validator on org. pages like this https://iativalidator.iatistandard.org/organisation/onl ? Thanks.

Screenshot 2020-12-21 at 21 35 36
alexlydiate commented 3 years ago

That is incorrect - it is as I have stated above:

"The Downloaded date is simply a date given to the Validator by the Datastore, which does not necessarily or indeed often correspond with the time at which the Datastore first notifies the Validator that the file is available."

So, to absolutely clarify, the "First Downloaded date" as displayed is generated by the Datastore, not by the Validator.

It is, specifically, the "date_updated" field as returned by the Datastore API for that file - see the result here for an example: https://iati.cloud/api/datasets/?format=json&page=1&page_size=1.

Indeed, we'd absolutely agree that is not particularly useful information for the end user, and that the Validator should only show dates that it properly understands, and not display dates as received from the Datastore API. As such, we are going to make the changes as stated above in January, and that "First downloaded" column will be replaced by a "Received From Datastore" column.

That is both as much clarity and as much information as I have at this time, but rest assured we will get much greater clarity on this situation following that January release.

alexlydiate commented 3 years ago

Validator Server 1.0.3 and Web 1.4.2 are now deployed to the production site, which includes recording the time it receives the file from the Datastore and communicating that on the organisation pages - see https://iativalidator.iatistandard.org/organisation/aics for a working example (as of 11 January 09:05, screenshot attached)

That illustrates the way it works quite nicely - the Validator received 83 files at 17:46 from the Datastore API (remember, it polls that API once an hour). Those were queued in the order they were received, began processing at 17:50 and completed 19:04, taking a total of 74 minutes - so, just under 1 minute per file. aics

andylolz commented 3 years ago

This is great and super helpful – thank you @alexlydiate.

Apologies for asking here… Is there a box and arrow diagram that illustrates registry-datastore-validator processing? Even a photo of a sketch would be useful, or some other form of documentation.

If this doesn’t already exist, please do not worry, and thanks either way!

alexlydiate commented 3 years ago

None of which I am aware.

andylolz commented 3 years ago

Kk, no worries. Thanks!

leostolk commented 3 years ago

Validator Server 1.0.3 and Web 1.4.2 are now deployed to the production site, which includes recording the time it receives the file from the Datastore and communicating that on the organisation pages - see https://iativalidator.iatistandard.org/organisation/aics for a working example (as of 11 January 09:05, screenshot attached)

That illustrates the way it works quite nicely - the Validator received 83 files at 17:46 from the Datastore API (remember, it polls that API once an hour). Those were queued in the order they were received, began processing at 17:50 and completed 19:04, taking a total of 74 minutes - so, just under 1 minute per file. aics

Thanks @alexlydiate, great improvement ! super useful