IFRCGo / go-frontend

Appeal documents not showing on emergency pages #587

Closed ElsaRaunioIFRC closed 5 years ago

ElsaRaunioIFRC commented 5 years ago

The appeal documents are not showing for recent emergencies.

For emergency appeals, we see the appeal documents for Bosnia and Herzegovina Population Movement (created in Deb 2018) but not for the Philippines Measles Outbreak (created in Feb 2019)

With DREF operations, it looks like none of the operations launched after 18 September 2018 have appeal documents. The last DREF operation with associated appeal documents is MDRZW013: Zimbabwe Cholera Outbreak (published on 18 September 2018). None of the DREFs published since then have associated appeal documents.

szabozoltan69 commented 5 years ago

Please share the URLs of the examples.

ElsaRaunioIFRC commented 5 years ago

Bosnia and Herzegovina Population Movement: https://go.ifrc.org/emergencies/3218#documents

Philippines Measles Outbreak: https://go.ifrc.org/emergencies/3446

In the case of Philippines, there is no section for the appeal documents. The following documents should be visible on the page:

https://www.ifrc.org/docs/Appeals/19/IBPHms08022019.pdf
http://adore.ifrc.org/Download.aspx?FileId=228846
http://adore.ifrc.org/Download.aspx?FileId=232224
https://www.ifrc.org/docs/appeals/Active/MDRPH032.pdf

szabozoltan69 commented 5 years ago

Yes, in the case of 3218 (Bosnia and Herzegovina) these appeals are joined in https://prddsgocdnapi.azureedge.net/admin/api/event/3218/change/ :
MDRBA011 - This has no documents
MDRBA010 - This has 2 DREF appeal documents
(Can be checked in https://prddsgocdnapi.azureedge.net/admin/api/appeal/?q=MDRBA01 )

In the case of 3446 (Philippines) only this appeal is joined in https://prddsgocdnapi.azureedge.net/admin/api/event/3446/change/ :
MDRPH032 - Has no appeal documents included
(Can be checked in https://prddsgocdnapi.azureedge.net/admin/api/appeal/?q=MDRPH032 )

szabozoltan69 commented 5 years ago

These four documents are mentioned as "to be seen":
https://www.ifrc.org/docs/Appeals/19/IBPHms08022019.pdf
http://adore.ifrc.org/Download.aspx?FileId=228846
http://adore.ifrc.org/Download.aspx?FileId=232224
https://www.ifrc.org/docs/appeals/Active/MDRPH032.pdf

In which appeal are they? (This appeal should contain the referenced "Philippines Measles Outbreak" in its "Event" field.)

I checked the appeal (and appeal-document) ingesting jobs, but they show no error.

1806 existing appeal documents
0 documents without appeals in system
...
3158 current appeals
Creating 0 new appeals
Updating 3157 existing appeals that have been modified

ElsaRaunioIFRC commented 5 years ago

Could there be a problem in the Adore end?

Until the problem is resolved, we will add the appeal documents manually, at least to the major disasters.

dereklieu commented 5 years ago

@ElsaRaunioIFRC @szabozoltan69 this is the data source that Go uses for appeal docs:

https://proxy.hxlstandard.org/data.json?url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1gJ4N_PYBqtwVuJ10d8zXWxQle_i84vDx5dHNBomYWdU%2Fedit%3Fusp%3Dsharing

It runs off of this google doc:

https://docs.google.com/spreadsheets/d/1gJ4N_PYBqtwVuJ10d8zXWxQle_i84vDx5dHNBomYWdU/edit#gid=0

The Philippines polio & measles appeal code is MDRPH032, which is missing from that data source. I would check on Adore and see why the appeal docs aren't making it to this data source. Once they do, they should be pulled in automatically.

SimonbJohnson commented 5 years ago

I believe you now have the script that was scraping the appeals document. I think the error is occurring as the script is using the current appeals from this list: https://docs.google.com/spreadsheets/d/19pBx2NpbgcLFeWoJGdCqECT2kw9O9_WmcZ3O41Sj4hU/edit#gid=0

which is no longer maintained after the switch to the permanent Go. Therefore it would update some on-going appeals, but not newly added ones. A temporary fix would be to update that spreadsheet. The ongoing fix would be to update the call to the Go API.

The call that needs to be replaced for the new API https://proxy.hxlstandard.org/data/edit?strip-headers=on&filter03=merge&merge-url03=https%3A//docs.google.com/spreadsheets/d/1rVAE8b3uC_XIqU-eapUGLU7orIzYSUmvlPm9tI0bCbU/edit%23gid%3D0&clean-date-tags01=%23date&filter02=select&merge-keys03=%23meta%2Bid&filter04=replace-map&force=on&filter05=merge&merge-tags03=%23meta%2Bcoverage%2C%23meta%2Bfunding&select-query02-01=%23date%2Bend%3E2016-10-11&cut-include-tags06=%23meta%2Bid&merge-keys05=%23country%2Bname&merge-tags05=%23country%2Bcode&filter01=clean&replace-map-url04=https%3A//docs.google.com/spreadsheets/d/1hTE0U3V8x18homc5KxfA7IIrv1Y9F1oulhJt0Z4z3zo/edit%3Fusp%3Dsharing&filter06=cut&merge-url05=https%3A//docs.google.com/spreadsheets/d/1GugpfyzridvfezFcDsl6dNlpZDqI8TQJw-Jx52obny8/edit%3Fusp%3Dsharing&url=https%3A//docs.google.com/spreadsheets/d/19pBx2NpbgcLFeWoJGdCqECT2kw9O9_WmcZ3O41Sj4hU/edit%23gid%3D0

szabozoltan69 commented 5 years ago

Thanks @SimonbJohnson. What I do not really understand is that the second URL also ends with the string 19pBx2NpbgcLFeWoJGdCqECT2kw9O9_WmcZ3O41Sj4hU, which identifies the unwanted (not maintained) spreadsheet. Are you sure that this long URL is the correct one?

SimonbJohnson commented 5 years ago

Apologies for my poor communication. The second URL is the old call that needs to be replaced with a call to the Go API to get the up-to-date list of current appeals.

szabozoltan69 commented 5 years ago

So could you please give the to-be URL, which I should insert into https://github.com/IFRCGo/go-api/blob/master/api/management/commands/ingest_appeal_docs.py ? Otherwise the question of the script that fills the Google Spreadsheet will not be solved this way. I mean: it's not a future-proof implementation to have a permanently running script that is outside the GO area.

dereklieu commented 5 years ago

@SimonbJohnson @szabozoltan69 I think I understand the miscommunication. Here is what has always happened with appeal docs:

Go scrapes the Google Sheet and saves that to its database, which is then available through the Go API and at go.ifrc.org.

When we built this, there was no API that served as the 'source of truth' for appeal docs, which is why we used the Google Sheet. It seems this is still the case.

Two options:

  1. Start maintaining the google doc again
  2. Turn off the scraper and start inputting appeal docs directly into Go

SimonbJohnson commented 5 years ago

I believe there might be a third/fourth option.

The workflow is currently: unmaintained spreadsheet of appeals -> scraper -> spreadsheet of docs -> Go

It could be quickly replaced with: Go API of appeals -> scraper -> spreadsheet of docs -> Go

With an end target of: Go API of appeals -> scraper -> Go

Does that make sense?

szabozoltan69 commented 5 years ago

@dereklieu @SimonbJohnson Does http://go-api.ifrc.org/api/appeals give appeal-doc information also? (My understanding was that it gives only appeal information, nothing about the docs.) Or is there another API that can be used to get appeal docs?

Regarding "Start maintaining the google doc again": https://docs.google.com/spreadsheets/d/1gJ4N_PYBqtwVuJ10d8zXWxQle_i84vDx5dHNBomYWdU is maintained in a sense (at least there are 2019 April records in it, so the script runs, probably on s.ifrcgo.org), but this script is erroneous. Not everything that is in https://www.ifrc.org/appeals appears there.

SimonbJohnson commented 5 years ago

As far as I know there is no API for the appeal docs and the scraper has to be used. I'm not 100% sure on this though.

dereklieu commented 5 years ago

@SimonbJohnson your short-term recommendation is already what Go is doing. The problem is that both the spreadsheet of appeals and the spreadsheet of docs are unmaintained.

Apologies, I haven't been as clear as I could have been earlier as I had not realized we were even talking about the spreadsheet of appeals. From the time I started working on it, Go has been using http://go-api.ifrc.org/api/appeals, not the spreadsheet. It only relies on the spreadsheet for docs.

Since the docs spreadsheet is unmaintained, we are having this discussion. Go has up-to-date appeals from the API, but as we've noted there is no API for documents, so IM needs to continue maintaining the docs spreadsheet, or recommend another data source or workflow.

This is the spreadsheet that must be maintained if we want to preserve the original workflow: https://docs.google.com/spreadsheets/d/1gJ4N_PYBqtwVuJ10d8zXWxQle_i84vDx5dHNBomYWdU/edit#gid=0

According to the history, the last edit to the docs spreadsheet by a human was @mmusori on Oct 3 2018. This would roughly align with when new appeal docs stopped appearing in Go.

For reference:

Docs data source for Go: spreadsheet
Docs scraper: code

Appeals data source for Go: appeals API
Appeals scraper: code

dereklieu commented 5 years ago

I would propose closing this ticket, with the recommendation for IM to either continue inputting appeal docs in Go directly, or start maintaining the docs spreadsheet again.

szabozoltan69 commented 5 years ago

There is one step forward: according to Simon there is a script (file_scraper.py) running on s.ifrcgo.org, https://pastebin.com/ieMe9yPc - this processes www.ifrc.org/en/publications-and-reports/appeals and writes into the above spreadsheet. It should be revised (and/or moved to the go.ifrc.org domain).

SimonbJohnson commented 5 years ago

@dereklieu - I think there is some confusion. The docs spreadsheet is still being updated, but there is an incorrect input from unmaintained appeals spreadsheet here (this is where the file scraper queries current appeals): https://docs.google.com/spreadsheets/d/1gJ4N_PYBqtwVuJ10d8zXWxQle_i84vDx5dHNBomYWdU/edit#gid=0

Updating the file scraper to point at the Go API rather than the above spreadsheet will start the workflow again.

Long term it should be all done via Go.

Immediate fix - update this spreadsheet: https://docs.google.com/spreadsheets/d/1gJ4N_PYBqtwVuJ10d8zXWxQle_i84vDx5dHNBomYWdU/edit#gid=0

Short term fix - let the file scraper point at Go API for current appeals

Long term fix - Combine file_scraper and Go document list importer into Go

szabozoltan69 commented 5 years ago

I think that this is the clue: "...this is where the file scraper queries current appeals: ... Spreadsheet..." In my opinion file_scraper queries appeals (and docs?) from here: http://www.ifrc.org/en/publications-and-reports/appeals (see line 29 of https://pastebin.com/ieMe9yPc ).

dereklieu commented 5 years ago

@SimonbJohnson @szabozoltan69 I am confused. I haven't interacted with this separate file scraper before. Apologies for my rantings. I likely won't be of any help debugging it; however, I believe migrating its functionality to Go would be beneficial, just to keep these things in one place.

SimonbJohnson commented 5 years ago

Line 21 is what needs to be replaced (it points to the old spreadsheet). It grabs the list of current appeals as a list of lists from the unmaintained spreadsheet, as below.

https://proxy.hxlstandard.org/data.json?strip-headers=on&filter03=merge&merge-url03=https%3A//docs.google.com/spreadsheets/d/1rVAE8b3uC_XIqU-eapUGLU7orIzYSUmvlPm9tI0bCbU/edit%23gid%3D0&clean-date-tags01=%23date&filter02=select&merge-keys03=%23meta%2Bid&filter04=replace-map&force=on&filter05=merge&merge-tags03=%23meta%2Bcoverage%2C%23meta%2Bfunding&select-query02-01=%23date%2Bend%3E2016-10-11&cut-include-tags06=%23meta%2Bid&merge-keys05=%23country%2Bname&merge-tags05=%23country%2Bcode&filter01=clean&replace-map-url04=https%3A//docs.google.com/spreadsheets/d/1hTE0U3V8x18homc5KxfA7IIrv1Y9F1oulhJt0Z4z3zo/edit%3Fusp%3Dsharing&filter06=cut&merge-url05=https%3A//docs.google.com/spreadsheets/d/1GugpfyzridvfezFcDsl6dNlpZDqI8TQJw-Jx52obny8/edit%3Fusp%3Dsharing&url=https%3A//docs.google.com/spreadsheets/d/19pBx2NpbgcLFeWoJGdCqECT2kw9O9_WmcZ3O41Sj4hU/edit%23gid%3D0

This is then used by the file scraper to populate the appeals doc spreadsheet by querying the website.

If we can replace the above with the actual current list from the Go API, it will update the Docs spreadsheet. Ideally we would want it to populate Go directly and so merge this script with the second one that imports the docs spreadsheet.

szabozoltan69 commented 5 years ago

@SimonbJohnson @dereklieu I agree that migrating file_scraper's functionality to Go would be beneficial, just to keep these things in one place. On the other hand I don't like the word: 'unmaintained' spreadsheet. It is not unmaintained, in the sense that there are 2019 April records in it, inserted by the file_scraper. So it is heavily used (also by ingest_appeal_docs)

szabozoltan69 commented 5 years ago

(So – simply migrating the file_scraper will not solve our problem.)

SimonbJohnson commented 5 years ago

The unmaintained spreadsheet is the appeals list. It hasn't been updated since migrating to Go in September 2018 (previously there was no API available): https://docs.google.com/spreadsheets/d/19pBx2NpbgcLFeWoJGdCqECT2kw9O9_WmcZ3O41Sj4hU/edit#gid=0

So the appeals doc spreadsheet will update for ongoing appeals that started before September 2018, but not for appeals that started after.

To solve the problem the file_scraper needs to point to the Go API to get the updated list of current appeals.
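To illustrate why appeals created after the old list stopped changing are skipped, here is a minimal sketch. The function name is hypothetical, and which codes sit in which list is assumed for illustration (the codes themselves are taken from this thread). It compares a frozen appeals list with a current one and reports the codes the scraper would never be asked to visit.

```python
def appeals_missed_by_scraper(stale_codes, current_codes):
    """Return appeal codes present in the current list but absent from the stale one.

    The scraper only visits the codes it is given, so these are the appeals
    whose documents would never reach the docs spreadsheet.
    """
    stale = set(stale_codes)
    # Keep the order of the current list so the result is predictable.
    return [code for code in current_codes if code not in stale]


# Assumed snapshot of the old appeals spreadsheet (frozen in September 2018).
stale_list = ["MDRBA010", "MDRZW013"]
# Assumed current list, as an up-to-date source might report it.
current_list = ["MDRBA010", "MDRZW013", "MDRPH032"]

missed = appeals_missed_by_scraper(stale_list, current_list)
print(missed)
```

Under these assumed inputs, MDRPH032 is the only code the scraper never receives, which matches the symptom reported for the Philippines emergency page.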

szabozoltan69 commented 5 years ago

Appeals are not in scope now (we have a fresh script system to process them). The focus is on appeal-docs.

@SimonbJohnson, do you have control over the script file_scraper.py (on s.ifrcgo.org)? Could you please share with me the crontab details concerning it (maybe in private)? And could you stop it from running when we decide to?

SimonbJohnson commented 5 years ago

Yes, agree not in scope, but the file_scraper is still pointing at the unmaintained appeals doc rather than the Go API on line 21 which causes the problem. The minimum that needs to be updated is line 21, but better to consolidate into 1 script rather than 2 scripts plus google spreadsheet.

The script is run by a bash script in the daily cron folder. Happy to stop it anytime.

szabozoltan69 commented 5 years ago

Go API does not contain information about appeal_docs, unfortunately. So we have to find a way to scan http://www.ifrc.org/en/publications-and-reports/appeals in the future also.

SimonbJohnson commented 5 years ago

I think there is still confusion about the process.

The file_scraper still reads the file from the page ( http://www.ifrc.org/en/publications-and-reports/appeals) correctly as far as I can tell (it ran with no errors at least).

The process has broken down because the input to the file_scraper (the list of current appeals) is incorrect. Every day it gets passed a list from the old spreadsheet (https://docs.google.com/spreadsheets/d/19pBx2NpbgcLFeWoJGdCqECT2kw9O9_WmcZ3O41Sj4hU/edit#gid=0) that has been unmaintained since September, so the current appeals list it receives is old and it will only update the docs for appeals that existed at that date (September 2018).

Go does have an up-to-date list of appeals. So replacing line 21 with an appropriate Go API call (and maybe adjusting the script for the new input format) will get file_scraper working again with all current appeals.

The scraping should be working. The appeals list input into the script is old.
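The "adjusting the script for new input format" step could look roughly like the sketch below. The response shape is an assumption, not something confirmed in this thread: a JSON object with a "results" array whose items carry the appeal code under "code". The function name is hypothetical, and the output mimics the "list of lists" the old call is described as returning.

```python
import json


def appeal_rows_from_api(payload_text):
    """Convert an appeals API response body into a list of single-item rows.

    Assumes (unconfirmed) a JSON object with a "results" array whose items
    hold the appeal code under the key "code".
    """
    payload = json.loads(payload_text)
    rows = []
    for appeal in payload.get("results", []):
        code = appeal.get("code")
        if code:  # skip entries without a usable code
            rows.append([code])
    return rows


# Hypothetical response body, for illustration only.
sample = '{"results": [{"code": "MDRPH032"}, {"code": "MDRZW013"}, {"name": "no code"}]}'
rows = appeal_rows_from_api(sample)
print(rows)
```

Keeping the conversion in one small function means the rest of the scraper would not need to know where the appeal list came from.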

szabozoltan69 commented 5 years ago

Please forget referring to appeals. We need to fix ingesting appeal_docs only. I don't care about appeals, they work fine.

dereklieu commented 5 years ago

@SimonbJohnson I think I understand now what the issue is, but perhaps we should set up a call to get on the same page. @szabozoltan69 @SimonbJohnson I think it'll be quick ~10 minutes or less, maybe 11AM ET tomorrow?

SimonbJohnson commented 5 years ago

Yes that works for me.

dereklieu commented 5 years ago

Recap of our conversation:

  1. The appeal docs scraper depends on having an up-to-date list of appeal IDs. It uses this as parameters to scrape appeal docs.
  2. The appeal docs scraper gets information about appeals from one of the Google Sheets mentioned here, but that sheet is no longer maintained.
  3. Hence, the scraper is also not adding any new docs to the other Google Sheet.
  4. Short term, we can use the appeal API endpoint from Go to provide this list of up-to-date appeal IDs.
  5. Long term, we should rewrite the scraper to live within Go, and skip the Google Docs altogether. This means the current ingest_appeal_docs.py scraper will scrape https://www.ifrc.org/en/publications-and-reports/appeals/ directly, using the code that @SimonbJohnson has shared as a model.

szabozoltan69 commented 5 years ago

I've executed point 4 (the short-term solution), manually updating the Spreadsheet with 130 new rows (below 1808). My method was:

With this method it can happen that an appeal code was already in the spreadsheet but some of its appeal docs are missing (because an appeal can have more than one). If this kind of error comes up, I can insert the data manually.

ElsaRaunioIFRC commented 5 years ago

@szabozoltan69 I'm raising this issue again as I see that the appeal documents have not appeared on the latest emergency pages, e.g. Paraguay Floods and Cameroon Population Movement. Is there a problem with the above method?

szabozoltan69 commented 5 years ago

@ElsaRaunioIFRC My last solution (Apr 13) was a (one-time) short-term solution: manually collecting the information from https://www.ifrc.org/en/publications-and-reports/appeals and putting it into the table. The long-term solution was planned to be done by Derek, who left Development Seed last Friday. His colleague, Sanjay, has begun to work on this. So, long story short: currently there is a problem with this appeal-document ingestion.

szabozoltan69 commented 5 years ago

I can do another (one-time) manual collection, which will be ready in an hour, but this method is not really future-proof. Hopefully Sanjay will find a nicer solution in a few weeks.

szabozoltan69 commented 5 years ago

The manual correction is done, I've put in these documents:
Bangladesh - displacement due to embankment collap (MDRBD021)
Bolivia - Floods (MDRBO012)
Bosnia and Herzegovina - Population Movement (MDRBA011)
Cameroon - Population Movement (MDRCM027)
Comoros - Tropical Cyclone Kenneth (MDRKM007)
DPR Korea - Drought & Food Insecurity (MDRKP013)
Ethiopia - Population Movement (MDRET020)
Haiti - Earthquake (MDRHT015)
India - Cyclone Fani (MDRIN022)
Paraguay - Floods (MDRPY020)
Sri Lanka - Easter Sunday Attack (MDRLK009)
Syria - Floods (MDRSY004)
Tanzania - Tropical Cyclone Kenneth (MDRTZ023)

batpad commented 5 years ago

I have made some progress to modify the ingest_appeal_docs.py script in the GO-API to:

This should make the ingest_appeal_docs.py script self-contained and self-sustaining, with no dependency on Google Sheets being updated by different scripts running in different places.

I will push my branch in a bit and link it here so that @szabozoltan69 you can give it a test.

batpad commented 5 years ago

So this PR should remove the need for the manual update of Documents as well as the Google Documents that were being updated by different scripts: https://github.com/IFRCGo/go-api/pull/397

I have removed the dependencies on any Google Docs - now the ingest_appeal_docs script just gets the Appeal Codes from the database, and calls http://www.ifrc.org/en/publications-and-reports/appeals/ itself with each appeal code, gets the Document data, and adds the missing Documents to the database.

This should make it so that the Documents are then automatically updated whenever the ingest_appeal_docs cronjob runs on the server.

If this looks good, @szabozoltan69 we can coordinate on merging it and verifying that things run fine on production.
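The flow described in this comment (take the appeal codes, look up the documents for each code, add only the ones not yet stored) can be sketched roughly as follows. This is not the code from the PR: the function names, the `(title, url)` return shape, and the document titles are all hypothetical, and the page request is replaced by a stand-in so the sketch runs offline. The URLs are the ones quoted earlier in this thread.

```python
def ingest_missing_docs(appeal_codes, existing_urls, fetch_docs):
    """Return the documents to add: fetched per appeal code, minus those already stored.

    fetch_docs is a hypothetical callable that takes an appeal code and returns
    a list of (title, url) pairs; a real job would query the appeals page here.
    """
    seen = set(existing_urls)
    new_docs = []
    for code in appeal_codes:
        for title, url in fetch_docs(code):
            if url in seen:
                continue  # already stored, nothing to do
            seen.add(url)  # also guards against duplicates within one run
            new_docs.append({"appeal": code, "title": title, "url": url})
    return new_docs


# Stand-in for the real page request; titles are placeholders.
def fake_fetch(code):
    pages = {
        "MDRPH032": [
            ("Placeholder title A", "https://www.ifrc.org/docs/appeals/Active/MDRPH032.pdf"),
            ("Placeholder title B", "https://www.ifrc.org/docs/Appeals/19/IBPHms08022019.pdf"),
        ],
    }
    return pages.get(code, [])


already_stored = ["https://www.ifrc.org/docs/appeals/Active/MDRPH032.pdf"]
added = ingest_missing_docs(["MDRPH032", "MDRBA011"], already_stored, fake_fetch)
print(added)
```

Passing the fetcher in as an argument keeps the comparison logic testable without network access; the real lookup and the database write would replace `fake_fetch` and the returned list.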

szabozoltan69 commented 5 years ago

Thanks to @batpad (Sanjay), the task is done and works fine on production as well.

nanometrenat commented 4 years ago

968 is latest ticket re Appeal Documents