codeforcroatia / imamopravoznati-tjv

TJV Parser is a script that scrapes and parses the public authorities file and publishes it online in an open format
https://morph.io/codeforcroatia/imamopravoznati-tjv

Add website URL validation #6

Open schlos opened 4 years ago

schlos commented 4 years ago

Where to implement the change: the "Morph script", i.e. the script that parses the TJV register and persists it to the database: https://morph.io/SelectSoft/blue_gene

Current: Sometimes, due to human error, processing is done incorrectly, skipped, or stopped, so the resulting Morph.io database contains invalid records.

Expected: We want to make sure that the Morph.io file contains only valid items. Add a check on the Website field in the Morph.io scraper.

See also: https://stackoverflow.com/questions/7160737/python-how-to-validate-a-url-in-python-malformed-or-not and https://stackoverflow.com/questions/22238090/validating-urls-in-python

Add website URL validation

After scraping the public TJV register, add a RegEx check on the website field whenever its value is not null (!= null).

Based on the result of the RegEx:
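A minimal sketch of such a check, for illustration only. The pattern, the `website` key and the function name `validate_website` are assumptions and are not taken from the actual Morph script; the `website_validation_pass` field name is the one used later in this thread, and storing `None` for records without a website is an assumed convention.

```python
import re

# Hypothetical pattern, not the one used by the scraper:
# http(s) scheme, dotted host, optional port, optional path/query/fragment.
WEBSITE_RE = re.compile(
    r"^https?://"
    r"[^\s/?#:]+(?:\.[^\s/?#:]+)+"
    r"(?::\d{1,5})?"
    r"(?:[/?#]\S*)?$",
    re.IGNORECASE,
)


def validate_website(record):
    """Return a copy of record with website_validation_pass set.

    The flag is None when there is no website to check (assumed convention).
    """
    website = record.get("website")
    result = dict(record)
    if website is None:
        # Null values are skipped, as requested in the issue.
        result["website_validation_pass"] = None
    else:
        result["website_validation_pass"] = bool(WEBSITE_RE.match(website.strip()))
    return result
```

Keeping the flag in a separate field leaves the scraped value untouched, so invalid entries can still be reviewed by hand.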

schlos commented 4 years ago

@SelectSoft some web addresses fail validation only because they contain a long URL (i.e. a path to a page), e.g.

http://www.sunja.hr/ustanove/dječji-vrtic-bambi.html
https://zlatar.hr/galerija-izvorne-umjetnosti-zlatar/

Could you check whether the RegEx can account for this case as well and mark it as a success?
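One possible way to accept URLs with a path is to parse them with the standard library instead of matching the whole string, along the lines of the Stack Overflow answers linked above. This is only a sketch: the helper name `looks_like_website` is hypothetical, and the rule it applies (http or https scheme plus a host containing a dot) is an assumption about what should count as valid here.

```python
from urllib.parse import urlsplit


def looks_like_website(value):
    """Accept http(s) URLs with a dotted host; path, query and fragment are not restricted."""
    try:
        parts = urlsplit(value.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs, e.g. a bad IPv6 literal.
        return False
    host = parts.hostname or ""
    return parts.scheme in ("http", "https") and "." in host.strip(".")


for url in (
    "http://www.sunja.hr/ustanove/dječji-vrtic-bambi.html",
    "https://zlatar.hr/galerija-izvorne-umjetnosti-zlatar/",
):
    print(url, looks_like_website(url))
```

Because only the scheme and host are inspected, non-ASCII characters in the path (such as `č`) do not affect the result.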


Some values are marked as failed even though they are valid websites, for example:

https://inspektorat.gov.hr/
http://web.reakvarner.hr/

Could you check? Is it because of the trailing slash character?
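The trailing-slash hypothesis can be tested in isolation. The two patterns below are hypothetical and are not the RegEx used by the scraper: the first allows nothing after the host name, the second adds an optional path. If the scraper's pattern is anchored like the first one, a trailing slash alone would be enough to fail these addresses.

```python
import re

# Hypothetical strict pattern: nothing is allowed after the host name.
HOST_ONLY = re.compile(r"^https?://[\w.-]+\.[a-z]{2,}$", re.IGNORECASE)
# Same pattern with an optional path, which also covers a bare trailing slash.
WITH_PATH = re.compile(r"^https?://[\w.-]+\.[a-z]{2,}(?:/\S*)?$", re.IGNORECASE)

for url in ("https://inspektorat.gov.hr/", "http://web.reakvarner.hr/"):
    # Prints the URL, then the result for the strict and the relaxed pattern.
    print(url, bool(HOST_ONLY.match(url)), bool(WITH_PATH.match(url)))
```

Under these assumed patterns both addresses are rejected by `HOST_ONLY` and accepted by `WITH_PATH`.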


I have one more request, the same as in codeforcroatia/imamopravoznati-tjv#5:

In the fields

email_validation_pass | website_validation_pass | foi_officer_email_validation_pass

we currently have the following values:

Could we change the wording to use the same system? The expected values would be something like:

This new wording makes more sense, right?

Thanks!