codeforcroatia / imamopravoznati-tjv

TJV Parser is a script that scrapes and parses the public authorities file and publishes it online in an open format
https://morph.io/codeforcroatia/imamopravoznati-tjv

Add email validation #5

Open schlos opened 4 years ago

schlos commented 4 years ago

Where to implement the change: "Morph script", the script that parses the TJV register and persists it to the database: https://morph.io/SelectSoft/blue_gene

Current: Sometimes, due to human error, processing is done incorrectly, skipped or stopped, so the resulting Morph.io database contains invalid records.

Expected: Add checks in Morph.io scraper in each Email field:

See also: https://www.pythoncentral.io/how-to-validate-an-email-address-using-python/

Add email validation

Workflow:

After scraping the public TJV register, run a RegEx check on each email field whose value is not null (!= null).
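A minimal sketch of such a check, for illustration only. The pattern is deliberately loose (it is not a full RFC 5322 validator), and the function name `check_email` is hypothetical; the three labels "true", "nan" and "fail" are borrowed from the pseudo code further down in this thread.

```python
import re

# Loose pattern: something@something.tld, with no whitespace and a single "@".
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_email(value):
    """Return "nan" for a missing value, "true" for a plausible email, else "fail"."""
    if value is None or str(value).strip() == "":
        return "nan"  # nothing to validate
    return "true" if EMAIL_RE.match(str(value).strip()) else "fail"
```

Skipping empty values first means the RegEx only ever runs on real strings, which matches the "when value is not null" condition.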

Based on the result of the RegEx check, write to the status field for that processed row:

SelectSoft commented 4 years ago

IF all email fields passed the validation, write status: updated
IF all email fields failed the validation, write status: failed

If only one or two of them fail the validation, what should the status be: passed or failed?

schlos commented 4 years ago

@SelectSoft good point. I've updated definition as:

IF any of the email fields failed the validation, write status: failed
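For illustration, the updated rule could be sketched like this. The function name `row_status` is hypothetical, and the sketch assumes each field has already been labelled "true", "nan" or "fail" as in the pseudo code later in this thread. It also assumes that a row with no emails at all (every field "nan") counts as "updated", which the definition above does not state explicitly.

```python
def row_status(results):
    """results: iterable of per-field outcomes, each "true", "nan" or "fail"."""
    # A single failed field is enough to mark the whole row as failed.
    return "failed" if any(r == "fail" for r in results) else "updated"
```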

SelectSoft commented 4 years ago

ok

schlos commented 4 years ago

saifullahalam, Jun 10, 11:29 AM

I did the validation for mail, foi_officer_mail and website, but the number of attributes website_1, website_2, mail_1, mail_2, foi_officer_mail_1 and foi_officer_mail_2 is not fixed. Sometimes there are 2 and sometimes there are more than 2. I did not find a way to validate a changing number of variables.
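One possible way around the varying number of fields is to discover them by name instead of listing them. The sketch below is only an assumption about how a scraped row might look (a dict keyed by column name); the helper name `mail_fields` is hypothetical, and the pattern accepts both the `mail` and `email` spellings used in this thread.

```python
import re

# Matches mail, mail_1, email_2, foi_officer_mail, foi_officer_email_3, ...
MAIL_FIELD_RE = re.compile(r"^(foi_officer_)?e?mail(_\d+)?$")

def mail_fields(row):
    """Return the names of all email-like columns present in this row."""
    return sorted(key for key in row if MAIL_FIELD_RE.match(key))
```

The validation can then loop over `mail_fields(row)`, so a row with `mail_4` or `foi_officer_mail_5` needs no code change. Columns such as `email_status` are not matched, because the suffix must be an underscore followed by digits.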

schlos commented 4 years ago

New code for email validation: https://github.com/SelectSoft/blue_gene/blob/master/scraper.py#L217 to https://github.com/SelectSoft/blue_gene/blob/master/scraper.py#L225

schlos commented 4 years ago

@SelectSoft I've added this suggestion, but I don't know the correct syntax, so I wrote pseudo code to give you the general idea. Could you check if this is possible?

Current:

https://github.com/SelectSoft/blue_gene/blob/master/scraper.py#L217 to https://github.com/SelectSoft/blue_gene/blob/master/scraper.py#L225

Proposed change / pseudo code - to add validation for all 8 fields with email value:

    # The 8 email fields named in this issue. isValidEmail, allData and x
    # are assumed to exist already in scraper.py.
    EMAIL_FIELDS = [
        "email", "foi_officer_email",
        "email_1", "email_2", "email_3",
        "foi_officer_email_1", "foi_officer_email_2", "foi_officer_email_3",
    ]

    # One result per field: "true" (valid), "nan" (empty) or "fail" (invalid).
    validation_pass = {}
    for field in EMAIL_FIELDS:
        value = allData[field][x]
        if value is None or value == "":
            # Check for an empty value first, so the validator never sees null.
            validation_pass[field] = "nan"
        elif isValidEmail(value):
            validation_pass[field] = "true"
        else:
            validation_pass[field] = "fail"

    # Any failed field marks the whole row as failed.
    if "fail" in validation_pass.values():
        allData['email_status'][x] = "failed"
    else:
        allData['email_status'][x] = "updated"

schlos commented 4 years ago

@SelectSoft please check following:

The row with VAT number 37927943647 has an email in the field 'foi_officer_email' = '[CENSORED]@ekokong.hr', but the result in 'email_validation_pass' = 'nan' ('nan' means no email in any of the email fields).

There are multiple rows like this, where one email is present but the result is 'nan'. Could you check it out?

schlos commented 4 years ago

Actually, ignore my last comment. I see you've added an additional field named 'foi_officer_email_validation_pass' for validating this field. This looks fine.

schlos commented 4 years ago

@SelectSoft functionality-wise, everything looks good with the email validation.

I have an additional request. In the fields

email_validation_pass | website_validation_pass | foi_officer_email_validation_pass

we currently have the following values:

Could we change the wording so that all three fields use the same system? The expected values would be something like:

Thanks!