CivicActions / edscrapers

US Department of Education Data Scraping Kit; see https://us-ed-scraping.ckan.io/dataset
GNU Affero General Public License v3.0

A non-valid email is breaking the fetch stage of harvest #199

Closed: higorspinto closed this issue 4 years ago

higorspinto commented 4 years ago

The edgov harvest source has an invalid entry that breaks the fetch stage.

The data profile named presidents-fy-2010-budget-request-for-the-u-s-department-of-education contains an invalid HelpDesk Email value, "Office of Planning, Evaluation and Program Development (OPED)@ed.gov", which breaks the harvest process.

Acceptance Criteria

Tasks

Analysis

Contact emails are built from the target department of the transformer process (e.g. octae, ocr):

contactPoint['hasEmail'] = f'mailto:{target_dept}@ed.gov'

But in the edgov transformer, the publisher's name is used as the local part of the email address:

 contactPoint['hasEmail'] = f"mailto:{data['publisher']['name']}@ed.gov"

When the publisher's name is a short acronym that passes email validation (e.g. ocr, octae), this works fine. But when the publisher's name is a long free-text name containing spaces, commas, or parentheses, the resulting address fails email validation and the fetch stage breaks.
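To illustrate the difference between the two cases, here is a minimal sketch. The regular expression is only a hypothetical stand-in for whatever check the harvest fetch stage applies (the real validator is not shown in this issue and may be stricter or looser), and the helper names build_mailto and looks_valid are made up for this example.

```python
import re

# Hypothetical stand-in for the harvester's email check; the real
# validator used in the fetch stage may apply different rules.
SIMPLE_EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def build_mailto(publisher_name: str) -> str:
    """Compose the address the same way the edgov transformer does."""
    return f"mailto:{publisher_name}@ed.gov"


def looks_valid(mailto: str) -> bool:
    """Strip the mailto: prefix and test the address against the simple pattern."""
    prefix = "mailto:"
    address = mailto[len(prefix):] if mailto.startswith(prefix) else mailto
    return SIMPLE_EMAIL_RE.match(address) is not None


short_name = "ocr"
long_name = "Office of Planning, Evaluation and Program Development (OPED)"

short_ok = looks_valid(build_mailto(short_name))  # acronym-style name
long_ok = looks_valid(build_mailto(long_name))    # free-text name with spaces
```

Under this assumed pattern, the acronym-style name produces an address that matches, while the long name does not, because spaces, commas, and parentheses are not accepted in the local part.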

Solution

One solution that can be implemented entirely in the transformer step is to check the length of the publisher's name. If the name is long, we can assign a short name to use when composing the email, so that the address passes validation in the harvest fetch step.
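A rough sketch of what that check could look like in the transformer follows. Everything named here is an assumption made for illustration: the length cut-off MAX_NAME_LENGTH, the fallback short name "edgov", the idea of reusing a trailing acronym such as "(OPED)", and the helper names themselves. None of it is taken from the edscrapers code, and the sketch does not verify that the resulting mailbox actually exists.

```python
import re

MAX_NAME_LENGTH = 10     # assumed cut-off for what counts as a "long" name
FALLBACK_NAME = "edgov"  # hypothetical short name used when nothing better is found
LOCAL_PART_RE = re.compile(r"^[A-Za-z0-9._-]+$")


def email_local_part(publisher_name: str) -> str:
    """Return a short, validation-friendly local part for the contact email."""
    name = publisher_name.strip()
    # Short names made only of safe characters (e.g. "ocr") are used as-is.
    if len(name) <= MAX_NAME_LENGTH and LOCAL_PART_RE.match(name):
        return name.lower()
    # Assumption: a long name may end with its acronym in parentheses.
    acronym = re.search(r"\(([A-Za-z0-9]+)\)\s*$", name)
    if acronym:
        return acronym.group(1).lower()
    # Otherwise fall back to the generic short name.
    return FALLBACK_NAME


def build_contact_email(publisher_name: str) -> str:
    """Compose the hasEmail value from the shortened publisher name."""
    return f"mailto:{email_local_part(publisher_name)}@ed.gov"


contact_point = {}
contact_point["hasEmail"] = build_contact_email(
    "Office of Planning, Evaluation and Program Development (OPED)"
)
```

A fixed lookup table from full publisher names to short names would be an equally reasonable design if the set of publishers is small and known in advance.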

higorspinto commented 4 years ago

time spent: 6h