alan-turing-institute / defoe

Code to analyse books and newspapers data using Apache Spark.
MIT License

Fix Issue false positives #15

Closed mikej888 closed 5 years ago

mikej888 commented 5 years ago

If defoe.papers.issue.Issue fails to parse an XML document, an object is still built, with empty strings, lists etc. as fields and datetime.now() as the date. This can give misleading results when running queries in which such documents are encountered, e.g. the 2019 entry in the results below:

nohup spark-submit --py-files defoe.zip  defoe/run_query.py ~/data/papers/files.all.txt papers defoe.papers.queries.articles_per_year -n 144 > log.txt &
cat results.yml
{1714: 115, 1715: 302, ....,1950: 7155, 2019: 0}
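A minimal sketch of how such a false positive can arise. LenientIssue and articles_per_year here are hypothetical stand-ins, not the actual defoe classes or queries, and the XML layout and use of the standard-library ElementTree parser are assumptions made only for illustration:

```python
from collections import Counter
from datetime import datetime
import xml.etree.ElementTree as ET


class LenientIssue:
    """Hypothetical issue class that swallows parse errors (not defoe's code)."""

    def __init__(self, xml_text):
        try:
            root = ET.fromstring(xml_text)
            self.date = datetime.strptime(root.findtext("date"), "%Y-%m-%d")
            self.articles = root.findall("article")
        except ET.ParseError:
            # Fallback values hide the failure from any later query.
            self.date = datetime.now()
            self.articles = []


def articles_per_year(issues):
    """Count articles per year, trusting every issue object it is given."""
    counts = Counter()
    for issue in issues:
        # Adding 0 still creates the key, so a failed parse adds a bogus year.
        counts[issue.date.year] += len(issue.articles)
    return dict(counts)
```

Under these assumptions, an empty document yields an entry for the current year with a count of 0, alongside the genuine years.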

mikej888 commented 5 years ago

Commits 8c14a46ac6accd9e2446081999ef8640fac2cd65 to 39552d033be868ee1815a5a28e3e079a561bc465 refactor the code to create objects and then filter them according to whether object creation succeeded or failed. Files that give rise to failures are recorded, along with the error message, in a separate YAML file, e.g. rerunning the above:

cat results.yml
{1714: 115, 1715: 302, ..., 1950: 7155}
cat errors.yml     
- [/.../some-issue.xml,
  'Document is empty, line 1, column 1 (line 1)']
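The create-then-filter approach could look roughly like the sketch below. This is not the code from the commits above: parse_issue and articles_per_year_filtered are hypothetical names, the XML layout is assumed, and the standard-library ElementTree parser and plain lists are used in place of whatever parser and Spark operations defoe itself relies on:

```python
from collections import Counter
from datetime import datetime
import xml.etree.ElementTree as ET


def parse_issue(path, xml_text):
    """Return (path, issue, None) on success or (path, None, message) on failure."""
    try:
        root = ET.fromstring(xml_text)
        # findtext returns None for a missing element, which raises TypeError.
        date = datetime.strptime(root.findtext("date"), "%Y-%m-%d")
        issue = {"year": date.year, "articles": len(root.findall("article"))}
        return path, issue, None
    except (ET.ParseError, TypeError, ValueError) as exc:
        return path, None, str(exc)


def articles_per_year_filtered(documents):
    """Count articles per year over good issues; report failures separately."""
    parsed = [parse_issue(path, text) for path, text in documents]
    good = [issue for _, issue, error in parsed if error is None]
    # Each failure is kept as a [path, message] pair, ready to dump to YAML.
    errors = [[path, error] for path, _, error in parsed if error is not None]
    counts = Counter()
    for issue in good:
        counts[issue["year"]] += issue["articles"]
    return dict(counts), errors
```

With this shape, a file that fails to parse never reaches the query, so it cannot contribute a misleading year, and its path and error message stay available for a separate errors file.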