WorldModelers / Integration

This repository contains information related to World Modelers software integration.

Metadata in collections that needs to be standardized #8

Open kwalcock opened 5 years ago

kwalcock commented 5 years ago

These are fields that UA has at times included in the metadata. Except for the last one in the list, each is a subset of what was found in the PDFs (pdfinfo) plus what may have been provided in a spreadsheet, possibly with some renaming to align the PDF metadata with the data in the Doc17k collection.

Doc10 Set(personalAuthor_en, corporateAuthor_en, Famine Early Warning Systems Network (FEWS NET), title, Keywords, creation date)

Doc52 Set(publicationDate, personalAuthor_en, series_en, corporateAuthor_en, title, Keywords, creation date, vroy@fews.net)

Doc350 Set(title, publisherName, creation date)

Doc500 Set(title, publisherName, creation date)

Doc17k - these come not from PDFs but from a database Set(accessRights, DraftPages, localizedTranslationURL_ms, _dlc_DocIdItemGuid, AG, TitleNumbering, agrovoc_es, Author, defaultTranslationURL_cs, seriesName_ru, localizedTranslationURL_cs, division_es, series_id, defaultTranslationURL_ne, defaultTranslationURL_mg, gsaentity_google_type, thumb100, localizedTranslationURL_mg, region_zh, personalAuthor_es, Minor Version, RESPONSE_SENDER_NAME, localizedTranslationURL_de, session, GTS_PDFXConformance, ShareDoc, MAIL_MSG_ID1, department_zh, localizedCardURL, _EmailSubject, defaultTranslationURL_ko, defaultTranslationURL_zh, localizedTranslationURL_ko, Language, localizedTranslationURL_zh, defaultTranslationURL_fj, Maintained by, defaultTranslationURL_so, region_ru, ParagraphNumberingLegal, jobNumber, localizedTranslationURL_so, Major Version, localizedTranslationURL_to, docRepCollection, author_id, KeyWords, Title, localizedTranslationURL_hy, defaultTranslationURL_ta, defaultTranslationURL_sk, region_en, defaultTranslationURL_ba, seriesName_es, Subject, localizedTranslationURL_sm, localizedTranslationURL_ca, docType_en, defaultTranslationURL_ka, country_es, localizedTranslationURL_ka, agrovoc_id, defaultTranslationURL_ur, meeting_ru, localizedTranslationURL_ur, MTEquationSection, issn, Universal PDF, 213, localizedTranslationURL_mn, defaultTranslationURL_hi, corporateAuthor_id, Description, agrovoc_en, GTS_PDFXVersion, defaultTranslationURL_sr, department_ar, division_en, RapportAuteur, sharepoint_id, defaultTranslationURL_ar, thumb200, _PreviousAdHocReviewCycleID, pages, FilePreviewStatus, MAIL_MSG_ID2, series_ar, robots, division, defaultTranslationURL_ms, mobiUrl, series_es, AssocFileName, Last Modified, allLanguages, distribution, Version, _EmailStoreID, seriesName_id, ICNAppPlatform, defaultTranslationURL_sl, localizedTranslationURL_sl, localizedTranslationURL_rn, localizedTranslationURL_es, collection_ar, seriesDetail, note, Mendeley Citation Style_1, workUuid, 
country_ru, gsaentity_City, customTitle_es, AGA, seriesName_en, collection_es, ICNAppVersion, first_open, meeting_ar, SourceModified, defaultTranslationURL_dual, Symbol1, defaultTranslationURL_lo, country_en, region_id, localizedTranslationURL_ne, defaultTranslationURL_to, gsaentity_google_language, e-isbn, region_fr, gsaentity_file_type, UseDefaultLanguage, _DocHome, defaultTranslationURL_de, defaultTranslationURL_km, department_id, defaultTranslationURL_id, localizedTranslationURL_mk, PAA activities, PDFVersion, department_fr, Generator, localizedTranslationURL_dual, abstract_es, defaultTranslationURL_fr, defaultTranslationURL_sm, localizedTranslationURL_fr, defaultTranslationURL_et, cardURL, localizedTranslationURL_uk, Operator, corporateAuthor_ar, defaultTranslationURL_da, localizedTranslationURL_sw, localizedTranslationURL_da, customTitle_zh, localizedTranslationURL_et, sdg, title, Keywords, defaultTranslationURL_mt, DirectFormatting, DocumentToConvert, meeting_id, defaultTranslationURL_hy, Direction, division_fr, localizedTranslationURL_sk, localizedTranslationURL_ru, language, TransitPubID, docType_zh, author_en, defaultTranslationURL_hu, localizedTranslationURL_hu, gsaentity_google_lastmod, agrovoc_ru, collection_fr, ContentTypeId, edition, LinksUpToDate, localizedTranslationURL_en, division_ru, confNumber, abstract_zh, localizedTranslationURL_si, defaultTranslationURL_mn, _AdHocReviewCycleID, database_id, defaultLanguage, author, RapportTaalDocument, revision date, project name, _EmailEntryID, department_es, series, defaultTranslationURL_es, personalAuthor_en, defaultTranslationURL_sv, region_ar, country_id, HeaderDone, _AuthorEmailDisplayName, country_fr, _AuthorEmail, Trapped, defaultTranslationURL_th, docType_ar, PTEX.Fullbanner, Comments, DocType, defaultTranslationURL_no, localizedTranslationURL_no, DocSecurity, gsaentity_Country, abstract, agrovoc_ar, defaultTranslationURL_pt, localizedTranslationURL_pt, division_ar, gsaentity_Location, 
defaultTranslationURL_ky, localizedTranslationURL_lo, localizedTranslationURL_ky, RapportTitel, localizedTranslationURL_hr, series_zh, geoSelfGoverning, localizedTranslationURL_fa, Afdrukken, department_ru, ICNAppName, codeMantra, LLC, localizedTranslationURL_id, IniName, Your guide to the eatwellplate , localizedTranslationURL_vi, subtitle, series number, _ReviewCycleID, customTitle_fr, LastSaved, RapportDatum, defaultTranslationURL_ki, collection_zh, defaultTranslationURL_si, division_id, placeOfPublication, author_zh, description, meeting_zh, HyperlinksChanged, docType_id, series_en, Category, Docear4Word_StyleTitle, AuthoritativeDomain[2], docType_fr, customTitle_ru, Build, country_ar, collection_ru, defaultTranslationURL_el, keywords, localizedTranslationURL_is, localizedTranslationURL_el, defaultTranslationURL_te, localizedTranslationURL_te, Created, defaultTranslationURL_ml, sortpubdate, customTitle_en, agrovoc_zh, localizedTranslationURL_nl, GENERATOR, collection_en, localizedTranslationURL_ta, customTitle, isbn, CreationDate--Text, User, Division, corporateAuthor_zh, personalAuthor_zh, defaultTranslationURL_ja, ElsevierWebPDFSpecifications, gsaentity_country_content, defaultTranslationURL_he, abstract_ru, localizedTranslationURL_he, Translated, SjabloonVersieDatum, EcoNote, localizedTranslationURL_sv, uuid, defaultTranslationURL_it, localizedTranslationURL_it, Company, AppVersion, author_ar, abstract_en, region_es, Status, localizedTranslationURL_th, localizedTranslationURL_sr, doi, localizedTranslationURL_ar, WPS-JOURNALDOI, docType_es, seriesName_zh, geoNonSelfGoverning, defaultTranslationURL_fa, file_length, publisherName, OLV0_XMD_PAGE_COUNT, MTWinEqns, defaultTranslationURL_mk, otherEntitiesInvolved, CreatorVersion, defaultTranslationURL_vi, epubUrl, XPressPrivate, visibility, WPS-ARTICLEDOI, defaultTranslationURL_uk, meeting_es, defaultTranslationURL_sw, docType, personalAuthor_ar, series_fr, Creator, abstract_ar, defaultTranslationURL_ru, WkDocID, 
faoProject, defaultTranslationURL_pl, TaalDocument, publicationDate, localizedTranslationURL_pl, Prepared, defaultTranslationURL_is, e-issn, series_ru, collection_id, department_en, localizedTranslationURL_ki, MTEquationNumber2, corporateAuthor_es, defaultTranslationURL_sq, gsaentity_google_encoding, localizedTranslationURL_sq, defaultTranslationURL_nl, seriesName_ar, author_fr, docType_ru, meeting_fr, ScaleCrop, AuthoritativeDomain[1], division_zh, RapportVoettekst, localizedTranslationURL_km, _AssemblyLocation, _AssemblyName, defaultTranslationURL_ca, EMAIL_OWNER_ADDRESS, defaultTranslationURL_ro, gsaentity_doc_source, Base Target, localizedTranslationURL_ro, PXCViewerInfo, localizedTranslationURL_fj, WPS-PROCLEVEL, SOURCE, Type, localizedTranslationURL_mt, author_ru, defaultTranslationURL_lv, localizedTranslationURL_ml, localizedTranslationURL_lv, abstract_fr, agrovoc_fr, year, JobNo, cardText, LCID, meetingDocSymbol, localizedTranslationURL_ba, corporateAuthor_fr, meeting_en, personalAuthor_fr, Papiersoort, localizedTranslationURL_ja, homepage, defaultTranslationURL_tr, localizedTranslationURL_tr, creation date, database_en, gsaentity_file_type_content, project code, gsaentity_Date, ContentType, corporateAuthor_ru, localizedTranslationURL_hi, country_zh, publisher, personalAuthor_ru, author_es, defaultTranslationURL_hr, corporateAuthor_en, seriesName_fr, customTitle_ar, defaultTranslationURL_rn, Subjects, defaultTranslationURL_fi, localizedTranslationURL_fi, alternativeVersion, ADBE_ProducerDetails)
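
The sets above contain several spellings of what look like the same field (`Keywords`, `KeyWords`, `keywords`; `title`, `Title`; `creation date`, `Created`). A minimal Python sketch of one way to group such variants follows. The canonical names in `ALIASES`, the list of language suffixes in `LANGS`, and the function names are all assumptions for illustration; the thread does not define a target schema. The `_id` suffix is deliberately left alone because it could mean either "identifier" or a language code.

```python
import re
from collections import defaultdict

# Hypothetical canonical names; nothing in the thread fixes these.
ALIASES = {
    "keywords": "keywords",
    "title": "title",
    "creationdate": "creation_date",
    "created": "creation_date",
}

# Assumed language suffixes (the six that appear on fields like personalAuthor_*).
LANGS = ("ar", "en", "es", "fr", "ru", "zh")
LANG_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<lang>" + "|".join(LANGS) + r")$")


def normalize_key(key):
    """Return (canonical_name, language_or_None) for a raw metadata key."""
    lang = None
    match = LANG_SUFFIX.match(key)
    if match:
        key, lang = match.group("base"), match.group("lang")
    # Ignore case, spaces, underscores, and hyphens when looking up aliases.
    squashed = re.sub(r"[\s_\-]+", "", key).lower()
    # Unknown keys keep their original spelling (minus any language suffix).
    return ALIASES.get(squashed, key), lang


def group_keys(keys):
    """Map each canonical name to the set of raw spellings that produced it."""
    groups = defaultdict(set)
    for raw in keys:
        name, _lang = normalize_key(raw)
        groups[name].add(raw)
    return dict(groups)
```

For example, `group_keys(["Keywords", "KeyWords", "keywords"])` puts all three spellings under the single assumed name `keywords`, which makes the overlap between collections easier to see before deciding on real field names.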

kwalcock commented 5 years ago

These are typical keys in the PDFs:

Author CreationDate Creator Encrypted Filesize Form JavaScript Keywords ModDate Optimized Pagerot Pagesize Pages PDFversion Producer Subject Suspects Tagged Title UserProperties
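
As a rough illustration of how keys like these can be collected, here is a small Python sketch that parses `pdfinfo`-style "Key: value" text into a dictionary. It assumes the text has already been captured (for instance from the command's standard output); the function name `parse_pdfinfo` is hypothetical. Removing spaces from key names, so that `Page size` becomes `Pagesize` as in the list above, is also an assumption about how those keys were formed.

```python
def parse_pdfinfo(text):
    """Parse pdfinfo-style "Key: value" lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as dates contain colons.
        key, sep, value = line.partition(":")
        key = key.strip().replace(" ", "")  # assumed: "Page size" -> "Pagesize"
        if sep and key:
            info[key] = value.strip()
    return info
```

Lines without a colon are skipped, and a repeated key keeps the last value seen.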

brandomr commented 5 years ago

@kwalcock this is very helpful!

I'm curious which PDF extractor you used for the migration documents. I can get some pdfinfo-type data from the PDFs using Tika; it often has the author, but it doesn't give things like publisher or title.

kwalcock commented 5 years ago

Here are some with titles that were extracted with pdfinfo from the recent collection of 358 documents from the hackathon at https://drive.google.com/drive/u/2/folders/1H3a1hZESh9UADoejUVlUJDJLZOJLtSU7 .

DTM_South_Sudan_BalietCounty-_Upper_Nile_State_Village_Assessment_Survey_Apr-17.pdf Title: 20170830 Baliet VAS

DTM_South_Sudan_Wau_POCAA_Site_Rapid_Intentions_Survey_of_NewArrivals-_28_April_2017_May-17.pdf Title: 20170518_Wau_Intention_Survey_Report

DTM_South_Sudan_Wau_Town_AssessmentSurvey(VAS)_Nov-17.pdf Title: 20171204 Wau Town VAS.ai

Famine,__Northeast_Nigeria,_Somalia,_South_Sudan,_and_Yemen,_Thematic_Report_22-May-17.pdf Title: Présentation PowerPoint

FAO_WFP_CROP_AND_FOOD_SECURITY_ASSESSMENT_MISSION_TO_SOUTH_SUDAN_26-May-17.pdf Title: Special Report: FAO/WFP Crop and Food Security Assessment Mission to the South Sudan, 26 May 2016

More shortly about publisher...

kwalcock commented 5 years ago

It looks to me like publisherName always came from the spreadsheets that accompanied the PDFs or from the large Doc17k collection that didn't derive from PDFs. Although custom fields can be added to a PDF, I don't recall ever seeing a publisher or anything else there. Title is definitely in the standard list, though.

[Image: PDFProperties]

kwalcock commented 5 years ago

I hadn't looked very closely at the HTML documents in the recent collection, but they do contain some interesting data, sometimes including the publisher. Most of this information, with the important exception of anything related to document creation times, isn't especially important for reading, but more of it might interest someone who is searching for documents.

<meta name="keywords" content="cccm, iom, Displacement Tracking and Monitoring Unit, Camp Coordination and Camp Management, iom south sudan, International Organization for Migration, Displacement Tracking matrix, south sudan" />
<meta itemprop="description" content="Uganda&#39;s economy is being pushed to the wall following the renewed conflict in South Sudan that is currently sending waves across the region Uganda is one of South Sudan&#39;s biggest trading partners. The county&#39;s revenue body, Uganda Revenue Authority (URA), targets Shs200 million monthly in taxes at the South Sudan-Uganda border at Elegu." />
<meta itemprop="name" content="East Africa: South Sudan Conflict Hits Uganda's Economy" />
<meta name="description" content="Uganda&#39;s economy is being pushed to the wall following the renewed conflict in South Sudan that is currently sending waves across the region Uganda is one of South Sudan&#39;s biggest trading partners. The county&#39;s revenue body, Uganda Revenue Authority (URA), targets Shs200 million monthly in taxes at the South Sudan-Uganda border at Elegu." />
<meta name="keywords" content="Africa, news, politics, economy, trade, business, sports, current events, travel, Economy, Business and Finance, Conflict, Peace and Security, East Africa, South Sudan, Trade, Uganda" />
<meta name="twitter:description" content="Uganda&#39;s economy is being pushed to the wall following the renewed conflict in South Sudan that is currently sending waves across the region Uganda is one of South Sudan&#39;s biggest trading partners. The county&#39;s revenue body, Uganda Revenue Authority (URA), targets Shs200 million monthly in taxes at the South Sudan-Uganda border at Elegu." />
<meta name="twitter:site" content="@allafrica" />
<meta name="twitter:title" content="East Africa: Juba Conflict Hits Uganda's Economy" />
<meta property="article:modified_time" content="2017-05-11T05:57:57+0000" />
<meta property="article:published_time" content="2017-05-11T05:51:18+0000" />
<meta property="article:publisher" content="https://www.facebook.com/pages/allAfricacom/98946450029" />
<meta property="article:section" content="News" />
<meta property="article:tag" content="Economy, Business and Finance" />
<meta property="article:tag" content="Conflict, Peace and Security" />
<meta property="article:tag" content="East Africa" />
<meta property="article:tag" content="South Sudan" />
<meta property="article:tag" content="Trade" />
<meta property="article:tag" content="Uganda" />
<meta property="og:description" content="Uganda&#39;s economy is being pushed to the wall following the renewed conflict in South Sudan that is currently sending waves across the region Uganda is one of South Sudan&#39;s biggest trading partners. The county&#39;s revenue body, Uganda Revenue Authority (URA), targets Shs200 million monthly in taxes at the South Sudan-Uganda border at Elegu." />
<meta property="og:site_name" content="allAfrica.com" />
<meta property="og:title" content="East Africa: Juba Conflict Hits Uganda's Economy" />
<meta property="og:type" content="article" />
<meta name="syndication-source" content="http://www.thenewhumanitarian.org/opinion/2017/11/22/why-doesn-t-south-sudan-s-refugee-exodus-spur-east-africa-action-0" />
<meta name="description" content="If you thought the depopulation of South Sudan would propel its neighbours to act, think again!" />
<meta property="og:site_name" content="The New Humanitarian" />
<meta property="og:type" content="article" />
<meta property="og:title" content="Why doesn’t South Sudan’s refugee exodus spur East Africa to action?" />
<meta property="og:description" content="If you thought the depopulation of South Sudan would propel its neighbours to act, think again!" />
<meta property="og:updated_time" content="2019-04-16T18:27:03+01:00" />
<meta name="twitter:title" content="Why doesn’t South Sudan’s refugee exodus spur East Africa to action?" />
<meta name="twitter:description" content="If you thought the depopulation of South Sudan would propel its neighbours to act, think again!" />
<meta property="article:published_time" content="2017-11-22T12:46:26+00:00" />
<meta property="article:modified_time" content="2019-04-16T18:27:03+01:00" />
<meta itemprop="name" content="Why doesn’t South Sudan’s refugee exodus spur East Africa to action?" />
<meta itemprop="description" content="If you thought the depopulation of South Sudan would propel its neighbours to act, think again!" />
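
Tags like the ones above can be collected with very little code. The sketch below uses Python's standard-library `html.parser` rather than any particular third-party parser, and the names `MetaCollector` and `extract_meta` are hypothetical. It keys each tag by whichever of `name`, `property`, or `itemprop` is present and keeps a list of values, since keys such as `article:tag` repeat.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the content of <meta> tags, keyed by name/property/itemprop."""

    def __init__(self):
        super().__init__()
        self.meta = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are also routed here by default.
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property") or attrs.get("itemprop")
        content = attrs.get("content")
        if key and content is not None:
            self.meta[key].append(content)


def extract_meta(html):
    """Return {key: [content, ...]} for every keyed <meta> tag in the HTML."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return dict(parser.meta)
```

Because the pages in the collection come from different publishers, the resulting keys will differ from site to site; this only gathers them and does not attempt to reconcile `og:title`, `twitter:title`, and similar variants.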
brandomr commented 5 years ago

@kwalcock thanks for those examples of title extraction--I am able to recreate that. The HTML is interesting. I'm sure it could be parsed out, but since it's likely quite heterogeneous (in terms of schema), it might be a challenge better suited to the data collection efforts that are kicking off shortly @jgawrilo

brandomr commented 5 years ago

@kwalcock I've updated the Jupyter Notebook I used to index the hackathon documents to Elasticsearch to use 3 Python-based extraction options:

  1. Tika
  2. PyPDF2
  3. BeautifulSoup

Tika extracts pdfinfo, so I formed documents compliant with the schema outlined here, indexed the parsed documents to Elasticsearch, and stored the raw documents on S3.

There is obviously much more metadata that could be extracted from these, but I wanted to make some reference code available in the short term for performing extraction a few different ways.