alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

Reference documents from structured data scrapes #65

Closed: pudo closed 3 years ago

pudo commented 5 years ago

As a user, I want to be able to scrape a source which gives me both structured and unstructured data. For example, while scraping a procurement portal, I might want to download contract metadata, but also a contract document as a PDF file. While both things are possible in memorious, there is currently no way to import them into aleph such that the structured data record (e.g., a mapped Contract) refers to the ingested Document by its ID.

To solve this, we need some mechanism for importing both the structured and unstructured content into the same collection in such a way that structured entities can refer to the documents by their ID.

sunu commented 3 years ago

aleph_emit now stores the uploaded document's ID in the data as aleph_id, so other structured entities can refer to an ingested document by its ID.
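
For illustration, here is a minimal sketch of how a later stage might pick up that ID. It assumes the data dict carrying aleph_id is passed on to a stage that runs after the upload. The stage name link_contract, the helper build_contract_record, and the record fields title, source_url and document_id are hypothetical and not part of memorious; only the aleph_id key comes from the comment above.

```python
def build_contract_record(data):
    """Build a structured record that points at the ingested document.

    Hypothetical helper: the output field names are illustrative only.
    """
    # aleph_id is the key the comment above says aleph_emit stores.
    document_id = data.get("aleph_id")
    if document_id is None:
        raise ValueError("no aleph_id in data; was the document uploaded first?")
    return {
        "title": data.get("title"),
        "source_url": data.get("url"),
        # Reference to the ingested document, by its ID.
        "document_id": document_id,
    }


def link_contract(context, data):
    """Hypothetical stage using the (context, data) stage signature."""
    record = build_contract_record(data)
    # Hand the structured record to whatever stage comes next.
    context.emit(data=record)
```

How the emitted record is then mapped to an entity (for example a Contract) and loaded into the same collection depends on the rest of the crawler configuration, which this sketch does not cover.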