bmenn opened this issue 7 years ago
Is the work encapsulated by your issue description the following?
- Retrieval of data to be stored (presumably text-based webpage data)
For now, I think this is true. This might change if we encounter a page that is dynamically rendered.
- Storage of unmodified source webpage data
- Source data transformation
Yep
- Storage of data elements to be used in ht-analytics
anidata/ht-analytics is probably not going to be the primary consumer of this data for the time being. Data ingestion would come from anidata/palantiri or a future version of it. Data consumption will primarily be by the anidata/ht-archive project.
Data ingestion would come from anidata/palantiri or a future version of it
This issue would not encapsulate "Retrieval of data to be stored". Data retrieval would be a concern of palantiri instead. It may need to be modified so that it retrieves the source unmodified when crawling the web, if I understand how it works today.
So this issue would begin with the assumption that the webpage data is available and stored. The work begins at consumption of the webpage data source, for purposes of transformation, followed by persistence of the transformed data into another store that would be accessed by anidata/ht-archive. Am I closer?
Good catch. Yes, data retrieval would be the responsibility of palantiri. palantiri does need to be changed so it does not do any post-processing. Input for ht-etl would be the unmodified webpage source from palantiri, and the only requirements for output on ht-etl right now would come from ht-archive.
Since no major work is being done on palantiri right now, if you need to make any schema changes, go ahead with those under anidata/gcloud-infrastructure#2.
Instead of doing some of the heavy lifting and data processing in anidata/palantiri, anidata/ht-etl should do as much of the data processing post-hoc as possible. This allows for better flexibility long term in what analyses we can perform over the data we collect. It also makes backups easier to maintain, since the only critical data for backups is the webpages themselves. Everything else can be recomputed if necessary.
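To make the split concrete, here is a minimal sketch of what that post-hoc flow could look like: palantiri stores the unmodified page source, and ht-etl re-parses it on demand, so every derived record can be recomputed from the raw pages. All names here (`TextExtractor`, `transform`, the output record shape) are illustrative assumptions, not the actual ht-etl schema.

```python
# Hypothetical ht-etl transform step. Input: raw, unmodified HTML as
# stored by palantiri. Output: a derived record for ht-archive that
# can always be rebuilt from the raw page if the schema changes.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from raw HTML, skipping script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def transform(raw_html: str) -> dict:
    """One ETL pass: raw stored page -> structured record."""
    parser = TextExtractor()
    parser.feed(raw_html)
    return {"text": " ".join(parser.chunks)}


# Example raw page as palantiri might have stored it, unmodified:
page = (
    "<html><body><h1>Listing</h1>"
    "<script>var x = 1;</script>"
    "<p>Contact: 555-0100</p></body></html>"
)
record = transform(page)
```

Because `transform` only reads the stored source, changing what ht-archive needs just means rerunning it over the raw pages; nothing in the crawler has to change.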
Feature/functionality required: