bmenn opened this issue 7 years ago
Is the work encapsulated by your issue description the following?
- Retrieval of data to be stored (presumably text-based webpage data)
For now, I think this is true. This might change if we encounter a page that is dynamically rendered.
- Storage of unmodified source webpage data
- Source data transformation
Yep
- Storage of data elements to be used in ht-analytics
anidata/ht-analytics is probably not going to be the primary consumer of this data for the time being. Data ingestion would come from anidata/palantiri or a future version of it. Data consumption will primarily be by the anidata/ht-archive project.
Data ingestion would come from anidata/palantiri or a future version of it
This issue would not encapsulate "Retrieval of data to be stored". Data retrieval would be a concern of palantiri instead. It may need to be modified so that it retrieves the source unmodified when crawling the web, if I understand how it works today.
So this issue would begin with the assumption that the webpage data is available and stored. The work begins at consumption of the webpage data source, for purposes of transformation, followed by persistence of the transformed data into another store that would be accessed by anidata/ht-archive. Am I closer?
Good catch. Yes, data retrieval would be the responsibility of palantiri. palantiri does need to be changed so it does not do any post-processing. Input for ht-etl would be the unmodified webpage source from palantiri, and the only requirements for output on ht-etl right now would come from ht-archive.
Since no major work is being done on palantiri right now, if you need to make any schema changes, go ahead with those under anidata/gcloud-infrastructure#2.
Instead of doing some of the heavy lifting and data processing in anidata/palantiri, anidata/ht-etl should do as much of the data processing post-hoc as possible. This allows for better flexibility long term in what analyses we can perform over the data we collect. It also makes backups easier to maintain, since the only critical data for backups is the webpages themselves. Everything else can be recomputed if necessary.
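To make the split concrete, here is a minimal sketch of what that post-hoc flow could look like: palantiri stores the unmodified page source, and ht-etl re-parses it on demand, so every derived record can be recomputed from the raw pages. All names here (`TextExtractor`, `transform`, the output record shape) are illustrative assumptions, not the actual ht-etl schema.

```python
# Hypothetical ht-etl transform step. Input: raw, unmodified HTML as
# stored by palantiri. Output: a derived record for ht-archive that
# can always be rebuilt from the raw page if the schema changes.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from raw HTML, skipping script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def transform(raw_html: str) -> dict:
    """One ETL pass: raw stored page -> structured record."""
    parser = TextExtractor()
    parser.feed(raw_html)
    return {"text": " ".join(parser.chunks)}


# Example raw page as palantiri might have stored it, unmodified:
page = (
    "<html><body><h1>Listing</h1>"
    "<script>var x = 1;</script>"
    "<p>Contact: 555-0100</p></body></html>"
)
record = transform(page)
```

Because `transform` only reads the stored source, changing what ht-archive needs just means rerunning it over the raw pages; nothing in the crawler has to change.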
Feature/functionality required: