code-for-venezuela / c4v-py

Define training set schema #49

Open dieko95 opened 3 years ago

dieko95 commented 3 years ago

Problem

We have not yet defined the schema of the flattened dataset that the Hugging Face transformer will consume.

Proposed Solution

Define the schema of the training dataset that will be used to train the Hugging Face transformer.

For example:
- Column names: text, news_title, location, issue, source_type, author, etc.
- Whether each column is nullable
- Data type of each column (varchar, int, float, etc.)
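
As a starting point for discussion, the three items above could be captured in code roughly as follows. This is only a sketch: the column names come from the list above, but the types, the nullability flags, and the names `Column`, `SCHEMA`, and `validate_row` are assumptions made for illustration, not decisions the project has taken.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Column:
    """One column of the flattened training dataset."""
    name: str
    dtype: type      # Python type standing in for varchar/int/float
    nullable: bool


# Hypothetical schema: names are from the issue, types and nullability are guesses.
SCHEMA: List[Column] = [
    Column("text", str, nullable=False),
    Column("news_title", str, nullable=True),
    Column("location", str, nullable=True),
    Column("issue", str, nullable=True),
    Column("source_type", str, nullable=True),
    Column("author", str, nullable=True),
]


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return the schema violations found in one row (empty list if valid)."""
    errors = []
    for col in SCHEMA:
        value = row.get(col.name)
        if value is None:
            # A missing key and an explicit None are both treated as null.
            if not col.nullable:
                errors.append(f"{col.name}: missing but not nullable")
        elif not isinstance(value, col.dtype):
            errors.append(
                f"{col.name}: expected {col.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
    # Flag columns that the schema does not know about.
    known = {col.name for col in SCHEMA}
    errors.extend(f"{name}: not in schema" for name in sorted(set(row) - known))
    return errors
```

Calling `validate_row` on each scraped record before training would surface missing or mistyped fields early. The same `SCHEMA` list could also be rendered into the table for the readme.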

Deliverable

A readme.md documenting the dataset's schema.

marianelamin commented 3 years ago

From the first PoC with El Pitazo, it looks like we can get: title, content, date, author, categories, and tags. It would be good to check whether our next sources allow the same fields to be extracted.

In the meantime, for a VP we are counting on just the content of the post. Any change in design will be posted here.

dieko95 commented 3 years ago

@marianelamin Gotcha! Thanks a lot for the update 🙌