Here is the beginning of the PR discussed in the meeting. This PR includes:
the option to choose which column the metadata are extracted into
creation of that column if it does not exist
a new UrlProcessor to extract the URL in the same format as the other metadata
tests that extract several types of metadata in cascade, both (1) when they are extracted into different columns and (2) when they are extracted into the same column. Can you all check whether the extraction on this toy dataset suits you? :pray:
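To make the column behaviour above concrete, here is a minimal sketch in plain Python. It is not the code in this PR: the names `add_metadata` and `extract_url_metadata`, the dict keys, and the column names are all hypothetical, and it assumes each row stores its metadata as a list of dicts.

```python
from typing import Callable, Dict, List, Optional

Batch = Dict[str, List]


def extract_url_metadata(url: Optional[str]) -> List[dict]:
    """Hypothetical extractor: wrap a URL in an assumed metadata dict format."""
    if not url:
        return []
    return [{"key": "url", "type": "global", "value": url}]


def add_metadata(
    batch: Batch,
    extract: Callable[[Optional[str]], List[dict]],
    source_column: str = "url",
    target_column: str = "metadata",
) -> Batch:
    """Append extracted metadata to `target_column`, creating it if missing."""
    n_rows = len(batch[source_column])
    if target_column not in batch:
        # One independent list per row, so rows never share state.
        batch[target_column] = [[] for _ in range(n_rows)]
    for row_metadata, source in zip(batch[target_column], batch[source_column]):
        row_metadata.extend(extract(source))
    return batch


# Toy batch: the second row has no URL.
batch = {"url": ["https://example.com/a", None]}

# Case 1: extract into a dedicated column, created on the fly.
add_metadata(batch, extract_url_metadata, target_column="metadata_url")

# Case 2: cascade a second extraction into one shared column.
add_metadata(batch, extract_url_metadata, target_column="metadata")
add_metadata(batch, extract_url_metadata, target_column="metadata")
```

Appending to an existing list, rather than overwriting it, is what lets several processors run in cascade on the same column in this sketch.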
Ready for review:
Timestamp @cccntu
Website Desc @shanyas10
Entities @manandey (please note that I've added some mocks to test your entities extraction in our GitHub workflows)
Datasource & Generation Length @chkla (I've pinged you on some parts of the code because I've included some of the changes I suggested yesterday)
cc @timoschick (if you could do a deeper review of the HtmlProcessor and UrlProcessor parts, it would be a great help!)